WO2020091848A1 - Matrix multiplier with submatrix sequencing - Google Patents

Matrix multiplier with submatrix sequencing

Info

Publication number
WO2020091848A1
WO2020091848A1 PCT/US2019/037656 US2019037656W WO2020091848A1 WO 2020091848 A1 WO2020091848 A1 WO 2020091848A1 US 2019037656 W US2019037656 W US 2019037656W WO 2020091848 A1 WO2020091848 A1 WO 2020091848A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
submatrix
input register
multiply
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2019/037656
Other languages
English (en)
French (fr)
Inventor
Maxim V. KAZAKOV
Jian Mao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced Micro Devices Inc
Original Assignee
Advanced Micro Devices Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Micro Devices Inc
Priority to JP2021523783A (JP7461945B2)
Priority to EP19880374.4A (EP3891626A4)
Priority to CN201980077886.0A (CN113168430A)
Priority to KR1020217015589A (KR102586989B1)
Publication of WO2020091848A1
Anticipated expiration
Priority to JP2023065959A (JP2023089161A)
Ceased (current legal status)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00 Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26 Power supply means, e.g. regulation thereof
    • G06F1/32 Means for saving power
    • G06F1/3203 Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234 Power saving characterised by the action undertaken
    • G06F1/3243 Power saving in microcontroller unit
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • a processor can include a graphics processing unit (GPU).
  • the GPU includes specialized hardware to perform parallel processing for relatively large blocks of data. Accordingly, the GPU can support graphics applications, as well as other operations that require vector and matrix manipulation.
  • a GPU can include dedicated hardware to perform designated types of matrix operations, including matrix multiplication.
  • FIG. 1 is a block diagram of a GPU of a processor, the GPU configured to perform matrix multiplication by sequencing the application of submatrices to a matrix multiplier in accordance with some embodiments.
  • FIG. 2 is a diagram illustrating example matrices for multiplication at the GPU of FIG. 1 in accordance with some embodiments.
  • FIG. 3 is a diagram illustrating an example of sequencing the application of submatrices to the matrix multiplier of FIG. 1 in accordance with some embodiments.
  • FIG. 4 is a block diagram of additional aspects of the GPU of FIG. 1 supporting submatrix sequencing in accordance with some embodiments.
  • FIG. 5 is a flow diagram of a method of sequencing application of submatrices at a matrix multiplier of a GPU in accordance with some embodiments.
  • FIGs. 1-5 illustrate techniques for reducing power consumption at a graphics processing unit (GPU) of a processor by sequencing the application of submatrices at a matrix multiplier to reduce the number of input changes at an input register of the matrix multiplier.
  • the matrix multiplier is configured to perform a matrix multiplication for a relatively small matrix (e.g., a 4X4 matrix).
  • the GPU decomposes the larger matrices into smaller submatrices and stores the submatrices at input registers of the matrix multiplier in a sequence, thereby calculating each column of a result matrix.
  • FIG. 1 illustrates a GPU 100 of a processor that is configured to perform matrix multiplication by sequencing the application of submatrices in accordance with some embodiments.
  • the GPU 100 is part of a processor that is generally configured to execute sets of instructions in order to carry out operations on behalf of an electronic device.
  • the GPU 100 is part of an electronic device such as a desktop or laptop computer, a server, a handheld electronic device such as a smartphone or tablet, a game console, and the like.
  • the GPU 100 is generally configured to execute graphics and vector processing operations on behalf of the processor.
  • a central processing unit (CPU, not shown at FIG. 1 ) of the processor provides the GPU with sets of operations for execution, whereby the sets of operations are associated with graphics or vector processing.
  • the GPU 100 includes a plurality of Single-Instruction Multiple-Data (SIMD) processing units (e.g., SIMD units 102 and 104). It will be appreciated that the GPU 100 also includes additional modules to support the SIMD units, such as fetch and decode logic to fetch and decode instructions for the SIMD units, a register file to store operands for the SIMD units, and the like.
  • each SIMD unit includes a matrix multiplier together with corresponding input registers and a corresponding output register.
  • the SIMD unit 102 includes a matrix multiplier 110, input registers 106 and 107, and an output register 108.
  • the term “register” refers to any storage module that is configured to store matrices (including submatrices).
  • the matrix multiplier 110 is configured to multiply matrices stored at the registers 106 and 107 and store the resulting product at the register 108.
  • the generation of a single product for matrices at the input registers 106 and 107 is referred to herein as a “multiplication cycle” (also “multiply cycle”) for the matrix multiplier 110.
  • the SIMD unit 102 is clocked by a clock signal (designated “CLK”) and a multiply cycle of the matrix multiplier 110 corresponds to a single clock cycle of the CLK clock signal. That is, for a single clock cycle of the CLK clock signal, the matrix multiplier 110 is configured to generate a product at the register 108 based on input operands stored at the input registers 106 and 107.
  • in other embodiments, each multiply cycle of the matrix multiplier 110 requires multiple cycles of the CLK clock signal.
  • the matrix multiplier 110 is configured to generate a product for relatively small input matrices.
  • the matrix multiplier 110 is a 4X4X4 multiplier, such that the matrix multiplier 110 is configured to multiply a 4X4 matrix stored at the input register 106 with a 4X4 matrix stored at the input register 107 to generate a 4X4 product (result) matrix at the output register 108.
  • the CPU provides the GPU 100 with operations requiring multiplication of larger matrices, such as multiplication of 16X16 matrices.
  • the SIMD is configured to decompose the larger matrices into multiple smaller
  • the matrix multiplier 110 multiplies input matrices, designated matrix A, an MXK matrix, and matrix B, a KXN matrix, to calculate a result matrix R (an MXN matrix).
  • the matrices A and B are stored at the input registers 106 and 107, respectively, and the result matrix R is stored at the output register 108.
  • the matrix multiplier 110 calculates the result matrix R by summing the K outer products of column k of the A matrix and row k of the B matrix, as set forth by the following formula:
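The formula itself does not appear in this text. A reconstruction from the definitions above (A is MXK, B is KXN, R is MXN) would take the standard sum-of-outer-products form. The index notation, with A_{*,k} for column k of A and B_{k,*} for row k of B, is an assumption of this sketch rather than the patent's own notation:

```latex
R = \sum_{k=1}^{K} A_{*,k} \otimes B_{k,*},
\qquad
R_{i,j} = \sum_{k=1}^{K} A_{i,k}\, B_{k,j}
\quad (1 \le i \le M,\ 1 \le j \le N)
```

The element-wise form on the right is the same sum written per entry of R.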
  • the SIMD 102 decomposes the input matrices into smaller submatrices of the input size specified for the matrix multiplier 110, multiplies the submatrices at the matrix multiplier 110 to generate a set of intermediate results, and combines the intermediate results to determine the final result matrix R.
  • the matrix multiplier calculates the inner product
  • the SIMD 102 decomposes the input matrices into smaller submatrices, determines the products of different sets of the submatrices based on dot products of the different sets, then calculates the outer product for the resulting dot products to determine the final result matrix.
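The decompose, multiply, and combine steps described above can be sketched in plain Python. This is a minimal illustration, not the hardware design: the helper names (`tile`, `mul4`, `tiled_matmul`, `naive_matmul`) are hypothetical, `mul4` merely stands in for one multiply cycle of a 4X4X4 multiplier, and the matrices are assumed square with a size divisible by 4.

```python
# Hypothetical helper names; T = 4 mirrors the 4X4X4 multiplier described above.
T = 4

def tile(m, r, c):
    """Return the T x T submatrix at tile row r, tile column c of m."""
    return [row[c * T:(c + 1) * T] for row in m[r * T:(r + 1) * T]]

def mul4(a, b):
    """Stand-in for one multiply cycle: a plain T x T matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(T)) for j in range(T)]
            for i in range(T)]

def tiled_matmul(A, B):
    """Multiply square matrices by combining products of T x T submatrices."""
    n = len(A) // T  # tiles per side (assumes size divisible by T)
    R = [[0] * (n * T) for _ in range(n * T)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                p = mul4(tile(A, i, k), tile(B, k, j))  # intermediate result
                for x in range(T):
                    for y in range(T):
                        R[i * T + x][j * T + y] += p[x][y]  # combine
    return R

def naive_matmul(A, B):
    """Reference product used only to check the tiled version."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

# Two arbitrary 16X16 integer matrices.
A = [[(3 * i + j) % 7 for j in range(16)] for i in range(16)]
B = [[(i - 2 * j) % 5 for j in range(16)] for i in range(16)]
assert tiled_matmul(A, B) == naive_matmul(A, B)
```

The loop order here is arbitrary; the sequencing discussed in the text concerns which order of `(i, j, k)` keeps one input register unchanged for the longest run.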
  • calculating the different intermediate results requires changing the submatrices stored at the input registers 106 and 107.
  • each change in data stored at an input register consumes power at the GPU 100.
  • each change in data at the inputs of the corresponding arithmetic logic units (ALUs) or other modules of the matrix multiplier 110 consumes additional power, relative to maintaining the input data in an unchanged state.
  • the SIMD 102 sequences the storage of submatrices at the input registers 106 and 107 such that a submatrix is maintained at one of the input registers (e.g., register 107) for a plurality of successive multiply cycles, until that submatrix is no longer needed for calculation of the result matrix R. That is, the SIMD 102 sequences application of input submatrices at the input registers 106 and 107 to reduce the amount of input switching at one of the registers and, as a result, at one of the inputs of the matrix multiplier 110, thereby conserving power.
  • FIG. 2 illustrates an example of two 16X16 matrices 220 and 222, designated matrix A and matrix B, respectively.
  • Each of the matrices A and B includes 16 4X4 submatrices (e.g., submatrix 221 of matrix A).
  • the matrices A and B are multiplied at the GPU 100 to generate a result matrix 224, designated matrix R, which also includes a plurality of 4X4 submatrices.
  • the matrix R can be viewed as a set of columns of submatrices.
  • the first column of R is composed of submatrices R0,0, R1,0, R2,0, and R3,0.
  • the GPU 100 calculates the matrix R by calculating each column of submatrices of R, then concatenates the different columns to form the R matrix.
  • each column of submatrices of R is calculated concurrently at a different corresponding SIMD of the GPU 100, and one of the SIMDs then
  • to calculate a column of submatrices of R, the corresponding SIMD employs its matrix multiplier to determine a set of inner (dot) products for corresponding submatrices of the matrices A and B, then calculates outer products over the inner product results. For example, to generate the submatrix R0,0, the SIMD 102 performs the following calculations:
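The calculations themselves are missing from this text. For the 4-by-4 grid of submatrices shown in FIG. 2, ordinary block matrix multiplication would give the following, with indices written as (row, column). This is a reconstruction under that assumption, not the patent's own listing:

```latex
R_{0,0} = A_{0,0}B_{0,0} + A_{0,1}B_{1,0} + A_{0,2}B_{2,0} + A_{0,3}B_{3,0} \\
R_{1,0} = A_{1,0}B_{0,0} + A_{1,1}B_{1,0} + A_{1,2}B_{2,0} + A_{1,3}B_{3,0} \\
R_{i,0} = \sum_{k=0}^{3} A_{i,k}\,B_{k,0} \qquad (i = 0, \dots, 3)
```

Each B submatrix B_{k,0} appears once in every row of the column, which is the reuse that the sequencing exploits.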
  • the SIMD 102 performs analogous calculations to generate the submatrices R2,0 and R3,0.
  • to perform each multiplication for calculating a corresponding submatrix, the SIMD 102 loads the corresponding submatrices of matrix A and matrix B into the input registers 106 and 107, respectively, and the matrix multiplier 110 performs the multiplication, storing the result at the output register 108.
  • submatrices of the matrix B are reused to calculate different submatrices of the matrix R.
  • the SIMD 102 is configured to sequence the multiplications, so that the submatrices of the matrix B, as stored at the input register 107, remain unchanged over a plurality of successive multiplication cycles of the matrix multiplier 110. The SIMD 102 thereby reduces the number of loads to the input register 107 and changes of the input of the matrix multiplier 110, thus reducing power consumption.
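To make the reduction in loads concrete, the sketch below counts writes to each input register for one column of R under two orderings. The "output-major" ordering is a hypothetical baseline introduced here for comparison only (the text does not describe it), and the register labels `reg106` and `reg107` are simply names borrowed from the figure numerals. A load is counted whenever the submatrix needed differs from the one currently held.

```python
N = 4  # a 16X16 matrix holds a 4 x 4 grid of 4X4 submatrices

def count_loads(schedule):
    """Count writes to each input register for a list of (a_tile, b_tile) cycles."""
    loads = {"reg106": 0, "reg107": 0}
    held = {"reg106": None, "reg107": None}
    for a_tile, b_tile in schedule:
        for reg, wanted in (("reg106", a_tile), ("reg107", b_tile)):
            if held[reg] != wanted:  # register content must change
                held[reg] = wanted
                loads[reg] += 1
    return loads

# Hypothetical baseline: finish R[i,0] before starting R[i+1,0],
# so the B submatrix changes on every cycle.
output_major = [(("A", i, k), ("B", k, 0)) for i in range(N) for k in range(N)]

# Sequenced order: keep B[k,0] in place while every A[i,k] passes through.
b_held = [(("A", i, k), ("B", k, 0)) for k in range(N) for i in range(N)]

print(count_loads(output_major))  # {'reg106': 16, 'reg107': 16}
print(count_loads(b_held))        # {'reg106': 16, 'reg107': 4}
```

Under these assumptions both orderings perform the same 16 multiplications, but the sequenced order writes the B register 4 times instead of 16.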
  • FIG. 3 illustrates a set of successive multiplication cycles 301-305 and the corresponding contents of each of the input registers 106 and 107.
  • the SIMD 102 loads the submatrices A0,0 and B0,0 to the input registers 106 and 107, respectively.
  • the matrix multiplier 110 multiplies the submatrices to calculate an intermediate result for the first column of the result matrix R.
  • the SIMD 102 loads the submatrix A1,0 into the input register 106, but maintains the submatrix B0,0 at the input register 107.
  • the matrix multiplier 110 multiplies the submatrices to calculate another intermediate result for the first column of the result matrix R.
  • for the next multiplication cycle 303 the SIMD 102 loads the submatrix A2,0 into the input register 106, but maintains the submatrix B0,0 at the input register 107.
  • the matrix multiplier 110 multiplies the submatrices to calculate still another intermediate result for the first column of the result matrix R.
  • the SIMD 102 loads the submatrix A3,0 into the input register 106, but maintains the submatrix B0,0 at the input register 107.
  • the matrix multiplier 110 multiplies the submatrices to calculate another intermediate result for the first column of the result matrix R.
  • all calculations that require the submatrix B0,0 have been completed.
  • the SIMD 102 loads the submatrix A0,1 into the input register 106 and the submatrix B1,0 into the input register 107.
  • the SIMD 102 maintains the submatrix B0,0 at the input register 107 for four consecutive
  • the SIMD 102 continues executing multiplication operations at the matrix multiplier 110 and combining the resulting products to calculate the first column of the result matrix R.
  • the sequence of multiplications (including corresponding input matrices loaded and maintained at the input registers 106 and 107) is as follows:
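The sequence listing is not reproduced in this text. Extending the pattern of cycles 301-305 described above, a sketch that generates the order for one column of R might look as follows. The function name `column_schedule` and the cycle numbering from 1 are assumptions made for illustration; only the first five entries are confirmed by the description of FIG. 3.

```python
def column_schedule(col, n=4):
    """Yield (cycle, a_tile, b_tile, r_tile) for one column of R, holding B."""
    cycle = 1
    for k in range(n):      # B[k,col] is loaded once ...
        for i in range(n):  # ... and held while A[i,k] changes each cycle
            yield cycle, f"A{i},{k}", f"B{k},{col}", f"R{i},{col}"
            cycle += 1

for cycle, a, b, r in column_schedule(0):
    print(f"cycle {cycle:2d}: {a} x {b} -> accumulate into {r}")
```

The first five lines printed pair A0,0, A1,0, A2,0, A3,0 with B0,0 and then A0,1 with B1,0, matching the register contents described for cycles 301-305.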
  • the GPU 100 performs similar calculations to calculate the other columns of the result matrix R.
  • the GPU 100 employs a different SIMD to concurrently calculate a corresponding column of the result matrix R, and employs one of the SIMDs, or other module, to concatenate the different columns into the final result matrix R.
  • FIG. 4 illustrates additional aspects of the SIMD 102 of FIG. 1 to support sequencing of input submatrices for the matrix multiplier 110 in accordance with some embodiments.
  • the SIMD 102 includes a data store 435 connected to a sequencer 430.
  • the data store 435 is a buffer, cache, register file, or other memory structure configured to store submatrices (e.g., submatrix 433) for the matrix multiplier 110.
  • the sequencer 430 is a hardware module configured to decompose the input matrices 105 (matrix A and matrix B) into corresponding submatrices and store the submatrices at the data store 435.
  • the sequencer 430 is further configured to, for corresponding multiplication cycles, retrieve one or more submatrices from the data store 435 and load each retrieved submatrix to the corresponding one of the input registers 106 and 107.
  • the sequencer 430 thus controls the sequencing of input submatrices at the matrix multiplier 110 to carry out a matrix multiplication of a relatively large matrix.
  • FIG. 5 is a flow diagram of a method 500 of sequencing application of submatrices at a matrix multiplier of a GPU in accordance with some embodiments. For purposes of description, the method 500 is described with respect to an example implementation at the GPU 100 of FIG. 1.
  • the sequencer 430 loads the initial submatrices (e.g.
  • the matrix multiplier 110 multiplies the submatrices stored at the input registers 106 and 107 to generate a product and adds the result to the intermediate result for the corresponding column of the result matrix R, if any, as set forth above.
  • the method flow moves to block 506 and the sequencer 430 determines if the input submatrix at the input register 106
  • the method flow moves to block 508 and the sequencer 430 loads, to the input register 106, the submatrix of A corresponding to the current column (e.g., column 0) and the next row.
  • the submatrix of B stored at the input register 107 is maintained, thereby conserving power.
  • the method flow returns to block 504 and the matrix multiplier 110 executes the next multiply operation, that is, executes the next multiply cycle.
  • the method flow moves to block 510 and the sequencer 430 determines if the input submatrix stored at the input register 107 corresponds to the last row of the matrix B. If not, the method flow moves to block 512 and the sequencer 430 loads to the input register 107 the submatrix of B corresponding to the column of R that is being calculated. In addition, the sequencer 430 loads to the input register 106 the submatrix of A corresponding to the initial row (e.g., row 0) and the next column. The method flow returns to block 504 and the matrix multiplier 110 executes the next multiply operation.
  • the method flow moves to block 514 and the SIMD 102 stores the final result for the column of R.
  • the GPU 100 combines each of the calculated columns to generate the result matrix R.
  • the GPU 100 provides the result matrix R to a CPU for further processing.
  • the GPU 100 employs the result matrix R to, for example, generate one or more objects in a display frame, and provides the display frame to a frame buffer for display at a display device.
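The flow of method 500 for a single column of R can be sketched as a small loop, with comments mapping each step to the block numbers above. This is an illustrative software model, not the hardware sequencer: the names `tile`, `mul_acc`, and `compute_column` are hypothetical, and the condition used for block 506 (whether the A submatrix is in the last row) is an assumption inferred from block 508, because the description of block 506 is cut off in this text.

```python
T = 4  # tile edge, matching the 4X4 multiplier

def tile(m, r, c):
    """Return the T x T submatrix at tile row r, tile column c of m."""
    return [row[c * T:(c + 1) * T] for row in m[r * T:(r + 1) * T]]

def mul_acc(acc, a, b):
    """Block 504: multiply the two input registers and accumulate the product."""
    for i in range(T):
        for j in range(T):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(T))

def compute_column(A, B, col):
    """Compute one column of submatrices of R; also count loads of the B register."""
    n = len(A) // T
    acc = [[[0] * T for _ in range(T)] for _ in range(n)]  # one accumulator per R[i,col]
    row, k = 0, 0
    reg106, reg107 = tile(A, 0, 0), tile(B, 0, col)  # block 502: initial submatrices
    b_loads = 1
    while True:
        mul_acc(acc[row], reg106, reg107)  # block 504
        if row < n - 1:                    # block 506 (assumed): A not yet at last row
            row += 1
            reg106 = tile(A, row, k)       # block 508: B register left unchanged
        elif k < n - 1:                    # block 510: B not yet at last row
            k += 1
            row = 0
            reg107 = tile(B, k, col)       # block 512: next submatrix of B
            reg106 = tile(A, row, k)       # ... and A at the initial row, next column
            b_loads += 1
        else:
            return acc, b_loads            # block 514: column of R is complete

A = [[(i * 5 + j) % 9 for j in range(16)] for i in range(16)]
B = [[(i + 3 * j) % 7 for j in range(16)] for i in range(16)]
col_tiles, b_loads = compute_column(A, B, 0)

# Check against a direct computation of the first four columns of A x B.
expected = [[sum(A[i][k] * B[k][j] for k in range(16)) for j in range(T)]
            for i in range(16)]
assert [r for t in col_tiles for r in t] == expected
print(b_loads)  # 4
```

In this model the B register is written once per row of submatrices of B, so a 16X16 input needs four B loads per column of R.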
  • a method includes: for a first multiply cycle of a matrix multiplier of a graphics processing unit (GPU) multiplying a first matrix and a second matrix: multiplying a first submatrix of the first matrix stored at a first input register with a first submatrix of the second matrix stored at a second input register; for a second multiply cycle of the matrix multiplier, the second multiply cycle succeeding the first multiply cycle: multiplying the first submatrix of the first matrix stored at the first input register with a second submatrix of the second matrix stored at the second input register; and maintaining the first submatrix at the first input register for the first multiply cycle and the second multiply cycle.
  • the method includes: for a third multiply cycle of the matrix multiplier, the third multiply cycle succeeding the second multiply cycle: multiplying the first submatrix of the first matrix stored at the first input register with a second submatrix of the second matrix stored at the second input register; and maintaining the first submatrix at the first input register for the first multiply cycle, the second multiply cycle, and the third multiply cycle.
  • the first submatrix includes at least one non-zero element.
  • the method includes determining a product of the first matrix and the second matrix based on results of the first multiply cycle and the second multiply cycle, the product comprising a result matrix.
  • determining the product includes: determining a submatrix of the result matrix based on results of the first multiply cycle and the second multiply cycle.
  • the submatrix of the result matrix comprises one of a column and a row of the result matrix.
  • determining the product includes: determining an outer product based on results of the first multiply cycle and the second multiply cycle.
  • the method includes for a third multiply cycle of the matrix multiplier, the third multiply cycle succeeding the first multiply cycle: multiplying a second submatrix of the first matrix stored at the first input register with a second submatrix of the second matrix stored at the second input register; and changing the first submatrix of the first matrix to the second submatrix of the first matrix for the third multiply cycle.
  • a method includes: multiplying submatrices of a first matrix with submatrices of a second matrix at a matrix multiplier of a graphics processing unit (GPU) to determine a matrix product, wherein the multiplying includes: maintaining a first submatrix at a first input register of the matrix multiplier over a first plurality of multiply cycles.
  • the multiplying further includes: changing submatrices at a second input register of the matrix multiplier over the first plurality of multiply cycles.
  • the multiplying further includes: maintaining a second submatrix at the second input register of the matrix multiplier over a second plurality of multiply cycles.
  • at least one element of the first submatrix is a non-zero element.
  • a graphics processing unit includes: a first input register; a second input register; a matrix multiplier to multiply a submatrix stored at the first input register with a submatrix stored at the second input register; and a sequencer to control submatrices stored at the first input register and the second input register, the sequencer configured to: for a first multiply cycle of the matrix multiplier, store a first submatrix of a first matrix at the first input register and a first submatrix of a second matrix at the second input register; for a second multiply cycle of the matrix multiplier, maintain the first submatrix of the first matrix at the first input register and store a second submatrix of the second matrix at the second input register, the second multiply cycle succeeding the first multiply cycle.
  • the sequencer is configured to: for a third multiply cycle of the matrix multiplier, the third multiply cycle succeeding the first multiply cycle: maintain the first matrix stored at the first input register and store a second submatrix of the second matrix at the second input register.
  • the first submatrix includes at least one non-zero element.
  • the GPU is configured to: determine a product of the first matrix and the second matrix based on results of the first multiply cycle and the second multiply cycle, the product comprising a result matrix.
  • the GPU is configured to determine the product by: determining a submatrix of the result matrix based on results of the first multiply cycle and the second multiply cycle.
  • the submatrix of the result matrix comprises one of a column and a row of the result matrix.
  • the GPU is configured to determine the product by: determining an outer product based on results of the first multiply cycle and the second multiply cycle.
  • the sequencer is configured to: for a third multiply cycle of the matrix multiplier, the third multiply cycle succeeding the first multiply cycle: store a second submatrix of the first matrix at the first input register and a second submatrix of the second matrix at the second input register.
  • a computer readable storage medium may include any non-transitory storage medium, or combination of non-transitory storage media, accessible by a computer system during use to provide instructions and/or data to the computer system.
  • Such storage media can include, but is not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media.
  • the computer readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
  • certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software.
  • the software includes one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer readable storage medium.
  • the software can include the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above.
  • the non-transitory computer readable storage medium can include, for example, a magnetic or optical disk storage device, solid state storage devices such as Flash memory, a cache, random access memory (RAM) or other non-volatile memory device or devices, and the like.
  • the executable instructions stored on the non-transitory computer readable storage medium may be in source code, assembly language code, object code, or other instruction format that is interpreted or otherwise executable by one or more processors.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Advance Control (AREA)
  • Complex Calculations (AREA)
PCT/US2019/037656 2018-10-31 2019-06-18 Matrix multiplier with submatrix sequencing Ceased WO2020091848A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
JP2021523783A JP7461945B2 (ja) 2018-10-31 2019-06-18 部分行列の順序付けを伴う行列乗算器
EP19880374.4A EP3891626A4 (en) 2018-10-31 2019-06-18 MATRIX MULTIPLIER WITH SUBMATRIX SEQUENCING
CN201980077886.0A CN113168430A (zh) 2018-10-31 2019-06-18 带有子矩阵定序的矩阵乘法器
KR1020217015589A KR102586989B1 (ko) 2018-10-31 2019-06-18 부분 행렬 순서화를 이용하는 행렬 곱셈기
JP2023065959A JP2023089161A (ja) 2018-10-31 2023-04-13 部分行列の順序付けを伴う行列乗算器

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16/176,449 2018-10-31
US16/176,449 US11093580B2 (en) 2018-10-31 2018-10-31 Matrix multiplier with submatrix sequencing

Publications (1)

Publication Number Publication Date
WO2020091848A1 (en) 2020-05-07

Family

ID=70327188

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/037656 Ceased WO2020091848A1 (en) 2018-10-31 2019-06-18 Matrix multiplier with submatrix sequencing

Country Status (6)

Country Link
US (1) US11093580B2
EP (1) EP3891626A4
JP (2) JP7461945B2
KR (1) KR102586989B1
CN (1) CN113168430A
WO (1) WO2020091848A1

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021262970A1 (en) * 2020-06-26 2021-12-30 Advanced Micro Devices, Inc. Processing unit with small footprint arithmetic logic unit

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871236B (zh) * 2017-12-01 2025-05-06 超威半导体公司 具有低功率并行矩阵乘法流水线的流处理器
US20210303987A1 (en) * 2020-03-26 2021-09-30 Advanced Micro Devices, Inc. Power reduction for machine learning accelerator background
CN112429475B (zh) * 2020-09-29 2023-06-30 贵州大学 一种胶囊排序送料装置
CN112433760B (zh) * 2020-11-27 2022-09-23 海光信息技术股份有限公司 数据排序方法和数据排序电路
CN112632464B (zh) * 2020-12-28 2022-11-29 上海壁仞智能科技有限公司 用于处理数据的处理装置
US11556337B2 (en) 2021-04-12 2023-01-17 Analog Devices International Unlimited Company Parallel matrix multiplication technique optimized for memory fetches
CN117407640A (zh) * 2022-07-15 2024-01-16 华为技术有限公司 一种矩阵计算方法及装置
KR102640249B1 (ko) * 2023-06-12 2024-02-27 주식회사 하이퍼엑셀 대규모 언어 모델을 위해 멀티-디바이스에 기반한 추론을 수행하는 방법 및 시스템
CN119883379B (zh) * 2024-12-24 2025-11-18 深圳市鸿合创新信息技术有限责任公司 数据排序方法、装置、电子设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622037A (zh) * 2017-09-27 2018-01-23 郑州云海信息技术有限公司 一种提高图形处理单元的矩阵乘计算性能的方法和装置
US10032247B2 (en) * 2016-06-22 2018-07-24 Palo Alto Research Center Incorporated System and method for speeding up general matrix-vector multiplication on GPU
US20180246855A1 (en) * 2017-02-28 2018-08-30 Texas Instruments Incorporated Reconfigurable matrix multiplier system and method
US10067910B2 (en) * 2016-07-01 2018-09-04 Palo Alto Research Center Incorporated System and method for GPU maximum register count optimization applied to general matrix-matrix multiplication
CN108491359A (zh) * 2016-04-22 2018-09-04 北京中科寒武纪科技有限公司 子矩阵运算装置及方法

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CH594477A5 * 1976-08-20 1978-01-13 Agie Ag Ind Elektronik
JPH05324700A (ja) * 1992-05-19 1993-12-07 N T T Data Tsushin Kk 行列乗算装置
JP3935678B2 (ja) * 2001-01-31 2007-06-27 富士通株式会社 Simd積和演算方法、積和演算回路、および、半導体集積回路装置
US6901422B1 (en) * 2001-03-21 2005-05-31 Apple Computer, Inc. Matrix multiplication in a vector processing system
US20040122887A1 (en) * 2002-12-20 2004-06-24 Macy William W. Efficient multiplication of small matrices using SIMD registers
US20050240646A1 (en) * 2004-04-23 2005-10-27 The Research Foundation Of State University Of New York Reconfigurable matrix multiplier architecture and extended borrow parallel counter and small-multiplier circuits
US8051124B2 (en) * 2007-07-19 2011-11-01 Itt Manufacturing Enterprises, Inc. High speed and efficient matrix multiplication hardware module
US9354944B2 (en) * 2009-07-27 2016-05-31 Advanced Micro Devices, Inc. Mapping processing logic having data-parallel threads across processors
US8577951B1 (en) * 2010-08-19 2013-11-05 Altera Corporation Matrix operations in an integrated circuit device
US8862653B2 (en) * 2011-04-26 2014-10-14 University Of South Carolina System and method for sparse matrix vector multiplication processing
US9886418B2 (en) * 2015-04-28 2018-02-06 Intel Corporation Matrix operands for linear algebra operations
US10929944B2 (en) * 2016-11-23 2021-02-23 Advanced Micro Devices, Inc. Low power and low latency GPU coprocessor for persistent computing
JP6912703B2 (ja) * 2017-02-24 2021-08-04 富士通株式会社 演算方法、演算装置、演算プログラム及び演算システム
US10521225B2 (en) * 2017-06-29 2019-12-31 Oracle International Corporation Matrix multiplication at memory bandwidth

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491359A (zh) * 2016-04-22 2018-09-04 北京中科寒武纪科技有限公司 子矩阵运算装置及方法
US10032247B2 (en) * 2016-06-22 2018-07-24 Palo Alto Research Center Incorporated System and method for speeding up general matrix-vector multiplication on GPU
US10067910B2 (en) * 2016-07-01 2018-09-04 Palo Alto Research Center Incorporated System and method for GPU maximum register count optimization applied to general matrix-matrix multiplication
US20180246855A1 (en) * 2017-02-28 2018-08-30 Texas Instruments Incorporated Reconfigurable matrix multiplier system and method
CN107622037A (zh) * 2017-09-27 2018-01-23 郑州云海信息技术有限公司 一种提高图形处理单元的矩阵乘计算性能的方法和装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021262970A1 (en) * 2020-06-26 2021-12-30 Advanced Micro Devices, Inc. Processing unit with small footprint arithmetic logic unit
US11720328B2 (en) 2020-06-26 2023-08-08 Advanced Micro Devices, Inc. Processing unit with small footprint arithmetic logic unit

Also Published As

Publication number Publication date
JP2023089161A (ja) 2023-06-27
KR102586989B1 (ko) 2023-10-10
EP3891626A4 (en) 2022-08-10
US20200133991A1 (en) 2020-04-30
JP7461945B2 (ja) 2024-04-04
JP2022506418A (ja) 2022-01-17
KR20210071073A (ko) 2021-06-15
EP3891626A1 (en) 2021-10-13
US11093580B2 (en) 2021-08-17
CN113168430A (zh) 2021-07-23

Similar Documents

Publication Publication Date Title
US11093580B2 (en) Matrix multiplier with submatrix sequencing
JP7652507B2 (ja) Simd命令を用いた効率的な直接畳み込み
EP2521968B1 (en) Hardware for performing arithmetic operations
JP5866128B2 (ja) 算術プロセッサ
US11573765B2 (en) Fused convolution and batch normalization for neural networks
US12561393B2 (en) Pipelined matrix multiplication at a graphics processing unit
US11995149B2 (en) Sparse matrix-vector multiplication
US9436465B2 (en) Moving average processing in processor and processor
CN112446007B (zh) 一种矩阵运算方法、运算装置以及处理器
US11409840B2 (en) Dynamically adaptable arrays for vector and matrix operations
JP7646639B2 (ja) 柔軟な精度演算を用いた行列乗算器
CN114746840A (zh) 用于乘法和累加操作的处理器单元
US20230289191A1 (en) Vertical and horizontal broadcast of shared operands
CN117762492A (zh) 数据处理方法、装置、计算机设备及可读存储介质
JP6712052B2 (ja) 演算処理装置及び演算処理装置の制御方法
US20100115232A1 (en) Large integer support in vector operations
EP1936492A1 (en) SIMD processor with reduction unit
Bharathi et al. VLSI Synthesis of Multiply and Accumulate Structures Using Distributed Arithmetic
JP2020201659A (ja) 演算装置、演算方法、および演算プログラム
JP2010122741A (ja) データ処理装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19880374

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021523783

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20217015589

Country of ref document: KR

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2019880374

Country of ref document: EP

Effective date: 20210531

WWG Wipo information: grant in national office

Ref document number: 202117021860

Country of ref document: IN