CN106681694A - Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction - Google Patents

Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction

Info

Publication number
CN106681694A
CN106681694A CN201611260732.XA
Authority
CN
China
Prior art keywords
matrix
register
smb
sma
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611260732.XA
Other languages
Chinese (zh)
Inventor
谭光明
张秀霞
周可人
王朝尉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese Academy Of Sciences State Owned Assets Management Co ltd
Institute of Computing Technology of CAS
Original Assignee
Chinese Academy Of Sciences State Owned Assets Management Co ltd
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese Academy Of Sciences State Owned Assets Management Co ltd, Institute of Computing Technology of CAS filed Critical Chinese Academy Of Sciences State Owned Assets Management Co ltd
Priority to CN201611260732.XA priority Critical patent/CN106681694A/en
Publication of CN106681694A publication Critical patent/CN106681694A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 Register arrangements
    • G06F 9/3012 Organisation of register space, e.g. banked or distributed register file

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to a single-precision matrix multiplication optimization method based on NVIDIA Kepler GPU assembly instructions. The method comprises: blocking the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing a <bm,bn>-dimensional block of the output matrix C; creating four storage spaces smA, smB, smAx and smBx in GPU second-level storage; reading an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB; in each iteration, loading one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, reading the register contents and performing the matrix multiplication with fused multiply-add instructions, while concurrently reading one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx; and, after the multiplication of smA and smB is finished, swapping the addresses of smA and smAx and the addresses of smB and smBx.

Description

Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions
Technical field
The present invention relates to the fields of deep learning, high-performance computing and GPGPU programming, and more particularly to a single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions.
Background technology
A GPU (graphics processing unit) is a chip specialized for image and video processing. Owing to its design characteristics (simplified control logic, a large number of compute units), early GPUs were used only for graphics and image processing. As GPU chips grew more powerful, the GPU evolved toward the GPGPU (general-purpose GPU); that is, its generality improved greatly. At present, GPUs are widely used in embedded systems, intelligent terminals, PCs, workstations and other equipment. The Tesla series is a family of GPUs released by NVIDIA specifically for numerical computation; compared with ordinary GPUs, its floating-point performance is higher. For example, the Tesla K20m GPU has 13 SMs with 192 SP units per SM, and its peak performance can reach 3520 GFLOPS, exceeding most mainstream CPU chips.
SGEMM (single-precision general matrix multiplication) is an important function in the BLAS numerical library. On multi-core CPUs, math libraries such as MKL, ATLAS and OpenBLAS have tuned SGEMM very carefully. Recently, increasing attention has focused on the optimization of GEMM programs on GPUs. On the GPU, Cublas is the high-performance BLAS math library provided by NVIDIA, but Cublas has the following disadvantages:
1. It is not open source. The meaning of each assembly instruction cannot be understood directly. NVIDIA provides the PTX pseudo-assembly, which produces different assembly instructions depending on the GPU chip; however, because PTX is not the native instruction set, each part of the generated assembly cannot be controlled, and a single PTX instruction may sometimes generate multiple assembly instructions, which is inconvenient for optimization.
2. Its performance is low. An SGEMM function generally calls different processing routines according to whether the two input matrices are transposed (t denotes transposed, n denotes not transposed). The sub-routines nt, tt, tn and nn of Cublas reach at most only about 74% of peak performance, and as little as about 60%, so an SGEMM program of higher performance needs to be developed.
The content of the invention
In view of the shortcomings of the prior art, the present invention proposes a single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions.
A single-precision matrix multiplication optimization method based on NVIDIA Kepler GPU assembly instructions, comprising:
Step 1: block the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing the <bm,bn>-dimensional output matrix block of C;
Step 2: create four temporary storage spaces smA, smB, smAx and smBx in GPU second-level storage;
Step 3: read an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB;
Step 4: in each iteration, load one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, read the register contents and perform the matrix multiplication with fused multiply-add instructions; while the multiplication is being performed, read one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx;
Step 5: after the multiplication of smA and smB is finished, swap the addresses of smA and smAx, and swap the addresses of smB and smBx.
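The five steps above can be sketched as a host-side simulation (a minimal Python sketch, not GPU code; the function names `sgemm_double_buffered` and `matmul_block` are illustrative assumptions, and the sketch assumes M, N, K are divisible by bm, bn, bk). The shared-memory buffers smA/smB/smAx/smBx are modeled as plain lists, and the address swap of step 5 becomes a reference swap:

```python
def matmul_block(C, smA, smB, bm, bn, bk):
    # Step 4 core: C += smA (bm x bk) * smB (bk x bn), accumulated one
    # column of A / one row of B at a time, as the FFMA loop would.
    for k in range(bk):
        for i in range(bm):
            a = smA[i][k]  # one element of the A column held in "registers"
            for j in range(bn):
                C[i][j] += a * smB[k][j]  # fused multiply-add

def sgemm_double_buffered(A, B, bm, bn, bk):
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Step 1: each <bm, bn> block of C is processed independently.
    for bi in range(0, M, bm):
        for bj in range(0, N, bn):
            Cblk = [[0.0] * bn for _ in range(bm)]
            # Steps 2-3: fill the first pair of "shared memory" buffers.
            smA = [[A[bi + i][k] for k in range(bk)] for i in range(bm)]
            smB = [[B[k][bj + j] for j in range(bn)] for k in range(bk)]
            for k0 in range(0, K, bk):
                nxt = k0 + bk
                if nxt < K:  # Step 4: prefetch the next tiles into smAx/smBx
                    smAx = [[A[bi + i][nxt + k] for k in range(bk)]
                            for i in range(bm)]
                    smBx = [[B[nxt + k][bj + j] for j in range(bn)]
                            for k in range(bk)]
                matmul_block(Cblk, smA, smB, bm, bn, bk)
                if nxt < K:  # Step 5: swap buffer "addresses"
                    smA, smB = smAx, smBx
            for i in range(bm):
                for j in range(bn):
                    C[bi + i][bj + j] = Cblk[i][j]
    return C
```

On a real Kepler GPU the prefetch overlaps with the FFMA loop; the sequential sketch only models the data flow and buffer swapping.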
In the single-precision matrix multiplication optimization method, step 4 includes a register bank conflict elimination step to eliminate conflicts between registers when one column of matrix-A block data is loaded from smA into registers and one row of matrix-B block data is loaded from smB into registers; specifically, for instructions with multiple source registers, the source registers are allocated to different register banks.
In the method, the fused multiply-add instructions in step 4 access their operands in a mode that alternates single issue and dual issue.
In the method, step 4 further includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
In the method, a strategy of selecting memory access instructions of different bit widths is used when reading registers in step 4: data are read from global memory with the LDG.128 instruction, and from shared memory with the LDS.64 instruction.
The invention also proposes a single-precision matrix multiplication optimization system based on NVIDIA Kepler GPU assembly instructions, comprising:
a blocking module, for blocking the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing the <bm,bn>-dimensional output matrix block of C;
a creation module, for creating four temporary storage spaces smA, smB, smAx and smBx in GPU second-level storage;
a reading module, for reading an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB;
a computing module, for loading, in each iteration, one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, reading the register contents, performing the matrix multiplication with fused multiply-add instructions, and, while the multiplication is being performed, reading one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx;
an exchange module, for swapping the addresses of smA and smAx and the addresses of smB and smBx after the multiplication of smA and smB is finished.
In the single-precision matrix multiplication optimization system, the computing module includes a register bank conflict elimination module, for eliminating conflicts between registers when one column of matrix-A block data is loaded from smA into registers and one row of matrix-B block data is loaded from smB into registers; the register bank conflict elimination module allocates instructions with multiple source registers so that the source registers fall into different register banks.
In the system, the fused multiply-add instructions in the computing module access their operands in a mode that alternates single issue and dual issue.
In the system, the computing module further includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
In the system, when reading registers, the computing module reads data from global memory with the LDG.128 instruction and reads data from shared memory with the LDS.64 instruction.
From the above scheme, the advantages of the invention are: it avoids the register bank conflicts caused by FFMA instructions on Kepler GPUs; it reaches 97% instruction throughput; it utilizes the maximum bandwidth while avoiding shared-memory bank conflicts; and it reaches 88% of peak performance, about 15% more than the currently best Cublas.
Brief description of the drawings
Fig. 1 is a flow chart of matrix multiplication on the GPU using the double-buffering blocking technique;
Fig. 2 is a map of the register bank distribution on the GPU;
Fig. 3 is the register map of the 12x12 register block allocation;
Fig. 4 is a diagram of the dual-issue organization of FFMA instructions;
Fig. 5 is a diagram of the instruction computation order produced by dual-issue expansion under the 1-2-2-1 scheduling-block pattern.
Specific embodiment
Specifically, the overall flow involved in the present invention is shown in Fig. 1. The steps of the double-buffered matrix multiplication algorithm on the GPU are as follows:
Step 1: first block the original matrices according to bm (the column length of the matrix-A block) and bn (the row length of the matrix-B block), each block computing the <bm,bn>-dimensional output matrix block of C;
Step 2: create four temporary storage spaces smA, smB, smAx and smBx in shared memory (the second-level storage on the GPU);
Step 3: read an smA-sized sub-matrix of matrix A from global memory (the first-level storage on the GPU) into smA, and an smB-sized sub-matrix of matrix B into smB;
Step 4: in each iteration, load one column (matrix-A block data) from smA into registers and one row (matrix-B block data) from smB into registers, perform the matrix multiplication with FFMA (fused multiply-add) instructions, and, while the multiplication is being performed, read one column of the next smA and one row of the next smB from global memory into smAx and smBx;
Step 5: after the multiplication of smA and smB is finished, swap the addresses of smA and smAx, and swap the addresses of smB and smBx.
During optimization, the main issues to consider are the matrix blocking, register bank conflict elimination, dual issue of FFMA (fused multiply-add) instructions, and the use of vector instructions.
Register bank conflict elimination. When allocating registers for a sub-column of matrix A, a sub-row of matrix B and a sub-block of matrix C, three factors must be considered: correctness, absence of bank conflicts, and compact register indices. LDG.128 requires the register number to be 4-aligned: since NVIDIA GPUs have no 128-bit registers, four consecutive 32-bit registers RN, RN+1, RN+2 and RN+3 are combined to stand in for a 128-bit register. The present invention discovered an undocumented NVIDIA rule: N mod 4 (the remainder of dividing N by 4) must equal 0, otherwise an illegal-instruction error occurs at run time. This is understandable, since the alignment required by LDG.128 simplifies the hardware logic and further reduces power consumption. Given this alignment and N mod 4 = 0, the register bank distribution on the GPU can be seen from the following table.
The register bank conflict elimination step of the present invention is defined as follows:
Instructions with multiple source registers are allocated so that the source registers fall into different register banks. Specifically, for an instruction with n source registers, if those source registers are all located in different register banks, there is no register bank conflict.
For the instruction FFMA R0, R1, R4, R5 shown in Fig. 2, the source registers are distributed in bank1, bank2 and bank3, so there is no bank conflict.
Suppose a sub-column of A is allocated to [bank0, bank1] and a sub-row of B to [bank2, bank3]. Two choices then remain for C: [bank1 bank2; bank3 bank0] and [bank3 bank1; bank0 bank2]. Since the allocation patterns of A and B are interchangeable, there are 2x2 = 4 bank pattern choices in total, and these four patterns are equivalent in performance. The present invention arbitrarily chooses one of them: [bank0 bank1] for A, [bank2 bank3] for B, and [bank1 bank2; bank3 bank0] for C. Fig. 3 shows the register map of the 12x12 register block allocation; the numbers in the figure are register indices. It can be verified that for any Cij, the FFMA bank conflicts of Ai and Bj are completely eliminated. In addition, when allocating registers, the allocation is kept as compact as possible to avoid exceeding the 255-register limit.
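The two allocation rules above can be checked mechanically (an illustrative Python sketch; the function names `ffma_conflict_free` and `ldg128_operand_legal` are not part of the patent, and the bank mapping is passed in as a parameter since the patent's actual mapping comes from its table):

```python
def ffma_conflict_free(src_regs, bank_of):
    # An instruction has no register bank conflict iff all of its
    # source registers fall into different banks.
    banks = [bank_of(r) for r in src_regs]
    return len(set(banks)) == len(banks)

def ldg128_operand_legal(n):
    # LDG.128 combines four consecutive 32-bit registers RN..RN+3 into one
    # 128-bit operand; the undocumented rule found in the patent requires
    # N mod 4 == 0, otherwise an illegal-instruction error occurs.
    return n % 4 == 0
```

A register allocator for the 12x12 block could call `ffma_conflict_free` on every (Ai, Bj, Cij) triple to verify that the chosen bank pattern eliminates all FFMA conflicts.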
Dual issue of FFMA instructions. The Kepler architecture supports dual issue of FFMA instructions, which requires cooperation of software and hardware. Having the warp schedulers always dual-issue is unrealistic, because hardware is shared; single issue and dual issue must alternate. A Kepler SM has 4 warp schedulers, each assigned 32 cores, so the 4 schedulers consume 128 cores. A Kepler SM has 192 cores in total; the remaining 192 - 128 = 64 are split into 2 groups of 32 cores, each group shared by 2 warp schedulers. The best pattern is 1 group single-issuing (1 FFMA instruction), 2 groups dual-issuing (4 FFMA instructions) and 1 group single-issuing (1 FFMA instruction), organized as shown in Fig. 4, because two warps dual-issuing at the same time would compete for the shared resources and wait. Therefore, starting from the 3rd cycle, the four warps interlock with each other: each cycle issues 6 instructions, making good use of all the cores.
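The core accounting in this paragraph (192 cores, 4 schedulers, the 1-2-2-1 issue pattern) can be checked with a few lines (a sketch of the arithmetic only, not of the actual scheduling hardware; `kepler_sm_issue_accounting` is a name introduced here):

```python
def kepler_sm_issue_accounting():
    # Kepler SM: 192 cores, 4 warp schedulers with 32 dedicated cores
    # each, and the remaining cores split into shared groups of 32.
    cores_per_sm = 192
    schedulers = 4
    group = 32
    dedicated = schedulers * group                        # 128 cores
    shared_groups = (cores_per_sm - dedicated) // group   # 2 shared groups
    # 1-2-2-1 pattern: two schedulers single-issue (1 FFMA each) and
    # two schedulers dual-issue (2 FFMAs each).
    issue_pattern = [1, 2, 2, 1]
    instructions_per_cycle = sum(issue_pattern)           # 6 FFMAs/cycle
    cores_kept_busy = instructions_per_cycle * group      # one group per FFMA
    return instructions_per_cycle, cores_kept_busy, cores_per_sm
```

Six FFMA instructions per cycle, each feeding a 32-core group, is exactly what keeps all 192 cores of the SM busy.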
Fig. 5 shows the instruction computation order produced by dual-issue expansion of 7 instructions under the 1-2-2-1 scheduling-block pattern: the pairs of squares enclosed by ovals represent two dual-issued instructions, and the isolated squares represent single issue. The numbers in the squares indicate the computation order. The benefit of organizing the computation order this way is that register pressure is reduced through the register caching mechanism (the operand collector). The operand collector allows register operands to be cached and then used by subsequent instructions, reducing the number of register reads. Since there are only 4 register banks, two dual-issued FFMA instructions would read 6 registers, exceeding the number of register banks; the operand collector keeps the operands that the dual-issued FFMA instructions must access within 4 registers.
To maximize bandwidth utilization and avoid shared-memory bank conflicts, the invention also proposes a strategy for selecting memory access instructions of different bit widths (LDS and LDG). LDS, LDG, LDS.64, LDG.128 and so on are instructions that read data from storage: LDS reads from shared memory, LDG reads from global memory, and the trailing number indicates how many bits are read at once. According to the micro-benchmark tests of the present invention, LDG.128 is used to read data from global memory, since LDG.128 goes through the texture cache and has higher bandwidth; LDS.64 is used to read data from shared memory, since LDS.64 has higher bandwidth than LDS.128. The present invention also discovered experimentally that an LDS instruction starts reading its source registers only 2 clock cycles after issue, and that LDS.64 can be placed in a suitable position so that the source registers it reads have no bank conflicts with the FFMA instructions, whereas LDS.128 cannot. One conjectured reason is that the maximum LDS transaction is 256 bytes: with each thread reading 128 bits, the threads in a warp would read 512 bytes, exceeding the 256-byte limit and requiring several transactions, and it is unpredictable when the second transaction starts reading source registers. LDG.128 is used mainly because it has higher bandwidth and lower latency than LD.128: LDG goes through the texture cache, while on Kepler and later GPUs LD only goes through the L2 cache, not the L1 cache (the L1 cache is used only for local memory). Compared with LDG.32 and LDG.64, LDG.128 has higher bandwidth and reduces the number of load instructions.
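The transaction arithmetic behind the LDS.128 conjecture can be made explicit (an illustrative sketch; `warp_bytes` and `transactions_needed` are names introduced here, and the 256-byte transaction limit is the patent's experimental conjecture):

```python
def warp_bytes(bits_per_thread, warp_size=32):
    # Total bytes moved when every thread in a warp issues one load
    # of the given bit width.
    return warp_size * bits_per_thread // 8

def transactions_needed(bits_per_thread, max_transaction_bytes=256):
    # Ceiling division: how many shared-memory transactions one
    # warp-wide LDS of this width needs under the 256-byte limit.
    total = warp_bytes(bits_per_thread)
    return -(-total // max_transaction_bytes)
```

A warp-wide LDS.128 moves 512 bytes and thus splits into two transactions, while LDS.64 fits in one, which matches the observation that only LDS.64 can be scheduled to avoid bank conflicts with FFMA.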
The following is a system embodiment corresponding to the above method embodiment; this embodiment can be implemented in cooperation with the above embodiment. The relevant technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the relevant technical details mentioned in this embodiment are also applicable to the above embodiment.
The present invention also proposes a single-precision matrix multiplication optimization system based on NVIDIA Kepler GPU assembly instructions, comprising:
a blocking module, for blocking the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing the <bm,bn>-dimensional output matrix block of C;
a creation module, for creating four temporary storage spaces smA, smB, smAx and smBx in GPU second-level storage;
a reading module, for reading an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB;
a computing module, for loading, in each iteration, one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, reading the register contents, performing the matrix multiplication with fused multiply-add instructions, and, while the multiplication is being performed, reading one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx;
an exchange module, for swapping the addresses of smA and smAx and the addresses of smB and smBx after the multiplication of smA and smB is finished.
In the single-precision matrix multiplication optimization system, the computing module includes a register bank conflict elimination module, for eliminating conflicts between registers when one column of matrix-A block data is loaded from smA into registers and one row of matrix-B block data is loaded from smB into registers; the register bank conflict elimination module allocates instructions with multiple source registers so that the source registers fall into different register banks.
In the system, the fused multiply-add instructions in the computing module access their operands in a mode that alternates single issue and dual issue.
In the system, the computing module further includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
In the system, when reading registers, the computing module reads data from global memory with the LDG.128 instruction and reads data from shared memory with the LDS.64 instruction.
Other aspects and features of the invention will be apparent to those skilled in the art from the description of the specific embodiments with reference to the accompanying drawings. It should be noted that these embodiments are to be regarded as merely exemplary and not as limiting the invention. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the invention, but such changes and modifications shall all fall within the scope of the appended claims.

Claims (10)

1. A single-precision matrix multiplication optimization method, characterised in that the method is based on NVIDIA Kepler GPU assembly instructions and comprises:
step 1, blocking the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing the <bm,bn>-dimensional output matrix block of C;
step 2, creating four temporary storage spaces smA, smB, smAx and smBx in GPU second-level storage;
step 3, reading an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB;
step 4, loading, in each iteration, one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, reading the register contents, performing the matrix multiplication with fused multiply-add instructions, and, while the multiplication is being performed, reading one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx;
step 5, after the multiplication of smA and smB is finished, swapping the addresses of smA and smAx and the addresses of smB and smBx.
2. The single-precision matrix multiplication optimization method as claimed in claim 1, characterised in that step 4 includes a register bank conflict elimination step to eliminate conflicts between registers when one column of matrix-A block data is loaded from smA into registers and one row of matrix-B block data is loaded from smB into registers; the register bank conflict elimination step specifically allocates instructions with multiple source registers so that the source registers fall into different register banks.
3. The single-precision matrix multiplication optimization method as claimed in claim 1, characterised in that the fused multiply-add instructions in step 4 access their operands in a mode that alternates single issue and dual issue.
4. The single-precision matrix multiplication optimization method as claimed in claim 1, characterised in that step 4 further includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
5. The single-precision matrix multiplication optimization method as claimed in claim 1, characterised in that a strategy of selecting memory access instructions of different bit widths is used when reading registers in step 4, the method including reading data from global memory with the LDG.128 instruction and reading data from shared memory with the LDS.64 instruction.
6. A single-precision matrix multiplication optimization system, characterised in that the system is based on NVIDIA Kepler GPU assembly instructions and comprises:
a blocking module, for blocking the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing the <bm,bn>-dimensional output matrix block of C;
a creation module, for creating four temporary storage spaces smA, smB, smAx and smBx in GPU second-level storage;
a reading module, for reading an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB;
a computing module, for loading, in each iteration, one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, reading the register contents, performing the matrix multiplication with fused multiply-add instructions, and, while the multiplication is being performed, reading one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx;
an exchange module, for swapping the addresses of smA and smAx and the addresses of smB and smBx after the multiplication of smA and smB is finished.
7. The single-precision matrix multiplication optimization system as claimed in claim 6, characterised in that the computing module includes a register bank conflict elimination module, for eliminating conflicts between registers when one column of matrix-A block data is loaded from smA into registers and one row of matrix-B block data is loaded from smB into registers; the register bank conflict elimination module allocates instructions with multiple source registers so that the source registers fall into different register banks.
8. The single-precision matrix multiplication optimization system as claimed in claim 6, characterised in that the fused multiply-add instructions in the computing module access their operands in a mode that alternates single issue and dual issue.
9. The single-precision matrix multiplication optimization system as claimed in claim 6, characterised in that the computing module further includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
10. The single-precision matrix multiplication optimization system as claimed in claim 6, characterised in that, when reading registers, the computing module reads data from global memory with the LDG.128 instruction and reads data from shared memory with the LDS.64 instruction.
CN201611260732.XA 2016-12-30 2016-12-30 Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction Pending CN106681694A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611260732.XA CN106681694A (en) 2016-12-30 2016-12-30 Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction


Publications (1)

Publication Number Publication Date
CN106681694A true CN106681694A (en) 2017-05-17

Family

ID=58848732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611260732.XA Pending CN106681694A (en) 2016-12-30 2016-12-30 Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction

Country Status (1)

Country Link
CN (1) CN106681694A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102323917A (en) * 2011-09-06 2012-01-18 中国人民解放军国防科学技术大学 Shared memory based method for realizing multiprocess GPU (Graphics Processing Unit) sharing
KR101400577B1 (en) * 2013-03-11 2014-06-19 한양대학교 산학협력단 Method for multiplication of sparse matrix on the gpu
CN104102513A (en) * 2014-07-18 2014-10-15 西北工业大学 Kepler-architecture based CUDA (compute unified device architecture) runtime parameter transparent-optimization method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUNJIE LAI et al.: "Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs", CGO '13: 2013 International Symposium on Code Generation and Optimization *
LI Xiaowen et al.: "Automatic optimization of matrix multiplication algorithms for cache-based GPUs", Modern Electronics Technique (《现代电子技术》) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460533A (en) * 2017-09-06 2019-03-12 华为技术有限公司 Method and device for improving GEMM computing performance
CN109460533B (en) * 2017-09-06 2021-10-26 华为技术有限公司 Method and device for improving GEMM calculation performance
US10657442B2 (en) 2018-04-19 2020-05-19 International Business Machines Corporation Deep learning accelerator architecture with chunking GEMM
CN110147222A (en) * 2018-09-18 2019-08-20 北京中科寒武纪科技有限公司 Arithmetic unit and method
CN111610975A (en) * 2019-02-26 2020-09-01 深信服科技股份有限公司 Executable file type determination method, device, equipment and storage medium
CN110147248A (en) * 2019-04-19 2019-08-20 中国科学院计算技术研究所 Single-precision matrix multiplication optimization method and system accelerated using AMD GPU assembly instructions
CN110147248B (en) * 2019-04-19 2021-06-29 中国科学院计算技术研究所 Single-precision matrix multiplication optimization method and system using AMD GPU assembly instruction acceleration
CN113050988A (en) * 2019-12-27 2021-06-29 上海商汤智能科技有限公司 Data processing method and device
CN113805941A (en) * 2021-08-19 2021-12-17 贝式计算(天津)信息技术有限公司 System and method for accelerating application software by replacing instruction set
CN113805941B (en) * 2021-08-19 2023-12-12 贝式计算(天津)信息技术有限公司 System and method for accelerating application software by replacing instruction set
CN115617396A (en) * 2022-10-09 2023-01-17 上海燧原科技有限公司 Register allocation method and device applied to novel artificial intelligence processor
CN115617396B (en) * 2022-10-09 2023-08-29 上海燧原科技有限公司 Register allocation method and device applied to novel artificial intelligence processor

Similar Documents

Publication Publication Date Title
CN106681694A (en) Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction
CN107168683A (en) High-performance implementation method of GEMM dense matrix multiplication on the domestic Sunway 26010 many-core CPU
US20100115233A1 (en) Dynamically-selectable vector register partitioning
CN110337635A (en) System, method and apparatus for dot product operations
CN103902507B (en) Matrix multiplication computing device and method for a programmable algebra processor
CN109002659B (en) Fluid machinery simulation program optimization method based on super computer
CN117724763A (en) Apparatus, method and system for matrix operation accelerator instruction
US8615770B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
Romero et al. High performance implementations of the 2D Ising model on GPUs
CN105739951B (en) Fast GPU-based solution method for L1 minimization problems
CN103279379A (en) Methods and apparatus for scheduling instructions without instruction decode
CN110308982A (en) Shared memory multiplexing method and device
US20140257769A1 (en) Parallel algorithm for molecular dynamics simulation
US20210042261A1 (en) Data processing
CN106201870A (en) Method and device for testing a GPU
CN111459543B (en) Method for managing register file unit
Zubair et al. An optimized multicolor point-implicit solver for unstructured grid applications on graphics processing units
CN110414672B (en) Convolution operation method, device and system
CN115860066A (en) Batch-based neural network inference pipeline multiplexing method
Li et al. GPU matrix multiplication
Wei et al. Reconstructing permutation table to improve the Tabu Search for the PFSP on GPU
US8959497B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
Rivera et al. Ism2: Optimizing irregular-shaped matrix-matrix multiplication on gpus
Charlton et al. Two-dimensional batch linear programming on the GPU
US20220188380A1 (en) Data processing method and apparatus applied to graphics processing unit, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170517