CN106681694A - Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction - Google Patents
- Publication number
- CN106681694A CN106681694A CN201611260732.XA CN201611260732A CN106681694A CN 106681694 A CN106681694 A CN 106681694A CN 201611260732 A CN201611260732 A CN 201611260732A CN 106681694 A CN106681694 A CN 106681694A
- Authority
- CN
- China
- Prior art keywords
- matrix
- register
- smb
- sma
- read
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
Abstract
The invention relates to a single-precision matrix multiplication (SGEMM) optimization method based on NVIDIA Kepler GPU assembly instructions. The method comprises: partitioning the original matrices according to the column length bm of the A matrix block and the row length bn of the B matrix block, each block computing a <bm,bn> block of the output matrix C; creating four buffers smA, smB, smAx and smBx in GPU second-level storage (shared memory); reading a tile of size smA from matrix A in GPU first-level storage (global memory) into smA, and a tile of size smB from matrix B into smB; in each iteration, loading one column of the A block from smA into registers and one row of the B block from smB into registers, reading the register contents and performing the matrix multiplication with fused multiply-add instructions, while simultaneously prefetching one column of the next smA tile from global memory into smAx and one row of the next smB tile into smBx; and, after the multiplication of smA and smB completes, swapping the addresses of smA and smAx and the addresses of smB and smBx.
Description
Technical field
The present invention relates to the fields of deep learning, high-performance computing and GPGPU programming, and more particularly to a single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions.
Background technology
A GPU (graphics processing unit) is a chip specialized for image and video processing. Owing to its design characteristics (simplified control logic and a large number of compute units), early GPUs were used only for graphics and image processing, but as GPU chips have grown more powerful, the GPU has developed into the GPGPU (general-purpose GPU) and its versatility has improved substantially. GPUs are now widely used in embedded systems, intelligent terminals, PCs, workstations and similar equipment. The Tesla series of GPUs, released by NVIDIA, is designed specifically for numerical computation and offers higher floating-point performance than ordinary GPUs. The Tesla K20m GPU, for example, has 13 SMs with 192 SP units per SM; its peak performance reaches 3520 GFLOPS, exceeding most mainstream CPU chips.
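As a back-of-envelope check of the quoted peak (the patent does not state a clock frequency; the 706 MHz used below is the commonly cited K20m core clock and is an assumption here), the figure follows from one fused multiply-add, i.e. 2 flops, per SP per cycle:

```python
# Back-of-envelope peak-performance check for the Tesla K20m figures quoted
# above. The 706 MHz core clock is an assumed value (not stated in the patent).
sms = 13             # streaming multiprocessors
sp_per_sm = 192      # single-precision cores per SM
flops_per_cycle = 2  # one fused multiply-add = multiply + add
clock_ghz = 0.706    # assumed K20m core clock

peak_gflops = sms * sp_per_sm * flops_per_cycle * clock_ghz
print(round(peak_gflops))  # 3524, consistent with the ~3520 GFLOPS quoted
```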
SGEMM (single-precision general matrix multiplication) is an important function in BLAS numerical libraries. On multi-core CPUs, math libraries such as MKL, ATLAS and OpenBLAS have tuned SGEMM very carefully, and recently increasing attention has focused on optimizing GEMM programs on the GPU. On the GPU, cuBLAS is the high-performance BLAS math library provided by NVIDIA, but cuBLAS has the following disadvantages:
1. It is closed source. The meaning of each assembly instruction cannot be inspected directly. NVIDIA provides the PTX pseudo-assembly, but PTX generates different assembly instructions depending on the GPU chip; because it is not native assembly, each part of the generated assembly cannot be controlled, and a single PTX instruction sometimes expands into several assembly instructions, which is inconvenient for optimization.
2. Its performance is low. An SGEMM function generally calls different processing routines according to whether the two input matrices are transposed (here t denotes transposed and n denotes not transposed). The cuBLAS subprograms nt, tt, tn and nn reach at most about 74% of peak performance, and as little as about 60%, so an SGEMM program with higher performance needs to be developed.
Summary of the invention
In view of the shortcomings of the prior art, the present invention proposes a single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions.
A single-precision matrix multiplication optimization method, wherein the method is based on NVIDIA Kepler GPU assembly instructions and comprises:
Step 1: partitioning the original matrices according to the column length bm of the A matrix block and the row length bn of the B matrix block, each block computing a <bm,bn> block of the output matrix C;
Step 2: creating four temporary buffers smA, smB, smAx and smBx in GPU second-level storage;
Step 3: reading a tile of size smA from matrix A in GPU first-level storage into smA, and a tile of size smB from matrix B into smB;
Step 4: in each iteration, loading one column of the A block from smA into registers and one row of the B block from smB into registers, reading the register contents and performing the matrix multiplication with fused multiply-add instructions, while simultaneously reading one column of the next smA tile from GPU first-level storage into smAx and one row of the next smB tile into smBx;
Step 5: after the multiplication of smA and smB finishes, swapping the addresses of smA and smAx, and the addresses of smB and smBx.
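As an illustration only (the patent implements this in Kepler assembly; here pure Python stands in for global memory, the shared-memory buffers and the registers, and the "address swap" of step 5 becomes a reference swap), the five steps above can be sketched as a double-buffered blocked multiplication:

```python
# Minimal sketch of steps 1-5: blocked matrix multiply with double-buffered
# tiles. smA/smB hold the tiles being computed on; smAx/smBx hold prefetched
# tiles for the next iteration. Tile sizes bm, bn, bk are illustrative.
def blocked_matmul(A, B, bm=2, bn=2, bk=2):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, bm):              # step 1: each block computes a
        for j0 in range(0, N, bn):          # <bm, bn> block of C
            # steps 2-3: "create" buffers and load the first tile pair
            load = lambda k0: (
                [[A[i0 + i][k0 + k] for k in range(bk)] for i in range(bm)],
                [[B[k0 + k][j0 + j] for j in range(bn)] for k in range(bk)])
            smA, smB = load(0)
            for k0 in range(0, K, bk):
                # step 4: prefetch next tiles (sequential here, overlapped on GPU)
                if k0 + bk < K:
                    smAx, smBx = load(k0 + bk)
                for k in range(bk):          # fused multiply-add: c += a * b
                    for i in range(bm):
                        for j in range(bn):
                            C[i0 + i][j0 + j] += smA[i][k] * smB[k][j]
                if k0 + bk < K:
                    smA, smB = smAx, smBx    # step 5: swap buffer references
    return C

A = [[1.0, 2.0, 3.0, 4.0]] * 4
B = [[1.0] * 4 for _ in range(4)]
print(blocked_matmul(A, B)[0][0])  # 10.0 = 1 + 2 + 3 + 4
```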
In the single-precision matrix multiplication optimization method, step 4 includes a register bank conflict elimination step, to eliminate conflicts between registers when loading a column of A block data from smA into registers and a row of B block data from smB into registers. Specifically, for instructions with multiple source registers, the register bank conflict elimination step allocates the source registers of each instruction to different register banks.
In the method, the fused multiply-add instructions in step 4 access operands in a pattern that alternates single issue and dual issue.
In the method, step 4 also includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
In the method, step 4 applies a strategy of load instructions of different bit widths when reading data: LDG.128 instructions are used to read data from global memory, and LDS.64 instructions are used to read data from shared memory.
The invention also proposes a single-precision matrix multiplication optimization system, wherein the system is based on NVIDIA Kepler GPU assembly instructions and comprises:
a partitioning module, for partitioning the original matrices according to the column length bm of the A matrix block and the row length bn of the B matrix block, each block computing a <bm,bn> block of the output matrix C;
a creation module, for creating four temporary buffers smA, smB, smAx and smBx in GPU second-level storage;
a read module, for reading a tile of size smA from matrix A in GPU first-level storage into smA, and a tile of size smB from matrix B into smB;
a computing module, for loading, in each iteration, one column of the A block from smA into registers and one row of the B block from smB into registers, reading the register contents and performing the matrix multiplication with fused multiply-add instructions, while simultaneously reading one column of the next smA tile from GPU first-level storage into smAx and one row of the next smB tile into smBx;
an exchange module, for swapping the addresses of smA and smAx, and the addresses of smB and smBx, after the multiplication of smA and smB finishes.
In the system, the computing module includes a register bank conflict elimination module, for eliminating conflicts between registers when loading a column of A block data from smA into registers and a row of B block data from smB into registers; the module allocates the source registers of each instruction with multiple source registers to different register banks.
In the system, the fused multiply-add instructions in the computing module access operands in a pattern that alternates single issue and dual issue.
In the system, the computing module also includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
In the system, the computing module reads data from global memory using LDG.128 instructions and reads data from shared memory using LDS.64 instructions.
The advantages of the above scheme are as follows: the invention avoids the register bank conflicts caused by FFMA instructions on Kepler GPUs; it reaches 97% instruction throughput; it exploits the maximum bandwidth while avoiding shared memory bank conflicts; and it reaches 88% of peak performance, about 15% higher than the current best cuBLAS.
Brief description of the drawings
Fig. 1 is the flow chart of the double-buffered blocked matrix multiplication on the GPU;
Fig. 2 is the register bank layout on the GPU;
Fig. 3 is the register map of the 12x12 register blocking;
Fig. 4 shows the dual-issue organization of FFMA instructions;
Fig. 5 shows the instruction computation order obtained by expanding dual issue in the 1-2-2-1 scheduling-block pattern.
Specific embodiment
Specifically, the double-buffered matrix multiplication flow of the present invention is shown in Figure 1. The steps of the double-buffered matrix multiplication algorithm on the GPU are as follows:
Step 1: first partition the original matrices according to bm (the column length of the A matrix block) and bn (the row length of the B matrix block); each block computes a <bm,bn> block of the output matrix C;
Step 2: create four temporary buffers smA, smB, smAx and smBx in shared memory (the second-level storage on the GPU);
Step 3: read a tile of size smA from matrix A in global memory (the first-level storage on the GPU) into smA, and a tile of size smB from matrix B into smB;
Step 4: in each iteration, load one column (of A block data) from smA into registers and one row (of B block data) from smB into registers; perform the matrix multiplication with FFMA (fused multiply-add) instructions, and while computing, read one column of the next smA and one row of the next smB from global memory into smAx and smBx;
Step 5: after the multiplication of smA and smB finishes, swap the addresses of smA and smAx, and the addresses of smB and smBx.
During optimization, the main considerations are the matrix blocking, register bank conflict elimination, dual issue of FFMA (fused multiply-add) instructions, and the use of vector instructions.
Register bank conflict resolvings.To a son row of A matrixes, the sub-block distribution of a sub-line and C matrixes of B matrixes is posted
Storage, the present invention needs to consider three factors:Correctness, without bank conflict and compact register index.LDG.128 requirements
The byte-aligned of register number 4.Because NVIDIA GPU without the register of 128, it is necessary to continuously be posted with 4 32
Storage RN, RN+1, RN+2, RN+3 carry out the register of combination replacement 128.Present invention discover that a NVIDIA is undocumented about
Fixed, N moulds 4 (representing that N does the remainder of division to 4) are necessarily equal to 0, and the mistake of illegal instruction occurs when can otherwise run.This
It can be appreciated that the requirement of LDG.128 nybbles alignment can make hardware logic simple, power consumption is further reduced.Due to nybble
Align and N moulds 4 are equal to 0, register bank distributions on GPU are found out from following table.
The register bank conflict elimination step of the present invention is defined as follows: for an instruction with n source registers, if those source registers all lie in different register banks, no register bank conflict occurs; accordingly, instructions with multiple source registers are allocated so that their source registers are placed in different register banks.
For example, for the instruction FFMA R0, R1, R4, R5 shown in Fig. 2, the source registers are distributed across bank1, bank2 and bank3, so there is no bank conflict.
Suppose a sub-column of A is assigned the banks [bank0, bank1] and a sub-row of B the banks [bank2, bank3]. Two choices then remain for C: [bank1 bank2; bank3 bank0] and [bank3 bank1; bank0 bank2]. Since the allocation patterns of A and B may also be interchanged, there are 2x2 = 4 bank pattern choices in total, and the four patterns are equivalent in performance. The invention arbitrarily chooses one of them: [bank0 bank1] for A, [bank2 bank3] for B, and [bank1 bank2; bank3 bank0] for C. Fig. 3 shows the register map of the 12x12 register blocking; the numbers in the figure are register indices. It can be verified that, for any Cij, the FFMA bank conflicts with Ai and Bj are completely eliminated. In addition, registers are allocated as compactly as possible to avoid exceeding the limit of 255 registers.
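The conflict-freedom claim for the chosen pattern can be checked mechanically. The sketch below is an illustration under the simplifying assumption that an operand's bank follows the parity-based 2x2 pattern described above (the exact hardware mapping comes from the undocumented table mentioned earlier):

```python
# Verify the bank-allocation claim: with the patent's chosen pattern, every
# FFMA computing Cij = Ai * Bj + Cij reads its three source operands from
# three distinct banks. Bank assignment by row/column parity is an assumed
# model of the layout described in the text.
A_BANK = {0: 0, 1: 1}            # sub-column of A: [bank0, bank1]
B_BANK = {0: 2, 1: 3}            # sub-row of B:    [bank2, bank3]
C_BANK = {(0, 0): 1, (0, 1): 2,  # C block pattern: [bank1 bank2; bank3 bank0]
          (1, 0): 3, (1, 1): 0}

def ffma_conflict_free(n=12):
    """Check all n x n FFMA instructions Cij = Ai*Bj + Cij for bank conflicts."""
    for i in range(n):
        for j in range(n):
            banks = {A_BANK[i % 2], B_BANK[j % 2], C_BANK[(i % 2, j % 2)]}
            if len(banks) != 3:  # a duplicate bank would mean a conflict
                return False
    return True

print(ffma_conflict_free())  # True: the chosen pattern eliminates all conflicts
```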
Dual issue of FFMA instructions. The Kepler architecture supports dual issue of FFMA instructions, which can only be sustained by combining software and hardware. Having the warp schedulers dual-issue at all times is unrealistic, because hardware resources are shared; single issue and dual issue must alternate. A Kepler SM has 4 warp schedulers, and each scheduler is assigned 32 cores, so the 4 schedulers consume 128 cores. A Kepler SM has 192 cores in total; the remaining 192 - 128 = 64 cores are split into 2 groups of 32 cores, each group shared by 2 warp schedulers. The best pattern is 1 group single issue (1 FFMA instruction), 2 groups dual issue (4 FFMA instructions) and 1 group single issue (1 FFMA instruction). Its organization is shown in Fig. 4. Because two warps that dual-issue at the same time compete for the shared resources and must wait, from the 3rd cycle onward the four warps interlock with each other, and each cycle issues 6 instructions, making good use of all the cores.
Fig. 5 shows the instruction computation order obtained by expanding dual issue over a scheduling block of 7 instructions in the 1-2-2-1 pattern: the pairs of squares enclosed by an ellipse represent two dual-issued instructions, and an isolated square represents single issue. The numbers in the squares give the computation order. The benefit of organizing the computation this way is that register pressure is reduced through the register caching mechanism (the operand collector). The operand collector allows a register operand to be cached and then reused by subsequent instructions, reducing the number of register reads. Because there are only 4 register banks, two dual-issued FFMA instructions would read 6 registers, exceeding the number of register banks; the operand collector keeps the operands that the dual-issued FFMA instructions must fetch within 4 registers.
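The core-count arithmetic behind the 1-2-2-1 pattern can be made explicit. This is a counting sketch of the scheme described above, not a cycle-accurate hardware model:

```python
# Counting sketch for the 1-2-2-1 issue pattern: per cycle, one scheduler
# single-issues, two schedulers dual-issue, and one scheduler single-issues,
# giving 1 + 2 + 2 + 1 = 6 warp instructions. Each warp instruction occupies
# 32 cores, exactly saturating a 192-core Kepler SM.
PATTERN = (1, 2, 2, 1)  # instructions issued per scheduler in one cycle
CORES_PER_INSTR = 32    # one warp instruction runs on 32 cores
CORES_PER_SM = 192

issued = sum(PATTERN)
print(issued)                                    # 6 instructions per cycle
print(issued * CORES_PER_INSTR == CORES_PER_SM)  # True: all 192 cores busy
```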
To maximize bandwidth utilization and avoid shared memory bank conflicts, the invention also proposes a strategy for selecting load instructions of different bit widths (LDS and LDG). LDS, LDG, LDS.64, LDG.128 and so on are instructions that read data from storage: LDS reads from shared memory, LDG reads from global memory, and the number after the dot indicates the bit width read at once. The memory access widths of LDS and LDG are selected as follows. According to the results of the inventors' micro-benchmark programs, LDG.128 is used to read data from global memory, because LDG.128 goes through the texture cache and has higher bandwidth; LDS.64 is used to read data from shared memory, because LDS.64 has higher bandwidth than LDS.128. The inventors also found experimentally that an LDS instruction starts reading its source registers 2 clock cycles after issue, and that LDS.64 can be placed so that the source registers it reads have no bank conflicts with the FFMA instructions, whereas LDS.128 cannot. One conjectured reason is that the maximum LDS transaction is 256 bytes: with each thread reading 128 bits, the threads of a warp read 512 bytes, which exceeds the 256-byte limit and requires several transactions, and when the second transaction starts reading its source registers is unpredictable. LDG.128 is used mainly because it has higher bandwidth and lower latency than LD.128: LDG goes through the texture cache, while on Kepler and later GPUs LD goes through only the L2 cache and not the L1 cache (the L1 cache is used only for local memory). LDG.128 also has higher bandwidth than LDG.32 and LDG.64, and reduces the number of load instructions.
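The transaction argument above can be checked with simple arithmetic. The sketch below takes the patent's conjectured 256-byte maximum LDS transaction as given and counts how many transactions a warp-wide shared-memory load needs at each access width:

```python
# Worked check of the conjecture above: number of 256-byte LDS transactions a
# warp-wide shared-memory load needs at each per-thread access width. The
# 256-byte maximum transaction size is the patent's conjectured limit.
WARP_SIZE = 32
MAX_LDS_TRANSACTION = 256  # bytes

def lds_transactions(bits_per_thread):
    total_bytes = WARP_SIZE * bits_per_thread // 8
    return -(-total_bytes // MAX_LDS_TRANSACTION)  # ceiling division

print(lds_transactions(64))   # 1: an LDS.64 warp load fits in one transaction
print(lds_transactions(128))  # 2: an LDS.128 warp load needs two transactions
```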
Below is a system embodiment corresponding to the method embodiment above; the two embodiments can be implemented in cooperation with each other. The relevant technical details mentioned in the method embodiment remain valid in the present embodiment and, to reduce repetition, are not repeated here; correspondingly, the relevant technical details mentioned in the present embodiment also apply to the method embodiment.
The present invention also proposes a single-precision matrix multiplication optimization system, wherein the system is based on NVIDIA Kepler GPU assembly instructions and comprises:
a partitioning module, for partitioning the original matrices according to the column length bm of the A matrix block and the row length bn of the B matrix block, each block computing a <bm,bn> block of the output matrix C;
a creation module, for creating four temporary buffers smA, smB, smAx and smBx in GPU second-level storage;
a read module, for reading a tile of size smA from matrix A in GPU first-level storage into smA, and a tile of size smB from matrix B into smB;
a computing module, for loading, in each iteration, one column of the A block from smA into registers and one row of the B block from smB into registers, reading the register contents and performing the matrix multiplication with fused multiply-add instructions, while simultaneously reading one column of the next smA tile from GPU first-level storage into smAx and one row of the next smB tile into smBx;
an exchange module, for swapping the addresses of smA and smAx, and the addresses of smB and smBx, after the multiplication of smA and smB finishes.
In the system, the computing module includes a register bank conflict elimination module, for eliminating conflicts between registers when loading a column of A block data from smA into registers and a row of B block data from smB into registers; the module allocates the source registers of each instruction with multiple source registers to different register banks.
In the system, the fused multiply-add instructions in the computing module access operands in a pattern that alternates single issue and dual issue.
In the system, the computing module also includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
In the system, the computing module reads data from global memory using LDG.128 instructions and reads data from shared memory using LDS.64 instructions.
Other aspects and features of the present invention will be apparent to those skilled in the art from the description of the specific embodiments of the invention with reference to the accompanying drawings. It should be noted that these embodiments are to be regarded as merely exemplary and not as limiting the invention. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make various corresponding changes and modifications based on the present invention, and all such changes and modifications shall fall within the scope of the appended claims.
Claims (10)
1. A single-precision matrix multiplication optimization method, characterized in that the method is based on NVIDIA Kepler GPU assembly instructions and comprises:
step 1: partitioning the original matrices according to the column length bm of the A matrix block and the row length bn of the B matrix block, each block computing a <bm,bn> block of the output matrix C;
step 2: creating four temporary buffers smA, smB, smAx and smBx in GPU second-level storage;
step 3: reading a tile of size smA from matrix A in GPU first-level storage into smA, and a tile of size smB from matrix B into smB;
step 4: in each iteration, loading one column of the A block from smA into registers and one row of the B block from smB into registers, reading the register contents and performing the matrix multiplication with fused multiply-add instructions, while simultaneously reading one column of the next smA tile from GPU first-level storage into smAx and one row of the next smB tile into smBx;
step 5: after the multiplication of smA and smB finishes, swapping the addresses of smA and smAx, and the addresses of smB and smBx.
2. The single-precision matrix multiplication optimization method of claim 1, characterized in that step 4 includes a register bank conflict elimination step, to eliminate conflicts between registers when loading a column of A block data from smA into registers and a row of B block data from smB into registers; specifically, for instructions with multiple source registers, the register bank conflict elimination step allocates the source registers of each instruction to different register banks.
3. The single-precision matrix multiplication optimization method of claim 1, characterized in that the fused multiply-add instructions in step 4 access operands in a pattern that alternates single issue and dual issue.
4. The single-precision matrix multiplication optimization method of claim 1, characterized in that step 4 also includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
5. The single-precision matrix multiplication optimization method of claim 1, characterized in that step 4 applies a strategy of load instructions of different bit widths when reading data, the strategy comprising: reading data from global memory using LDG.128 instructions, and reading data from shared memory using LDS.64 instructions.
6. A single-precision matrix multiplication optimization system, characterized in that the system is based on NVIDIA Kepler GPU assembly instructions and comprises:
a partitioning module, for partitioning the original matrices according to the column length bm of the A matrix block and the row length bn of the B matrix block, each block computing a <bm,bn> block of the output matrix C;
a creation module, for creating four temporary buffers smA, smB, smAx and smBx in GPU second-level storage;
a read module, for reading a tile of size smA from matrix A in GPU first-level storage into smA, and a tile of size smB from matrix B into smB;
a computing module, for loading, in each iteration, one column of the A block from smA into registers and one row of the B block from smB into registers, reading the register contents and performing the matrix multiplication with fused multiply-add instructions, while simultaneously reading one column of the next smA tile from GPU first-level storage into smAx and one row of the next smB tile into smBx;
an exchange module, for swapping the addresses of smA and smAx, and the addresses of smB and smBx, after the multiplication of smA and smB finishes.
7. single precision Matrix Multiplication as claimed in claim 6 optimizes system, it is characterised in that the computing module includes register
Bank conflict resolving modules, for eliminate from the smA load one row A partitioning of matrix data to register, and from the smB loading one
Row B partitionings of matrix data to the conflict between register during register, for multiple sources post by register bank conflict resolving modules
Storage instruction is allocated, and multiple source register instructions are arranged respectively in different register.
8. single precision Matrix Multiplication as claimed in claim 6 optimization system, it is characterised in that this in the computing module is multiply-add to be melted
Close to instruct to be penetrated using single-shot and access operand with double alternate modes of transmitting.
9. single precision Matrix Multiplication as claimed in claim 6 optimizes system, it is characterised in that the computing module also includes register
Caching mechanism, the mechanism is cached by by the operand of the register, so that the multiply-add fusion instruction is used.
10. The single-precision matrix multiplication optimization system of claim 6, wherein, when reading data into registers, the computing module reads data from global memory using LDG.128 instructions and reads data from shared memory using LDS.64 instructions.
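For orientation, LDG.128 moves 16 bytes (four packed floats) per instruction and LDS.64 moves 8 bytes (two floats), so the number of issued load instructions drops by 4x and 2x respectively. A host-side C analogue of the 128-bit case (the float4_t struct is a stand-in for CUDA's built-in float4 vector type; the real instruction additionally requires 16-byte-aligned addresses):

```c
#include <assert.h>
#include <string.h>

/* Stand-in for CUDA's float4 vector type: one LDG.128 would
 * move all four floats in a single wide load instruction. */
typedef struct { float x, y, z, w; } float4_t;

/* Copy n floats (n divisible by 4) four at a time, mimicking
 * how a kernel using LDG.128 issues one load per four elements
 * instead of four scalar loads. */
static void copy_vec4(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i += 4) {
        float4_t v;
        memcpy(&v, src + i, sizeof v);   /* one 16-byte "load"  */
        memcpy(dst + i, &v, sizeof v);   /* one 16-byte "store" */
    }
}
```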
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611260732.XA CN106681694A (en) | 2016-12-30 | 2016-12-30 | Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106681694A true CN106681694A (en) | 2017-05-17 |
Family
ID=58848732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611260732.XA Pending CN106681694A (en) | 2016-12-30 | 2016-12-30 | Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106681694A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102323917A (en) * | 2011-09-06 | 2012-01-18 | 中国人民解放军国防科学技术大学 | Shared memory based method for realizing multiprocess GPU (Graphics Processing Unit) sharing |
KR101400577B1 (en) * | 2013-03-11 | 2014-06-19 | 한양대학교 산학협력단 | Method for multiplication of sparse matrix on the gpu |
CN104102513A (en) * | 2014-07-18 | 2014-10-15 | 西北工业大学 | Kepler-architecture based CUDA (compute unified device architecture) runtime parameter transparent-optimization method |
Non-Patent Citations (2)
Title |
---|
JUNJIE LAI等: "Performance Upper bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs", 《CGO’13-2013 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION》 * |
李晓雯等: "缓存结构GPU矩阵乘法算法的自动优化", 《现代电子技术》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460533A (en) * | 2017-09-06 | 2019-03-12 | 华为技术有限公司 | A kind of method and device improving GEMM calculated performance |
CN109460533B (en) * | 2017-09-06 | 2021-10-26 | 华为技术有限公司 | Method and device for improving GEMM calculation performance |
US10657442B2 (en) | 2018-04-19 | 2020-05-19 | International Business Machines Corporation | Deep learning accelerator architecture with chunking GEMM |
CN110147222A (en) * | 2018-09-18 | 2019-08-20 | 北京中科寒武纪科技有限公司 | Arithmetic unit and method |
CN111610975A (en) * | 2019-02-26 | 2020-09-01 | 深信服科技股份有限公司 | Executable file type determination method, device, equipment and storage medium |
CN110147248A (en) * | 2019-04-19 | 2019-08-20 | 中国科学院计算技术研究所 | The single precision Matrix Multiplication optimization method and system accelerated using AMD GPU assembly instruction |
CN110147248B (en) * | 2019-04-19 | 2021-06-29 | 中国科学院计算技术研究所 | Single-precision matrix multiplication optimization method and system using AMD GPU assembly instruction acceleration |
CN113050988A (en) * | 2019-12-27 | 2021-06-29 | 上海商汤智能科技有限公司 | Data processing method and device |
CN113805941A (en) * | 2021-08-19 | 2021-12-17 | 贝式计算(天津)信息技术有限公司 | System and method for accelerating application software by replacing instruction set |
CN113805941B (en) * | 2021-08-19 | 2023-12-12 | 贝式计算(天津)信息技术有限公司 | System and method for accelerating application software by replacing instruction set |
CN115617396A (en) * | 2022-10-09 | 2023-01-17 | 上海燧原科技有限公司 | Register allocation method and device applied to novel artificial intelligence processor |
CN115617396B (en) * | 2022-10-09 | 2023-08-29 | 上海燧原科技有限公司 | Register allocation method and device applied to novel artificial intelligence processor |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106681694A (en) | Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction | |
CN107168683A (en) | GEMM dense matrix multiply high-performance implementation method on the domestic many-core CPU of Shen prestige 26010 | |
US20100115233A1 (en) | Dynamically-selectable vector register partitioning | |
CN110337635A (en) | System, method and apparatus for dot product operations | |
CN103902507B (en) | Matrix multiplication calculating device and matrix multiplication calculating method both oriented to programmable algebra processor | |
CN109002659B (en) | Fluid machinery simulation program optimization method based on super computer | |
CN117724763A (en) | Apparatus, method and system for matrix operation accelerator instruction | |
US8615770B1 (en) | System and method for dynamically spawning thread blocks within multi-threaded processing systems | |
Romero et al. | High performance implementations of the 2D Ising model on GPUs | |
CN105739951B (en) | A kind of L1 minimization problem fast solution methods based on GPU | |
CN103279379A (en) | Methods and apparatus for scheduling instructions without instruction decode | |
CN110308982A (en) | A kind of shared drive multiplexing method and device | |
US20140257769A1 (en) | Parallel algorithm for molecular dynamics simulation | |
US20210042261A1 (en) | Data processing | |
CN106201870A (en) | A kind of method and device testing GPU | |
CN111459543B (en) | Method for managing register file unit | |
Zubair et al. | An optimized multicolor point-implicit solver for unstructured grid applications on graphics processing units | |
CN110414672B (en) | Convolution operation method, device and system | |
CN115860066A (en) | Neural network reasoning pipeline multiplexing method based on batch processing | |
Li et al. | GPU matrix multiplication | |
Wei et al. | Reconstructing permutation table to improve the Tabu Search for the PFSP on GPU | |
US8959497B1 (en) | System and method for dynamically spawning thread blocks within multi-threaded processing systems | |
Rivera et al. | Ism2: Optimizing irregular-shaped matrix-matrix multiplication on gpus | |
Charlton et al. | Two-dimensional batch linear programming on the GPU | |
US20220188380A1 (en) | Data processing method and apparatus applied to graphics processing unit, and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2017-05-17