CN106681694A - Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction - Google Patents

Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction

Info

Publication number
CN106681694A
CN106681694A CN201611260732.XA
Authority
CN
China
Prior art keywords
matrix
register
smb
sma
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611260732.XA
Other languages
Chinese (zh)
Inventor
谭光明
张秀霞
周可人
王朝尉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese Academy Of Sciences State Owned Assets Management Co ltd
Institute of Computing Technology of CAS
Original Assignee
Chinese Academy Of Sciences State Owned Assets Management Co ltd
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese Academy Of Sciences State Owned Assets Management Co ltd, Institute of Computing Technology of CAS filed Critical Chinese Academy Of Sciences State Owned Assets Management Co ltd
Priority to CN201611260732.XA priority Critical patent/CN106681694A/en
Publication of CN106681694A publication Critical patent/CN106681694A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 Register arrangements
    • G06F 9/3012 Organisation of register space, e.g. banked or distributed register file

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to a single-precision matrix multiplication optimization method based on NVIDIA Kepler GPU assembly instructions. The method comprises: blocking the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing a <bm,bn>-dimensional block of the output matrix C; creating four storage spaces smA, smB, smAx and smBx in GPU second-level storage; reading an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB; in each iteration, loading one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, reading the register contents and performing the matrix multiplication with fused multiply-add instructions, while concurrently reading one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx; and, after the multiplication of smA and smB is finished, swapping the addresses of smA and smAx and the addresses of smB and smBx.

Description

Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions
Technical field
The present invention relates to the fields of deep learning, high-performance computing and GPGPU programming, and more particularly to a single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions.
Background technology
A GPU (graphics processing unit) is a chip specialized for image and video processing. Owing to its design characteristics (simplified control logic, a large number of compute units), early GPUs were used only for graphics and image processing. As GPU chips grew more powerful, the GPU evolved toward the GPGPU (general-purpose GPU); that is, its generality improved greatly. At present, GPUs are widely used in embedded systems, intelligent terminals, PCs, workstations and other equipment. The Tesla series is a family of GPUs released by NVIDIA specifically for numerical computation; compared with ordinary GPUs, its floating-point performance is higher. For example, the Tesla K20m GPU has 13 SMs with 192 SP units per SM, and its peak performance can reach 3520 GFLOPS, exceeding most mainstream CPU chips.
SGEMM (single-precision general matrix multiplication) is an important function in the BLAS numerical library. On multi-core CPUs, math libraries such as MKL, ATLAS and OpenBLAS have tuned SGEMM very carefully. Recently, increasing attention has focused on the optimization of GEMM programs on GPUs. On the GPU, Cublas is the high-performance BLAS math library provided by NVIDIA, but Cublas has the following disadvantages:
1. It is not open source. The meaning of each assembly instruction cannot be understood directly. NVIDIA provides the PTX pseudo-assembly, which produces different assembly instructions depending on the GPU chip; however, because PTX is not the native instruction set, each part of the generated assembly cannot be controlled, and a single PTX instruction may sometimes generate multiple assembly instructions, which is inconvenient for optimization.
2. Its performance is low. An SGEMM function generally calls different processing routines according to whether the two input matrices are transposed (t denotes transposed, n denotes not transposed). The sub-routines nt, tt, tn and nn of Cublas reach at most only about 74% of peak performance, and as little as about 60%, so an SGEMM program of higher performance needs to be developed.
The content of the invention
In view of the shortcomings of the prior art, the present invention proposes a single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instructions.
A single-precision matrix multiplication optimization method based on NVIDIA Kepler GPU assembly instructions, comprising:
Step 1: block the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing the <bm,bn>-dimensional output matrix block of C;
Step 2: create four temporary storage spaces smA, smB, smAx and smBx in GPU second-level storage;
Step 3: read an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB;
Step 4: in each iteration, load one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, read the register contents and perform the matrix multiplication with fused multiply-add instructions; while the multiplication is being performed, read one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx;
Step 5: after the multiplication of smA and smB is finished, swap the addresses of smA and smAx, and swap the addresses of smB and smBx.
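The five steps above can be sketched as a host-side simulation (a minimal Python sketch, not GPU code; the function names `sgemm_double_buffered` and `matmul_block` are illustrative assumptions, and the sketch assumes M, N, K are divisible by bm, bn, bk). The shared-memory buffers smA/smB/smAx/smBx are modeled as plain lists, and the address swap of step 5 becomes a reference swap:

```python
def matmul_block(C, smA, smB, bm, bn, bk):
    # Step 4 core: C += smA (bm x bk) * smB (bk x bn), accumulated one
    # column of A / one row of B at a time, as the FFMA loop would.
    for k in range(bk):
        for i in range(bm):
            a = smA[i][k]  # one element of the A column held in "registers"
            for j in range(bn):
                C[i][j] += a * smB[k][j]  # fused multiply-add

def sgemm_double_buffered(A, B, bm, bn, bk):
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    # Step 1: each <bm, bn> block of C is processed independently.
    for bi in range(0, M, bm):
        for bj in range(0, N, bn):
            Cblk = [[0.0] * bn for _ in range(bm)]
            # Steps 2-3: fill the first pair of "shared memory" buffers.
            smA = [[A[bi + i][k] for k in range(bk)] for i in range(bm)]
            smB = [[B[k][bj + j] for j in range(bn)] for k in range(bk)]
            for k0 in range(0, K, bk):
                nxt = k0 + bk
                if nxt < K:  # Step 4: prefetch the next tiles into smAx/smBx
                    smAx = [[A[bi + i][nxt + k] for k in range(bk)]
                            for i in range(bm)]
                    smBx = [[B[nxt + k][bj + j] for j in range(bn)]
                            for k in range(bk)]
                matmul_block(Cblk, smA, smB, bm, bn, bk)
                if nxt < K:  # Step 5: swap buffer "addresses"
                    smA, smB = smAx, smBx
            for i in range(bm):
                for j in range(bn):
                    C[bi + i][bj + j] = Cblk[i][j]
    return C
```

On a real Kepler GPU the prefetch overlaps with the FFMA loop; the sequential sketch only models the data flow and buffer swapping.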
In the single-precision matrix multiplication optimization method, step 4 includes a register bank conflict elimination step to eliminate conflicts between registers when one column of matrix-A block data is loaded from smA into registers and one row of matrix-B block data is loaded from smB into registers; specifically, for instructions with multiple source registers, the source registers are allocated to different register banks.
In the method, the fused multiply-add instructions in step 4 access their operands in a mode that alternates single issue and dual issue.
In the method, step 4 further includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
In the method, a strategy of selecting memory access instructions of different bit widths is used when reading registers in step 4: data are read from global memory with the LDG.128 instruction, and from shared memory with the LDS.64 instruction.
The invention also proposes a single-precision matrix multiplication optimization system based on NVIDIA Kepler GPU assembly instructions, comprising:
a blocking module, for blocking the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing the <bm,bn>-dimensional output matrix block of C;
a creation module, for creating four temporary storage spaces smA, smB, smAx and smBx in GPU second-level storage;
a reading module, for reading an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB;
a computing module, for loading, in each iteration, one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, reading the register contents, performing the matrix multiplication with fused multiply-add instructions, and, while the multiplication is being performed, reading one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx;
an exchange module, for swapping the addresses of smA and smAx and the addresses of smB and smBx after the multiplication of smA and smB is finished.
In the single-precision matrix multiplication optimization system, the computing module includes a register bank conflict elimination module, for eliminating conflicts between registers when one column of matrix-A block data is loaded from smA into registers and one row of matrix-B block data is loaded from smB into registers; the register bank conflict elimination module allocates instructions with multiple source registers so that the source registers fall into different register banks.
In the system, the fused multiply-add instructions in the computing module access their operands in a mode that alternates single issue and dual issue.
In the system, the computing module further includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
In the system, when reading registers, the computing module reads data from global memory with the LDG.128 instruction and reads data from shared memory with the LDS.64 instruction.
From the above scheme, the advantages of the invention are: it avoids the register bank conflicts caused by FFMA instructions on Kepler GPUs; it reaches 97% instruction throughput; it utilizes the maximum bandwidth while avoiding shared-memory bank conflicts; and it reaches 88% of peak performance, about 15% more than the currently best Cublas.
Brief description of the drawings
Fig. 1 is a flow chart of matrix multiplication on the GPU using the double-buffering blocking technique;
Fig. 2 is a map of the register bank distribution on the GPU;
Fig. 3 is the register map of the 12x12 register block allocation;
Fig. 4 is a diagram of the dual-issue organization of FFMA instructions;
Fig. 5 is a diagram of the instruction computation order produced by dual-issue expansion under the 1-2-2-1 scheduling-block pattern.
Specific embodiment
Specifically, the overall flow involved in the present invention is shown in Fig. 1. The steps of the double-buffered matrix multiplication algorithm on the GPU are as follows:
Step 1: first block the original matrices according to bm (the column length of the matrix-A block) and bn (the row length of the matrix-B block), each block computing the <bm,bn>-dimensional output matrix block of C;
Step 2: create four temporary storage spaces smA, smB, smAx and smBx in shared memory (the second-level storage on the GPU);
Step 3: read an smA-sized sub-matrix of matrix A from global memory (the first-level storage on the GPU) into smA, and an smB-sized sub-matrix of matrix B into smB;
Step 4: in each iteration, load one column (matrix-A block data) from smA into registers and one row (matrix-B block data) from smB into registers, perform the matrix multiplication with FFMA (fused multiply-add) instructions, and, while the multiplication is being performed, read one column of the next smA and one row of the next smB from global memory into smAx and smBx;
Step 5: after the multiplication of smA and smB is finished, swap the addresses of smA and smAx, and swap the addresses of smB and smBx.
During optimization, the main issues to consider are the matrix blocking, register bank conflict elimination, dual issue of FFMA (fused multiply-add) instructions, and the use of vector instructions.
Register bank conflict elimination. When allocating registers for a sub-column of matrix A, a sub-row of matrix B and a sub-block of matrix C, three factors must be considered: correctness, absence of bank conflicts, and compact register indices. LDG.128 requires the register number to be 4-aligned: since NVIDIA GPUs have no 128-bit registers, four consecutive 32-bit registers RN, RN+1, RN+2 and RN+3 are combined to stand in for a 128-bit register. The present invention discovered an undocumented NVIDIA rule: N mod 4 (the remainder of dividing N by 4) must equal 0, otherwise an illegal-instruction error occurs at run time. This is understandable, since the alignment required by LDG.128 simplifies the hardware logic and further reduces power consumption. Given this alignment and N mod 4 = 0, the register bank distribution on the GPU can be seen from the following table.
The register bank conflict elimination step of the present invention is defined as follows:
Instructions with multiple source registers are allocated so that the source registers fall into different register banks. Specifically, for an instruction with n source registers, if those source registers are all located in different register banks, there is no register bank conflict.
For the instruction FFMA R0, R1, R4, R5 shown in Fig. 2, the source registers are distributed in bank1, bank2 and bank3, so there is no bank conflict.
Suppose a sub-column of A is allocated to [bank0, bank1] and a sub-row of B to [bank2, bank3]. Two choices then remain for C: [bank1 bank2; bank3 bank0] and [bank3 bank1; bank0 bank2]. Since the allocation patterns of A and B are interchangeable, there are 2x2 = 4 bank pattern choices in total, and these four patterns are equivalent in performance. The present invention arbitrarily chooses one of them: [bank0 bank1] for A, [bank2 bank3] for B, and [bank1 bank2; bank3 bank0] for C. Fig. 3 shows the register map of the 12x12 register block allocation; the numbers in the figure are register indices. It can be verified that for any Cij, the FFMA bank conflicts of Ai and Bj are completely eliminated. In addition, when allocating registers, the allocation is kept as compact as possible to avoid exceeding the 255-register limit.
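The two allocation rules above can be checked mechanically (an illustrative Python sketch; the function names `ffma_conflict_free` and `ldg128_operand_legal` are not part of the patent, and the bank mapping is passed in as a parameter since the patent's actual mapping comes from its table):

```python
def ffma_conflict_free(src_regs, bank_of):
    # An instruction has no register bank conflict iff all of its
    # source registers fall into different banks.
    banks = [bank_of(r) for r in src_regs]
    return len(set(banks)) == len(banks)

def ldg128_operand_legal(n):
    # LDG.128 combines four consecutive 32-bit registers RN..RN+3 into one
    # 128-bit operand; the undocumented rule found in the patent requires
    # N mod 4 == 0, otherwise an illegal-instruction error occurs.
    return n % 4 == 0
```

A register allocator for the 12x12 block could call `ffma_conflict_free` on every (Ai, Bj, Cij) triple to verify that the chosen bank pattern eliminates all FFMA conflicts.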
Dual issue of FFMA instructions. The Kepler architecture supports dual issue of FFMA instructions, which requires cooperation of software and hardware. Having the warp schedulers always dual-issue is unrealistic, because hardware is shared; single issue and dual issue must alternate. A Kepler SM has 4 warp schedulers, each assigned 32 cores, so the 4 schedulers consume 128 cores. A Kepler SM has 192 cores in total; the remaining 192 - 128 = 64 are split into 2 groups of 32 cores, each group shared by 2 warp schedulers. The best pattern is 1 group single-issuing (1 FFMA instruction), 2 groups dual-issuing (4 FFMA instructions) and 1 group single-issuing (1 FFMA instruction), organized as shown in Fig. 4, because two warps dual-issuing at the same time would compete for the shared resources and wait. Therefore, starting from the 3rd cycle, the four warps interlock with each other: each cycle issues 6 instructions, making good use of all the cores.
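The core accounting in this paragraph (192 cores, 4 schedulers, the 1-2-2-1 issue pattern) can be checked with a few lines (a sketch of the arithmetic only, not of the actual scheduling hardware; `kepler_sm_issue_accounting` is a name introduced here):

```python
def kepler_sm_issue_accounting():
    # Kepler SM: 192 cores, 4 warp schedulers with 32 dedicated cores
    # each, and the remaining cores split into shared groups of 32.
    cores_per_sm = 192
    schedulers = 4
    group = 32
    dedicated = schedulers * group                        # 128 cores
    shared_groups = (cores_per_sm - dedicated) // group   # 2 shared groups
    # 1-2-2-1 pattern: two schedulers single-issue (1 FFMA each) and
    # two schedulers dual-issue (2 FFMAs each).
    issue_pattern = [1, 2, 2, 1]
    instructions_per_cycle = sum(issue_pattern)           # 6 FFMAs/cycle
    cores_kept_busy = instructions_per_cycle * group      # one group per FFMA
    return instructions_per_cycle, cores_kept_busy, cores_per_sm
```

Six FFMA instructions per cycle, each feeding a 32-core group, is exactly what keeps all 192 cores of the SM busy.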
Fig. 5 shows the instruction computation order produced by dual-issue expansion of 7 instructions under the 1-2-2-1 scheduling-block pattern: the pairs of squares enclosed by ovals represent two dual-issued instructions, and the isolated squares represent single issue. The numbers in the squares indicate the computation order. The benefit of organizing the computation order this way is that register pressure is reduced through the register caching mechanism (the operand collector). The operand collector allows register operands to be cached and then used by subsequent instructions, reducing the number of register reads. Since there are only 4 register banks, two dual-issued FFMA instructions would read 6 registers, exceeding the number of register banks; the operand collector keeps the operands that the dual-issued FFMA instructions must access within 4 registers.
To maximize bandwidth utilization and avoid shared-memory bank conflicts, the invention also proposes a strategy for selecting memory access instructions of different bit widths (LDS and LDG). LDS, LDG, LDS.64, LDG.128 and so on are instructions that read data from storage: LDS reads from shared memory, LDG reads from global memory, and the trailing number indicates how many bits are read at once. According to the micro-benchmark tests of the present invention, LDG.128 is used to read data from global memory, since LDG.128 goes through the texture cache and has higher bandwidth; LDS.64 is used to read data from shared memory, since LDS.64 has higher bandwidth than LDS.128. The present invention also discovered experimentally that an LDS instruction starts reading its source registers only 2 clock cycles after issue, and that LDS.64 can be placed in a suitable position so that the source registers it reads have no bank conflicts with the FFMA instructions, whereas LDS.128 cannot. One conjectured reason is that the maximum LDS transaction is 256 bytes: with each thread reading 128 bits, the threads in a warp would read 512 bytes, exceeding the 256-byte limit and requiring several transactions, and it is unpredictable when the second transaction starts reading source registers. LDG.128 is used mainly because it has higher bandwidth and lower latency than LD.128: LDG goes through the texture cache, while on Kepler and later GPUs LD only goes through the L2 cache, not the L1 cache (the L1 cache is used only for local memory). Compared with LDG.32 and LDG.64, LDG.128 has higher bandwidth and reduces the number of load instructions.
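The transaction arithmetic behind the LDS.128 conjecture can be made explicit (an illustrative sketch; `warp_bytes` and `transactions_needed` are names introduced here, and the 256-byte transaction limit is the patent's experimental conjecture):

```python
def warp_bytes(bits_per_thread, warp_size=32):
    # Total bytes moved when every thread in a warp issues one load
    # of the given bit width.
    return warp_size * bits_per_thread // 8

def transactions_needed(bits_per_thread, max_transaction_bytes=256):
    # Ceiling division: how many shared-memory transactions one
    # warp-wide LDS of this width needs under the 256-byte limit.
    total = warp_bytes(bits_per_thread)
    return -(-total // max_transaction_bytes)
```

A warp-wide LDS.128 moves 512 bytes and thus splits into two transactions, while LDS.64 fits in one, which matches the observation that only LDS.64 can be scheduled to avoid bank conflicts with FFMA.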
The following is a system embodiment corresponding to the above method embodiment; this embodiment can be implemented in cooperation with the above embodiment. The relevant technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the relevant technical details mentioned in this embodiment are also applicable to the above embodiment.
The present invention also proposes a single-precision matrix multiplication optimization system based on NVIDIA Kepler GPU assembly instructions, comprising:
a blocking module, for blocking the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing the <bm,bn>-dimensional output matrix block of C;
a creation module, for creating four temporary storage spaces smA, smB, smAx and smBx in GPU second-level storage;
a reading module, for reading an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB;
a computing module, for loading, in each iteration, one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, reading the register contents, performing the matrix multiplication with fused multiply-add instructions, and, while the multiplication is being performed, reading one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx;
an exchange module, for swapping the addresses of smA and smAx and the addresses of smB and smBx after the multiplication of smA and smB is finished.
In the single-precision matrix multiplication optimization system, the computing module includes a register bank conflict elimination module, for eliminating conflicts between registers when one column of matrix-A block data is loaded from smA into registers and one row of matrix-B block data is loaded from smB into registers; the register bank conflict elimination module allocates instructions with multiple source registers so that the source registers fall into different register banks.
In the system, the fused multiply-add instructions in the computing module access their operands in a mode that alternates single issue and dual issue.
In the system, the computing module further includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
In the system, when reading registers, the computing module reads data from global memory with the LDG.128 instruction and reads data from shared memory with the LDS.64 instruction.
Other aspects and features of the invention will be apparent to those skilled in the art from the description of the specific embodiments with reference to the accompanying drawings. It should be noted that these embodiments are to be regarded as merely exemplary and not as limiting the invention. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the invention, but such changes and modifications shall all fall within the scope of the appended claims.

Claims (10)

1. A single-precision matrix multiplication optimization method, characterised in that the method is based on NVIDIA Kepler GPU assembly instructions and comprises:
step 1, blocking the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing the <bm,bn>-dimensional output matrix block of C;
step 2, creating four temporary storage spaces smA, smB, smAx and smBx in GPU second-level storage;
step 3, reading an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB;
step 4, loading, in each iteration, one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, reading the register contents, performing the matrix multiplication with fused multiply-add instructions, and, while the multiplication is being performed, reading one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx;
step 5, after the multiplication of smA and smB is finished, swapping the addresses of smA and smAx and the addresses of smB and smBx.
2. The single-precision matrix multiplication optimization method as claimed in claim 1, characterised in that step 4 includes a register bank conflict elimination step to eliminate conflicts between registers when one column of matrix-A block data is loaded from smA into registers and one row of matrix-B block data is loaded from smB into registers; the register bank conflict elimination step specifically allocates instructions with multiple source registers so that the source registers fall into different register banks.
3. The single-precision matrix multiplication optimization method as claimed in claim 1, characterised in that the fused multiply-add instructions in step 4 access their operands in a mode that alternates single issue and dual issue.
4. The single-precision matrix multiplication optimization method as claimed in claim 1, characterised in that step 4 further includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
5. The single-precision matrix multiplication optimization method as claimed in claim 1, characterised in that a strategy of selecting memory access instructions of different bit widths is used when reading registers in step 4, the method including reading data from global memory with the LDG.128 instruction and reading data from shared memory with the LDS.64 instruction.
6. A single-precision matrix multiplication optimization system, characterised in that the system is based on NVIDIA Kepler GPU assembly instructions and comprises:
a blocking module, for blocking the original matrices according to the column length bm of the matrix-A block and the row length bn of the matrix-B block, each block computing the <bm,bn>-dimensional output matrix block of C;
a creation module, for creating four temporary storage spaces smA, smB, smAx and smBx in GPU second-level storage;
a reading module, for reading an smA-sized sub-matrix of matrix A from GPU first-level storage into smA, and an smB-sized sub-matrix of matrix B into smB;
a computing module, for loading, in each iteration, one column of the matrix-A block from smA into registers and one row of the matrix-B block from smB into registers, reading the register contents, performing the matrix multiplication with fused multiply-add instructions, and, while the multiplication is being performed, reading one column of the next smA from GPU first-level storage into smAx and one row of the next smB into smBx;
an exchange module, for swapping the addresses of smA and smAx and the addresses of smB and smBx after the multiplication of smA and smB is finished.
7. The single-precision matrix multiplication optimization system as claimed in claim 6, characterised in that the computing module includes a register bank conflict elimination module, for eliminating conflicts between registers when one column of matrix-A block data is loaded from smA into registers and one row of matrix-B block data is loaded from smB into registers; the register bank conflict elimination module allocates instructions with multiple source registers so that the source registers fall into different register banks.
8. The single-precision matrix multiplication optimization system as claimed in claim 6, characterised in that the fused multiply-add instructions in the computing module access their operands in a mode that alternates single issue and dual issue.
9. The single-precision matrix multiplication optimization system as claimed in claim 6, characterised in that the computing module further includes a register caching mechanism, which caches the register operands for use by the fused multiply-add instructions.
10. The single-precision matrix multiplication optimization system as claimed in claim 6, characterised in that, when reading registers, the computing module reads data from global memory with the LDG.128 instruction and reads data from shared memory with the LDS.64 instruction.
CN201611260732.XA 2016-12-30 2016-12-30 Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction Pending CN106681694A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611260732.XA CN106681694A (en) 2016-12-30 2016-12-30 Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction


Publications (1)

Publication Number Publication Date
CN106681694A true CN106681694A (en) 2017-05-17

Family

ID=58848732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611260732.XA Pending CN106681694A (en) 2016-12-30 2016-12-30 Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction

Country Status (1)

Country Link
CN (1) CN106681694A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102323917A (en) * 2011-09-06 2012-01-18 中国人民解放军国防科学技术大学 Shared memory based method for realizing multiprocess GPU (Graphics Processing Unit) sharing
KR101400577B1 (en) * 2013-03-11 2014-06-19 한양대학교 산학협력단 Method for multiplication of sparse matrix on the gpu
CN104102513A (en) * 2014-07-18 2014-10-15 西北工业大学 Kepler-architecture based CUDA (compute unified device architecture) runtime parameter transparent-optimization method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUNJIE LAI et al.: "Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs", CGO '13: 2013 International Symposium on Code Generation and Optimization *
LI Xiaowen et al.: "Automatic optimization of matrix multiplication algorithms for cache-based GPUs", Modern Electronics Technique (《现代电子技术》) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109460533A (en) * 2017-09-06 2019-03-12 华为技术有限公司 Method and device for improving GEMM computing performance
CN109460533B (en) * 2017-09-06 2021-10-26 华为技术有限公司 Method and device for improving GEMM calculation performance
US10657442B2 (en) 2018-04-19 2020-05-19 International Business Machines Corporation Deep learning accelerator architecture with chunking GEMM
CN110147222A (en) * 2018-09-18 2019-08-20 北京中科寒武纪科技有限公司 Arithmetic unit and method
CN111610975A (en) * 2019-02-26 2020-09-01 深信服科技股份有限公司 Executable file type determination method, device, equipment and storage medium
CN110147248A (en) * 2019-04-19 2019-08-20 中国科学院计算技术研究所 Single-precision matrix multiplication optimization method and system accelerated using AMD GPU assembly instructions
CN110147248B (en) * 2019-04-19 2021-06-29 中国科学院计算技术研究所 Single-precision matrix multiplication optimization method and system using AMD GPU assembly instruction acceleration
CN113050988A (en) * 2019-12-27 2021-06-29 上海商汤智能科技有限公司 Data processing method and device
CN113805941A (en) * 2021-08-19 2021-12-17 贝式计算(天津)信息技术有限公司 System and method for accelerating application software by replacing instruction set
CN113805941B (en) * 2021-08-19 2023-12-12 贝式计算(天津)信息技术有限公司 System and method for accelerating application software by replacing instruction set
CN115617396A (en) * 2022-10-09 2023-01-17 上海燧原科技有限公司 Register allocation method and device applied to novel artificial intelligence processor
CN115617396B (en) * 2022-10-09 2023-08-29 上海燧原科技有限公司 Register allocation method and device applied to novel artificial intelligence processor

Similar Documents

Publication Publication Date Title
CN106681694A (en) Single-precision matrix multiplication optimization method and system based on NVIDIA Kepler GPU assembly instruction
CN107168683A (en) High-performance implementation method of GEMM dense matrix multiplication on the domestic Sunway 26010 many-core CPU
US20100115233A1 (en) Dynamically-selectable vector register partitioning
CN110337635A (en) System, method and apparatus for dot product operations
CN103902507B (en) Matrix multiplication computing device and method for a programmable algebra processor
CN109002659B (en) Fluid machinery simulation program optimization method based on super computer
CN117724763A (en) Apparatus, method and system for matrix operation accelerator instruction
US8615770B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
Romero et al. High performance implementations of the 2D Ising model on GPUs
CN105739951B (en) Fast GPU-based solution method for L1 minimization problems
CN103279379A (en) Methods and apparatus for scheduling instructions without instruction decode
CN110308982A (en) Shared memory multiplexing method and device
US20140257769A1 (en) Parallel algorithm for molecular dynamics simulation
US20210042261A1 (en) Data processing
CN106201870A (en) Method and device for testing a GPU
CN111459543B (en) Method for managing register file unit
Zubair et al. An optimized multicolor point-implicit solver for unstructured grid applications on graphics processing units
CN110414672B (en) Convolution operation method, device and system
CN115860066A (en) Batch-based neural network inference pipeline multiplexing method
Li et al. GPU matrix multiplication
Wei et al. Reconstructing permutation table to improve the Tabu Search for the PFSP on GPU
US8959497B1 (en) System and method for dynamically spawning thread blocks within multi-threaded processing systems
Rivera et al. Ism2: Optimizing irregular-shaped matrix-matrix multiplication on gpus
Charlton et al. Two-dimensional batch linear programming on the GPU
US20220188380A1 (en) Data processing method and apparatus applied to graphics processing unit, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170517