CN103150157B - GPU kernel program restructuring optimization method based on memory access divergence - Google Patents


Info

Publication number
CN103150157B
CN103150157B (application CN201310000459.7A; also published as CN103150157A)
Authority
CN
China
Prior art keywords
kernel
thread
memory access
memory
function
Prior art date
Legal status
Expired - Fee Related
Application number
CN201310000459.7A
Other languages
Chinese (zh)
Other versions
CN103150157A
Inventor
甘新标
刘杰
迟利华
晏益慧
徐涵
胡庆丰
王志英
苏博
朱琪
刘聪
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201310000459.7A
Publication of application CN103150157A
Application granted; publication of CN103150157B
Legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a GPU kernel program restructuring optimization method based on memory access divergence, whose object is to raise the execution efficiency of large-scale GPU Kernels and overall application performance. The technical scheme first uses the Create method to build a memory access behavior feature table, then uses the Record method to record the memory access trace of each thread in each Kernel function, then judges from the memory access addresses of the GPU threads in each Kernel function whether memory access divergence occurs among the threads of the same Kernel function, and finally performs GPU Kernel restructuring optimization based on memory access divergence, comprising two steps: splitting GPU Kernels based on memory access divergence, and fusing GPU Kernels with contiguous memory accesses. The invention solves the problem of low execution efficiency of large-scale multi-Kernel GPU applications, raising the execution efficiency of large-scale GPU Kernels and application performance.

Description

GPU kernel program restructuring optimization method based on memory access divergence
Technical field
The present invention relates to restructuring optimization methods for GPU kernel programs (GPU Kernels), and especially to a GPU Kernel restructuring optimization method based on memory access divergence.
Background technology
In recent years, the powerful computing capability, massively threaded concurrent execution model, and flexible programming model of GPUs (Graphics Processing Units) have made them widely used in many high-performance computing fields such as molecular dynamics simulation, biological gene analysis, and weather forecasting. Facing large-scale GPGPU (General-Purpose computing on Graphics Processing Units) applications, however, the standard kernel-program (Kernel) pattern cannot meet the demands of large-scale application programs.
A GPU kernel program (GPU Kernel) is a program segment that runs on the GPU. Programmers usually port the computation-intensive, time-consuming core subroutines of a program onto the GPU for acceleration; such core subroutines running on the GPU are commonly called GPU Kernels.
A large-scale GPGPU application may contain tens or even hundreds of GPU Kernels. To improve the scheduling efficiency of so many Kernels and exploit program parallelism to the greatest extent, multi-Kernel restructuring optimization has become an effective way to raise program efficiency. Current GPU Kernel restructuring optimization methods mainly include the following:
(1) Multi-Kernel merging based on Kernel concurrency. GPU architectures before NVIDIA's second-generation unified architecture did not support concurrent execution of multiple Kernels, so merging independent GPU Kernels was an effective way to let them execute concurrently. Although second-generation unified-architecture GPUs can execute multiple Kernels concurrently, the number of concurrent Kernels is still very limited. Multi-Kernel merging can therefore improve inter-Kernel concurrency, relieve the pressure of sequential Kernel execution, reduce Kernel launch overhead, and improve the operating efficiency of GPU programs.
(2) Multi-Kernel merging based on GPU shared memory. If there is a data dependence between GPU Kernels, i.e. the output of one Kernel is exactly the input of another, the Kernels with input/output data dependences can be merged into a single GPU Kernel that explicitly manages GPU shared memory as the intermediate data store, avoiding each Kernel's long-latency sequential accesses to global memory. This improves memory access efficiency and at the same time reduces the number of GPU Kernels, cutting launch overhead, easing Kernel scheduling, and improving GPU program efficiency.
(3) Multi-Kernel restructuring based on program branches. Unlike traditional CPU architectures, GPUs devote most on-chip resources to computation, leaving control and branch-prediction units relatively scarce. Avoiding thread branches inside a GPU Kernel is therefore vital for execution efficiency. Researchers have accordingly proposed separating the different program branches within one GPU Kernel into different GPU Sub-Kernels, and then merging the Sub-Kernels that share an execution path into a new GPU Kernel. Experiments show that branch-based Kernel restructuring effectively avoids the thread waiting and synchronization caused by GPU thread branches and significantly improves Kernel execution efficiency.
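Method (1) above, merging independent Kernels so that one launch covers both workloads, can be sketched in plain Python. This is a simulation under our own assumptions: real GPU Kernels would be device functions, and all names here are illustrative, not the patent's.

```python
# Hypothetical sketch of concurrency-based multi-Kernel merging: two independent
# "kernels", modelled as per-thread Python functions, are combined into one
# merged kernel that dispatches on the global thread id.

def kernel_a(tid, out):
    out[tid] = tid * 2          # first independent workload

def kernel_b(tid, out):
    out[tid] = tid + 100        # second independent workload

def merge(k1, n1, k2, n2):
    """Build one kernel covering k1's n1 threads followed by k2's n2 threads."""
    def merged(tid, out1, out2):
        if tid < n1:
            k1(tid, out1)       # first n1 threads run kernel_a's work
        elif tid < n1 + n2:
            k2(tid - n1, out2)  # remaining threads run kernel_b's work
    return merged, n1 + n2

merged, total = merge(kernel_a, 4, kernel_b, 4)
out1, out2 = [0] * 4, [0] * 4
for tid in range(total):        # one "launch" instead of two
    merged(tid, out1, out2)
print(out1, out2)  # [0, 2, 4, 6] [100, 101, 102, 103]
```

The design point is simply that a single launch amortizes startup cost over both workloads, which is the benefit the text attributes to multi-Kernel merging.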
The above three classes of GPU Kernel restructuring optimization methods, aimed at concrete large-scale GPU applications, can to some extent improve Kernel execution efficiency and application performance. However, they all ignore the influence of GPU thread memory access patterns on Kernel execution efficiency. In practice, the memory access behavior of GPU threads strongly affects Kernel execution efficiency. How to solve, in practical applications, the low GPU system efficiency and poor GPU application performance caused by thread memory access behavior is therefore an important technical problem of concern to those skilled in the art.
The memory access behavior of GPU threads falls into two classes:
(1) The threads in a Kernel access one contiguous region of memory. This is the idealized memory access behavior of GPU threads, under which GPU Kernel execution efficiency is highest;
(2) The memory accesses of the threads in a Kernel jump: the threads access disjoint memory regions, or access the same region non-contiguously.
Memory access behavior in which the threads of a Kernel jump in this way is referred to as GPU thread memory access divergence.
SIMD (Single Instruction Multiple Data) acceleration has become an effective way of raising architectural efficiency, and the GPU is a typical SIMD-accelerated architecture. The SIMD mode of operation assumes that GPU threads access one contiguous region of memory. If GPU thread accesses jump, SIMD not only fails to raise system efficiency and accelerate the application, but can seriously harm GPU system efficiency and application performance. Restructuring the GPU Kernels whose threads exhibit memory access divergence is thus a key technique for raising GPU system performance and application efficiency: eliminating thread memory access divergence through GPU Kernel restructuring can raise both to the greatest extent. No published literature yet covers a related technical scheme.
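The two behavior classes just described can be made concrete with a small sketch. The trace layout and function name are our own assumptions; the patent only specifies the four fields Tid, MemT, Size, and Addr.

```python
# Hypothetical sketch: deciding whether a kernel's per-thread accesses form one
# contiguous region (the ideal SIMD-friendly pattern, class 1) or exhibit
# divergence (jumps in address or mixed memory types, class 2).

def has_divergence(trace):
    """trace: list of (tid, mem_type, size, addr) tuples, ordered by tid."""
    for (t0, m0, s0, a0), (t1, m1, s1, a1) in zip(trace, trace[1:]):
        if m0 != m1:          # adjacent threads touch different memory types
            return True
        if a1 - a0 != s0:     # gap or overlap between consecutive accesses
            return True
    return False

contiguous = [(t, "Global", 4, 0x1000 + 4 * t) for t in range(8)]  # class 1
strided    = [(t, "Global", 4, 0x1000 + 8 * t) for t in range(8)]  # class 2
print(has_divergence(contiguous))  # False
print(has_divergence(strided))     # True
```

The strided trace leaves a 4-byte hole between neighbours, which is exactly the "jump" the patent classifies as divergence.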
Summary of the invention
The technical problem to be solved by the present invention is the low execution efficiency of large-scale multi-Kernel GPU applications. A GPU Kernel restructuring optimization method based on memory access divergence is proposed to raise the execution efficiency of large-scale GPU Kernels and application performance.
To solve the above technical problem, the concrete technical scheme of the present invention is:
First step: use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program. The concrete steps of the Create method are: establish one memory access behavior feature table per Kernel function of the GPU program; each table contains four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical address Addr of the memory accessed. Tid is the unique number of the thread within the scope of its Kernel function; MemT is the type of memory the thread accesses, one of global memory (Global), shared memory (SharedMemory), texture memory (TextureMemory), or constant memory (ConstantMemory); Size is the number of bytes of memory occupied by the data the thread accesses; Addr is the address of the data the thread's computation needs.
Second step: use the Record method to record the memory access trace of each thread in each Kernel function. The concrete steps of the Record method are:
2.1 Scan the GPU kernel program. Number the Kernel functions Kid = 0, 1, …, i, …, M−1, where 0 ≤ i < M and M is the number of Kernel functions in the GPU kernel program; let T_i be the number of threads launched by the Kernel function numbered i. There are then M memory access behavior feature tables in total, and the table corresponding to the Kernel function numbered i has T_i entries. Write the memory access trace information of the Kernel threads in the GPU program into the corresponding entries and fields of the tables as follows;
2.2 Initialize j = 0;
2.3 Obtain the number of threads T_j launched by the Kernel function numbered j, and initialize k = 0;
2.4 Write the memory access trace information of the k-th thread of the Kernel function numbered j into the fields Tid, MemT, Size and Addr of its feature table; update k = k + 1;
2.5 If k ≤ T_j − 1, go to 2.4; otherwise go to 2.6;
2.6 Update j = j + 1;
2.7 If j ≤ M − 1, go to 2.3; otherwise go to 2.8;
2.8 The memory access trace of every thread in every Kernel function has been recorded; execute the third step.
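The Record loops (2.2–2.8) can be sketched as nested iteration over Kernels and their threads. `get_trace` stands in for the real instrumentation, which the patent does not specify; the plain-list tables are likewise our own simplification.

```python
# Sketch of the Record method: for each Kernel j with T_j threads, append each
# thread's (Tid, MemT, Size, Addr) trace into the j-th feature table.

def record(tables, thread_counts, get_trace):
    j = 0
    while j <= len(tables) - 1:          # 2.3-2.7: loop over Kernel functions
        k = 0
        while k <= thread_counts[j] - 1: # 2.4-2.5: loop over this Kernel's threads
            tid, memt, size, addr = get_trace(j, k)
            tables[j].append({"Tid": tid, "MemT": memt, "Size": size, "Addr": addr})
            k += 1
        j += 1
    return tables

# Fake instrumentation: thread k of kernel j reads 4 bytes at 0x1000*j + 4*k.
fake = lambda j, k: (k, "Global", 4, 0x1000 * j + 4 * k)
tables = record([[], []], [4, 2], fake)
print([len(t) for t in tables])  # [4, 2]
```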
Third step: memory access divergence judgement. According to the memory access addresses of the GPU threads in each Kernel function, judge whether memory access divergence occurs among the threads of the same Kernel function. The judgement method is as follows:
3.1 Initialize j = 0;
3.2 Obtain the number of threads T_j launched by the Kernel function numbered j; define the set S_j of memory types accessed by this Kernel function as empty, i.e. S_j = ∅; define the divergence address set A_j as empty, i.e. A_j = ∅; initialize m = 0;
3.3 Query the feature table to obtain MemT_m, the memory type accessed by the m-th thread T_m. If MemT_m differs from MemT_{m+1}, the type accessed by thread T_{m+1}, judge that memory access divergence occurs between threads T_m and T_{m+1}, and add the diverging memory types to S_j, i.e. S_j = S_j ∪ {MemT_m, MemT_{m+1}}; execute 3.4. Otherwise execute 3.4 directly;
3.4 Query the feature table to obtain Addr_m, the logical memory address accessed by the m-th thread T_m. If the difference between Addr_{m+1}, the address accessed by the (m+1)-th thread, and Addr_m is not equal to Size_m, the size of the data accessed by thread T_m, judge that memory access divergence occurs between threads T_m and T_{m+1}, and add the two tuples formed by the diverging memory types and thread addresses, (MemT_m, Addr_m) and (MemT_{m+1}, Addr_{m+1}), to the set A_j; execute 3.5. Otherwise execute 3.5 directly;
3.5 Update m = m + 1. If m < T_j − 1, go to 3.3; otherwise the pairwise comparison of this Kernel's thread accesses is complete; execute 3.6;
3.6 Update j = j + 1;
3.7 If j ≤ M − 1, go to 3.2; otherwise no Kernel function remains to be judged for divergence; execute 3.8;
3.8 The divergence judgement of every Kernel function is complete. If S_j = ∅ and A_j = ∅ for every j, no memory access divergence exists; end the optimization directly and go to the fifth step. Otherwise perform GPU Kernel restructuring optimization: go to the fourth step.
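The step-3 judgement for one Kernel can be sketched as a walk over adjacent thread pairs, collecting diverging memory types into S_j and diverging (MemT, Addr) pairs into A_j. The tuple-based row layout is our own assumption.

```python
# Sketch of the divergence judgement (3.3-3.4) for a single feature table whose
# rows are (tid, memt, size, addr) tuples ordered by tid.

def judge(table):
    S, A = set(), set()
    for (t0, m0, s0, a0), (t1, m1, s1, a1) in zip(table, table[1:]):
        if m0 != m1:                 # 3.3: memory-type divergence
            S.update({m0, m1})
        if a1 - a0 != s0:            # 3.4: address-gap divergence
            A.update({(m0, a0), (m1, a1)})
    return S, A

table = [(0, "Global", 4, 0x100), (1, "Global", 4, 0x104),
         (2, "SharedMemory", 4, 0x200)]
S, A = judge(table)
print(sorted(S))   # ['Global', 'SharedMemory']
print(len(A))      # 2
```

Here threads 0 and 1 are contiguous, while thread 2 diverges in both memory type and address, so both sets become non-empty and step 4 would be triggered.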
Fourth step: GPU Kernel restructuring optimization based on memory access divergence. Restructuring the GPU Kernels in which divergence occurs mainly comprises two steps: splitting each diverging GPU Kernel into multiple sub-Kernels free of divergence, and fusing sub-Kernels with contiguous memory accesses into a new GPU Kernel.
4.1 Split GPU Kernels based on memory access divergence. Each GPU Kernel in which divergence occurs is split into multiple sub-Kernels free of divergence, as follows:
4.1.1 Initialize j = 0;
4.1.2 Obtain the number of elements Num_Sj of the set S_j; the Kernel function numbered j is to be split into Num_Sj sub-Kernels;
4.1.3 Query the feature table corresponding to the Kernel numbered j and collect the thread groups that access the same memory type, obtaining Num_Sj thread groups; organize each thread group as the thread block of one sub-Kernel;
4.1.4 Update j = j + 1;
4.1.5 If j ≤ M − 1, go to 4.1.2; otherwise execute 4.1.6;
4.1.6 The GPU Kernel splitting based on memory type is complete; go to 4.1.7;
4.1.7 Initialize j = 0;
4.1.8 If S_j is empty, i.e. S_j = ∅, go to 4.1.9; otherwise build memory access behavior feature tables for each Kernel function of the GPU program after splitting: go to the first step;
4.1.9 Update j = j + 1; if j ≤ M − 1, go to 4.1.8; otherwise go to 4.1.10;
4.1.10 Initialize j = 0;
4.1.11 Obtain the number of elements Num_Aj of the set A_j; the Kernel numbered j is to be split into Num_Aj sub-Kernels;
4.1.12 Query the feature table corresponding to the Kernel numbered j and, taking the elements of A_j as boundaries, split the Kernel numbered j into Num_Aj sub-Kernels;
4.1.13 Update j = j + 1;
4.1.14 If j ≤ M − 1, go to 4.1.11; otherwise go to 4.1.15;
4.1.15 Initialize j = 0;
4.1.16 If A_j is empty, i.e. A_j = ∅, go to 4.1.17; otherwise go to 4.1.11;
4.1.17 Update j = j + 1; if j ≤ M − 1, go to 4.1.16; otherwise go to 4.2.
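The memory-type split of 4.1.2–4.1.3 amounts to grouping a diverging Kernel's threads by the memory type they access, each group becoming one sub-Kernel. A sketch under the same tuple-row assumption as before:

```python
# Sketch of splitting by memory type: group rows (tid, memt, size, addr) by
# MemT; the number of groups is Num_Sj and each group seeds one sub-Kernel.
from itertools import groupby

def split_by_mem_type(table):
    key = lambda row: row[1]
    ordered = sorted(table, key=key)           # bring equal types together
    return {memt: list(rows) for memt, rows in groupby(ordered, key=key)}

table = [(0, "Global", 4, 0x100), (1, "SharedMemory", 4, 0x200),
         (2, "Global", 4, 0x104)]
subs = split_by_mem_type(table)
print(sorted(subs))         # ['Global', 'SharedMemory']
print(len(subs["Global"]))  # 2
```

Within each resulting group the memory type is uniform, so only address-gap divergence (handled by the A_j-based split of 4.1.11–4.1.12) can remain.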
4.2 Fuse the GPU Kernels with contiguous memory accesses. After Kernel splitting, the GPU kernel program may contain Kernel fragments that hurt GPU Kernel execution efficiency. Fusing threads that contiguously access the same memory into one larger GPU Kernel can significantly raise GPU system efficiency and application performance. The Kernels considered for fusion include both Kernels produced by splitting and Kernels that were not split: for example, if Kernel1 was split into Kernel11 and Kernel12 while Kernel2 was not split, fusion must still consider merging Kernel11, Kernel12 and Kernel2 into new Kernels. The concrete Kernel fusion method is as follows:
4.2.1 Use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program obtained from step 4.1, then use the Record method to record the memory access trace of each thread in each Kernel function;
4.2.2 Initialize j = 0;
4.2.3 Query the feature tables built in 4.2.1 to obtain the memory types accessed by the threads of the Kernels numbered j and j+1. If the memory type accessed by the threads of the Kernel numbered j (denoted Kernel_j) is identical to that of the Kernel numbered j+1 (denoted Kernel_{j+1}), go to 4.2.4; otherwise go to 4.2.6;
4.2.4 Query the feature tables to obtain the start and end addresses of the contiguous regions accessed by Kernel_j and Kernel_{j+1}, and the data size of the last thread access in each. If the end address of Kernel_j's region differs from the start address of Kernel_{j+1}'s region by the size of the last access in Kernel_j, or the end address of Kernel_{j+1}'s region differs from the start address of Kernel_j's region by the size of the last access in Kernel_{j+1}, judge that Kernel_j and Kernel_{j+1} access one contiguous region of memory; go to 4.2.5. Otherwise go to 4.2.6;
4.2.5 Fuse Kernel_j and Kernel_{j+1}: using the concurrency-based multi-Kernel merging method described in the background art, recombine the threads of Kernel_j and Kernel_{j+1} and organize them into one larger new GPU Kernel;
4.2.6 Update j = j + 1;
4.2.7 If j < M − 1, go to 4.2.3; otherwise execute the fifth step.
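The fusion test of 4.2.3–4.2.4 can be sketched as follows: two kernels qualify when they touch the same memory type and their address ranges abut, i.e. one region starts exactly one last-access-size past the other's end. Helper names and row layout are ours, not the patent's.

```python
# Sketch of the contiguity check gating fusion (4.2.3-4.2.4).

def can_fuse(a, b):
    """a, b: rows of (tid, memt, size, addr), each ordered by address."""
    if a[0][1] != b[0][1]:                    # 4.2.3: memory types must match
        return False
    end_a, size_a = a[-1][3], a[-1][2]        # end of a's region, last size
    end_b, size_b = b[-1][3], b[-1][2]
    # 4.2.4: regions abut in either order
    return (b[0][3] - end_a == size_a) or (a[0][3] - end_b == size_b)

k1 = [(t, "Global", 4, 0x100 + 4 * t) for t in range(4)]  # 0x100..0x10C
k2 = [(t, "Global", 4, 0x110 + 4 * t) for t in range(4)]  # starts right after k1
print(can_fuse(k1, k2))  # True
```

A kernel whose region starts at, say, 0x200 would fail the check against k1, since the gap exceeds the last access size, and would be left unfused per 4.2.6.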
Fifth step: end.
Adopting the present invention achieves the following technical effects:
1. GPU program memory access pressure is relieved, improving the memory access efficiency of the GPU system;
2. GPU applications run faster and the resource utilization of the GPU system is improved.
Brief description of the drawings
Fig. 1 shows the structure of the memory access behavior feature table.
Fig. 2 is the overall flow chart of GPU Kernel restructuring optimization based on memory access divergence.
Detailed description
Fig. 1 shows the structure of the memory access behavior feature table, which is established as follows:
Establish one memory access behavior feature table for each Kernel function of the GPU program; each table contains four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical address Addr of the memory accessed. Tid is the unique number of the thread within the scope of its Kernel function; MemT is the type of memory the thread accesses, one of global memory (Global), shared memory (SharedMemory), texture memory (TextureMemory), or constant memory (ConstantMemory); Size is the number of bytes of memory occupied by the data the thread accesses; Addr is the address of the data the thread's computation needs.
Fig. 2 is the overall flow chart of the present invention; the concrete implementation steps are as follows:
First step: build the memory access behavior feature tables.
Second step: record the memory access trace of each thread in each Kernel function.
Third step: memory access divergence judgement.
Fourth step: GPU Kernel restructuring optimization based on memory access divergence.
4.1 Split GPU Kernels based on memory access divergence.
4.2 Fuse GPU Kernels based on contiguous memory accesses.
Fifth step: end.

Claims (1)

1. A GPU kernel program restructuring optimization method based on memory access divergence, characterized by comprising the following steps:
First step: use the Create method to build memory access behavior feature tables. The concrete steps are: establish one memory access behavior feature table for each Kernel function of the GPU program; each table contains four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical address Addr of the memory accessed; Tid is the unique number of the thread within the scope of its Kernel function; MemT is the type of memory the thread accesses; Size is the number of bytes of memory occupied by the data the thread accesses; Addr is the address of the data the thread's computation needs;
Second step: use the Record method to record the memory access trace of each thread in each Kernel function. The concrete steps are:
2.1 Scan the GPU kernel program; number the Kernel functions Kid = 0, 1, …, i, …, M−1, where 0 ≤ i < M and M is the number of Kernel functions in the GPU kernel program; let T_i be the number of threads launched by the Kernel function numbered i; there are then M memory access behavior feature tables in total, and the table corresponding to the Kernel function numbered i has T_i entries; write the memory access trace information of the Kernel threads in the GPU program into the corresponding entries and fields of the tables;
2.2 Initialize j = 0;
2.3 Obtain the number of threads T_j launched by the Kernel function numbered j, and initialize k = 0;
2.4 Write the memory access trace information of the k-th thread of the Kernel function numbered j into the fields Tid, MemT, Size and Addr of its feature table; update k = k + 1;
2.5 If k ≤ T_j − 1, go to 2.4; otherwise go to 2.6;
2.6 Update j = j + 1;
2.7 If j ≤ M − 1, go to 2.3; otherwise go to 2.8;
2.8 The memory access trace of every thread in every Kernel function has been recorded; execute the third step;
Third step: according to the memory access addresses of the GPU threads in each Kernel function, judge whether memory access divergence occurs among the threads of the same Kernel function, as follows:
3.1 Initialize j = 0;
3.2 Obtain the number of threads T_j launched by the Kernel function numbered j; define the set S_j of memory types accessed by this Kernel function as empty, i.e. S_j = ∅; define the divergence address set A_j as empty, i.e. A_j = ∅; initialize m = 0;
3.3 Query the feature table to obtain MemT_m, the memory type accessed by the m-th thread T_m; if MemT_m differs from MemT_{m+1}, the type accessed by thread T_{m+1}, judge that memory access divergence occurs between threads T_m and T_{m+1}, and add the diverging memory types to S_j, i.e. S_j = S_j ∪ {MemT_m, MemT_{m+1}}; execute 3.4; otherwise execute 3.4 directly;
3.4 Query the feature table to obtain Addr_m, the logical memory address accessed by the m-th thread T_m; if the difference between Addr_{m+1} and Addr_m is not equal to Size_m, the size of the data accessed by thread T_m, judge that memory access divergence occurs between threads T_m and T_{m+1}, and add the tuples (MemT_m, Addr_m) and (MemT_{m+1}, Addr_{m+1}) to the set A_j; execute 3.5; otherwise execute 3.5 directly;
3.5 Update m = m + 1; if m < T_j − 1, go to 3.3; otherwise the pairwise comparison of this Kernel's thread accesses is complete; execute 3.6;
3.6 Update j = j + 1;
3.7 If j ≤ M − 1, go to 3.2; otherwise no Kernel function remains to be judged for divergence; execute 3.8;
3.8 The divergence judgement of every Kernel function is complete; if S_j = ∅ and A_j = ∅ for every j, go to the fifth step; otherwise execute the fourth step;
Fourth step: GPU Kernel restructuring optimization based on memory access divergence:
4.1 Split each GPU Kernel in which divergence occurs into multiple sub-Kernels free of divergence, as follows:
4.1.1 Initialize j = 0;
4.1.2 Obtain the number of elements Num_Sj of the set S_j; the Kernel function numbered j is to be split into Num_Sj sub-Kernels;
4.1.3 Query the feature table corresponding to the Kernel numbered j and collect the thread groups that access the same memory type, obtaining Num_Sj thread groups; organize each thread group as the thread block of one sub-Kernel;
4.1.4 Update j = j + 1;
4.1.5 If j ≤ M − 1, go to 4.1.2; otherwise execute 4.1.6;
4.1.6 The GPU Kernel splitting based on memory type is complete; go to 4.1.7;
4.1.7 Initialize j = 0;
4.1.8 If S_j is empty, i.e. S_j = ∅, go to 4.1.9; otherwise go to the first step;
4.1.9 Update j = j + 1; if j ≤ M − 1, go to 4.1.8; otherwise go to 4.1.10;
4.1.10 Initialize j = 0;
4.1.11 Obtain the number of elements Num_Aj of the set A_j; the Kernel numbered j is to be split into Num_Aj sub-Kernels;
4.1.12 Query the feature table corresponding to the Kernel numbered j and, taking the elements of A_j as boundaries, split the Kernel numbered j into Num_Aj sub-Kernels;
4.1.13 Update j = j + 1;
4.1.14 If j ≤ M − 1, go to 4.1.11; otherwise go to 4.1.15;
4.1.15 Initialize j = 0;
4.1.16 If A_j is empty, i.e. A_j = ∅, go to 4.1.17; otherwise go to 4.1.11;
4.1.17 Update j = j + 1; if j ≤ M − 1, go to 4.1.16; otherwise go to 4.2;
4.2 Fuse the GPU Kernels with contiguous memory accesses, as follows:
4.2.1 Use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program obtained from step 4.1, then use the Record method to record the memory access trace of each thread in each Kernel function;
4.2.2 Initialize j = 0;
4.2.3 Query the feature tables built in 4.2.1 to obtain the memory types accessed by the threads of the Kernels numbered j and j+1; if the memory type accessed by the threads of the Kernel numbered j (denoted Kernel_j) is identical to that of the Kernel numbered j+1 (denoted Kernel_{j+1}), go to 4.2.4; otherwise go to 4.2.6;
4.2.4 Query the feature tables to obtain the start and end addresses of the contiguous regions accessed by Kernel_j and Kernel_{j+1}, and the data size of the last thread access in each; if the end address of Kernel_j's region differs from the start address of Kernel_{j+1}'s region by the size of the last access in Kernel_j, or the end address of Kernel_{j+1}'s region differs from the start address of Kernel_j's region by the size of the last access in Kernel_{j+1}, judge that Kernel_j and Kernel_{j+1} access one contiguous region of memory; go to 4.2.5; otherwise go to 4.2.6;
4.2.5 Fuse Kernel_j and Kernel_{j+1}: using the concurrency-based multi-Kernel merging method, recombine the threads of Kernel_j and Kernel_{j+1} and organize them into one new GPU Kernel;
4.2.6 Update j = j + 1;
4.2.7 If j < M − 1, go to 4.2.3; otherwise execute the fifth step;
Fifth step: end.
CN201310000459.7A 2013-01-03 2013-01-03 GPU kernel program restructuring optimization method based on memory access divergence Expired - Fee Related CN103150157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310000459.7A CN103150157B (en) 2013-01-03 2013-01-03 GPU kernel program restructuring optimization method based on memory access divergence


Publications (2)

Publication Number Publication Date
CN103150157A CN103150157A (en) 2013-06-12
CN103150157B true CN103150157B (en) 2015-11-25

Family

ID=48548259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310000459.7A Expired - Fee Related CN103150157B (en) 2013-01-03 2013-01-03 GPU kernel program restructuring optimization method based on memory access divergence

Country Status (1)

Country Link
CN (1) CN103150157B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199782B (en) * 2014-08-25 2017-04-26 浙江大学城市学院 GPU memory access method
CN107291537A (en) * 2017-06-07 2017-10-24 江苏海平面数据科技有限公司 The optimization method that memory space is used on a kind of GPU pieces
CN109725903A (en) * 2017-10-30 2019-05-07 华为技术有限公司 Program code transform process method, device and compiling system
CN109783222A (en) * 2017-11-15 2019-05-21 杭州华为数字技术有限公司 A kind of method and apparatus for eliminating branch's disagreement

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
甘新标 et al., "Elliptic curve cryptography streaming technique for many-core GPU architectures," Journal of Sichuan University (Engineering Science Edition), 2011, vol. 43, no. 2, pp. 98-102. *
马安国, "Research on key techniques of high-efficiency GPGPU architectures," China Doctoral Dissertations Full-text Database, 2012, no. 3, pp. I138-18. *

Also Published As

Publication number Publication date
CN103150157A (en) 2013-06-12

Similar Documents

Publication Publication Date Title
US11436400B2 (en) Optimization method for graph processing based on heterogeneous FPGA data streams
CN103150265B Fine-grained data distribution method for heterogeneous on-chip memories in embedded systems
CN109002659B (en) Fluid machinery simulation program optimization method based on super computer
CN102193830B (en) Many-core environment-oriented division mapping/reduction parallel programming model
CN103150157B (en) Based on the GPU kernel program restructuring optimization method of memory access difference
CN103761215B (en) Matrix transpose optimization method based on graphic process unit
US20140143570A1 (en) Thread consolidation in processor cores
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
JP2017091589A (en) Processor core and processor system
CN102253921B (en) Dynamic reconfigurable processor
CN110750265B (en) High-level synthesis method and system for graph calculation
US20210373799A1 (en) Method for storing data and method for reading data
CN104915213A (en) Partial reconfiguration controller of reconfigurable system
CN103699656A (en) GPU-based mass-multimedia-data-oriented MapReduce platform
CN102253919A (en) Parallel numerical simulation method and system based on GPU and CPU cooperative operation
CN104317770A (en) Data storage structure and data access method for multiple core processing system
CN114970294B (en) Three-dimensional strain simulation PCG parallel optimization method and system based on Shenwei architecture
CN112947870A (en) G-code parallel generation method of 3D printing model
CN102591787A (en) Method and device for data processing of JAVA card
CN112463739A (en) Data processing method and system based on ocean mode ROMS
CN106971369B (en) Data scheduling and distributing method based on GPU (graphics processing Unit) for terrain visual field analysis
CN102236632B (en) Method for hierarchically describing configuration information of dynamic reconfigurable processor
WO2021070303A1 (en) Computation processing device
CN101923386B (en) Method and device for reducing CPU power consumption and low power consumption CPU
CN102200961B (en) Expansion method of sub-units in dynamically reconfigurable processor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151125

Termination date: 20220103