CN103150157A - Memory access bifurcation-based GPU (Graphics Processing Unit) kernel program recombination optimization method - Google Patents


Info

Publication number
CN103150157A
CN103150157A · CN2013100004597A · CN201310000459A
Authority
CN
China
Prior art keywords
kernel
thread
memory access
gpu
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100004597A
Other languages
Chinese (zh)
Other versions
CN103150157B (en
Inventor
甘新标
刘杰
迟利华
晏益慧
徐涵
胡庆丰
王志英
苏博
朱琪
刘聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201310000459.7A priority Critical patent/CN103150157B/en
Publication of CN103150157A publication Critical patent/CN103150157A/en
Application granted granted Critical
Publication of CN103150157B publication Critical patent/CN103150157B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a memory-access-divergence-based GPU (Graphics Processing Unit) kernel program recombination optimization method, which aims to improve the execution efficiency of large-scale GPU Kernels and overall application performance. The technical scheme is as follows: a memory access behavior feature table is constructed with a Create method; the memory access trace of each thread in each Kernel function is recorded with a Record method; next, the memory access addresses of the GPU threads in each Kernel function are examined to judge whether memory access divergence occurs among the threads of the same Kernel function; Kernel recombination optimization is then performed on the divergent GPU Kernels in two steps: splitting divergent GPU Kernels into divergence-free sub-Kernels, and fusing sub-Kernels with contiguous memory accesses into new GPU Kernels. The method addresses the low execution efficiency of large-scale GPU Kernel applications and improves both Kernel execution efficiency and application performance.

Description

A GPU kernel program recombination optimization method based on memory access divergence
Technical field
The present invention relates to GPU kernel program (GPU Kernel) recombination optimization methods, and in particular to a GPU Kernel recombination optimization method based on memory access divergence.
Background technology
In recent years, the powerful computing capability, massively threaded execution model, and flexible programming model of the GPU (Graphics Processing Unit) have led to its wide adoption in high-performance computing fields such as molecular dynamics simulation, bioinformatics analysis, and weather forecasting. When mapping large-scale GPGPU (General-Purpose computing on Graphics Processing Units) applications, however, the standard core program (Kernel) pattern cannot satisfy the demands of large-scale application programs.
A GPU kernel program (GPU Kernel) is a program segment that runs on the GPU. Programmers typically port the compute-intensive, time-consuming core subroutines of a program to the GPU for acceleration; such core subroutines running on the GPU are commonly called GPU Kernels.
A large-scale GPGPU application may contain tens or even hundreds of GPU Kernels. To improve the scheduling efficiency of so many GPU Kernels and exploit program concurrency to the greatest extent, multi-Kernel recombination optimization has become an effective way to improve program execution efficiency. Current GPU Kernel recombination optimization methods mainly fall into the following categories:
(1) Multi-Kernel merging for Kernel-level concurrency. GPU architectures before NVIDIA's second-generation unified shader architecture do not support concurrent execution of multiple Kernels, so merging multiple independent GPU Kernels is an effective way to let them execute concurrently. Although second-generation unified shader GPUs can execute multiple Kernels concurrently, the number of concurrent Kernels is still very limited. Multi-Kernel merging can therefore improve concurrency among Kernels, relieve the pressure of sequential Kernel execution, reduce Kernel launch overhead, and improve the execution efficiency of GPU programs.
(2) Multi-Kernel merging based on GPU shared memory. If there is a data dependence between GPU Kernels, i.e. the output of one Kernel is the input of another, the long-latency global memory accesses of each Kernel can be avoided by merging the dependent Kernels into one independent GPU Kernel and explicitly managing GPU shared memory as the data-transfer medium inside it. This avoids global memory accesses and improves memory access efficiency, while also reducing the number of GPU Kernels and their launch overhead, benefiting Kernel scheduling, and improving GPU program execution efficiency.
(3) Multi-Kernel recombination based on program branches. Unlike traditional CPU architectures, the GPU devotes most of its on-chip resources to computation, so its control logic and branch-prediction hardware are comparatively weak. Avoiding thread branches inside a GPU Kernel is therefore crucial for its execution efficiency. Researchers have proposed separating the distinct program branches within one GPU Kernel into different GPU Sub-Kernels and then merging the Sub-Kernels with identical execution paths into new GPU Kernels. Experiments show that branch-based GPU Kernel recombination effectively avoids the thread waiting and synchronization caused by GPU thread branches and significantly improves the execution efficiency of GPU Kernels.
The three classes of GPU Kernel recombination optimization above improve, to some extent, the execution efficiency of GPU Kernels and the performance of specific large-scale GPU applications. However, they all ignore the influence of GPU thread memory access patterns on the execution efficiency of GPU Kernels. In practice, the memory access behavior of GPU threads strongly affects Kernel execution efficiency. How to solve, for practical applications, the low GPU system efficiency and poor application performance caused by the memory access behavior of GPU threads is therefore an important technical problem of concern to those skilled in the art.
The memory access behavior of GPU threads can be divided into the following two classes:
(1) The threads of a Kernel access one contiguous region of storage. This is the idealized memory access behavior of GPU threads, under which a GPU Kernel executes most efficiently;
(2) Jumps appear in the memory accesses of the threads of a Kernel, i.e. different threads access different storage spaces, or the threads access discontinuous storage units within the same storage space.
The behavior in which jumps appear in the memory accesses of Kernel threads is referred to as GPU thread memory access divergence.
SIMD (Single Instruction Multiple Data) acceleration has become an effective way to improve architectural efficiency, and the GPU is a typical SIMD-accelerated architecture. The SIMD execution model relies on GPU threads accessing one contiguous region of storage. If jumps appear in the memory accesses of GPU threads, the system not only fails to accelerate GPU applications but also suffers seriously degraded GPU system efficiency and application execution efficiency. Recombining and optimizing the GPU threads that exhibit memory access divergence is a key technique for improving GPU system performance and application execution efficiency. Eliminating GPU thread memory access divergence through GPU Kernel recombination can therefore maximize GPU system efficiency and application execution efficiency. At present, no published literature addresses such a technical scheme.
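As a concrete illustration of the two behavior classes (with hypothetical addresses, not taken from the patent), one can list the byte address each thread touches: in the contiguous class, neighbouring threads are exactly one element apart, while in the divergent class the addresses jump.

```python
# Hypothetical illustration of the two memory access behavior classes.
# Each thread t reads one 4-byte element; addr(t) is its byte address.

ELEM = 4  # bytes per element (assumed)

# Class (1): contiguous -- thread t accesses base + t*ELEM.
contiguous = [0x1000 + t * ELEM for t in range(8)]

# Class (2): divergent -- threads jump with a stride of 8 elements.
divergent = [0x1000 + t * 8 * ELEM for t in range(8)]

def has_jump(addrs, size=ELEM):
    """True if any two neighbouring threads are not exactly `size` bytes apart."""
    return any(b - a != size for a, b in zip(addrs, addrs[1:]))

print(has_jump(contiguous))  # False: idealized pattern, Kernel runs best
print(has_jump(divergent))   # True: GPU thread memory access divergence
```

The contiguous pattern is exactly the one the SIMD execution model rewards; the divergent one is what the method below detects and eliminates.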
Summary of the invention
The technical problem to be solved by the present invention is the low execution efficiency of large-scale multi-Kernel GPU applications. A GPU Kernel recombination optimization method based on memory access divergence is proposed to improve the execution efficiency of large-scale GPU Kernels and overall application performance.
In order to solve the above technical problem, the concrete technical scheme of the present invention is as follows:
First step: use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program. The concrete steps of the Create method are: for each Kernel function of the GPU program, establish a memory access behavior feature table containing four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical storage address Addr accessed. The thread number Tid is the unique number of the thread within the scope of this Kernel function; the memory type MemT denotes the kind of memory accessed by the thread, which may be global memory (Global Memory), shared memory (Shared Memory), texture memory (Texture Memory), or constant memory (Constant Memory); the data size Size is the number of bytes of storage occupied by the data the thread accesses; the logical storage address Addr identifies the address space holding the data the thread's computation needs.
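A minimal sketch of the Create step, assuming a Python dict-of-lists representation of one table (the field names Tid, MemT, Size, and Addr come from the patent; everything else here is illustrative):

```python
# One memory access behavior feature table per Kernel function.
# Each entry records one thread's access: (Tid, MemT, Size, Addr).

# The four memory types named in the description (informational constant).
MEM_TYPES = {"Global", "Shared", "Texture", "Constant"}

def create_feature_table():
    """Create an empty memory access behavior feature table with the four fields."""
    return {"Tid": [], "MemT": [], "Size": [], "Addr": []}

def create_tables(num_kernels):
    """Build one feature table for each of the M Kernel functions."""
    return [create_feature_table() for _ in range(num_kernels)]

tables = create_tables(3)    # M = 3 hypothetical Kernel functions
print(len(tables))           # 3
print(sorted(tables[0]))     # ['Addr', 'MemT', 'Size', 'Tid']
```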
Second step: use the Record method to record the memory access trace of each thread in each Kernel function. The concrete steps of the Record method are:
2.1 Scan the GPU kernel program. Number the Kernel functions Kid = 0, 1, …, i, …, M−1, where 0 ≤ i < M and M is the number of Kernel functions in the GPU kernel program; let T_i be the number of threads launched by the Kernel function numbered i. There are M memory access behavior feature tables in total, and the table corresponding to the Kernel function numbered i has T_i entries. The memory access trace information of each Kernel thread in the GPU program is written into the corresponding entry and fields of its feature table;
2.2 Initialize j = 0;
2.3 Obtain the number of threads T_j launched by the Kernel function numbered j, and initialize k = 0;
2.4 Write the memory access trace information of the k-th thread of the Kernel function numbered j into the fields Tid, MemT, Size, and Addr of the feature table; update k = k + 1;
2.5 If k ≤ T_j − 1, go to 2.4; otherwise go to 2.6;
2.6 Update j = j + 1;
2.7 If j ≤ M − 1, go to 2.3; otherwise go to 2.8;
2.8 The memory access trace of every thread in every Kernel function has been recorded; proceed to the third step.
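Steps 2.1–2.8 amount to a double loop over Kernels and threads. A sketch under the same illustrative dict-of-lists representation (the per-thread trace function `trace_of` is a hypothetical stand-in for whatever instrumentation supplies the real access traces):

```python
def record(tables, thread_counts, trace_of):
    """Record the memory access trace of every thread of every Kernel.

    tables:        list of M feature tables (dicts with Tid/MemT/Size/Addr lists)
    thread_counts: T_j, the number of threads launched by Kernel j
    trace_of:      trace_of(j, k) -> (mem_type, size, addr) for thread k of Kernel j
    """
    for j, table in enumerate(tables):        # 2.2/2.6/2.7: loop over Kernel functions
        for k in range(thread_counts[j]):     # 2.3/2.4/2.5: loop over threads
            mem_type, size, addr = trace_of(j, k)
            table["Tid"].append(k)
            table["MemT"].append(mem_type)
            table["Size"].append(size)
            table["Addr"].append(addr)

# Hypothetical program: one Kernel, 4 threads, contiguous 4-byte accesses.
tables = [{"Tid": [], "MemT": [], "Size": [], "Addr": []}]
record(tables, [4], lambda j, k: ("Global", 4, 0x1000 + 4 * k))
print(tables[0]["Addr"])  # [4096, 4100, 4104, 4108]
```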
Third step: memory access divergence judgment. According to the memory access addresses of the GPU threads in each Kernel function, judge whether memory access divergence occurs among the threads of the same Kernel function. The judgment method is as follows:
3.1 Initialize j = 0;
3.2 Obtain the number of threads T_j launched by the Kernel function numbered j; define the set S_j of memory types accessed divergently by this Kernel function as empty, i.e. S_j = ∅, and the divergence address set A_j as empty, i.e. A_j = ∅; initialize m = 0;
3.3 Query the memory access behavior feature table for the memory type MemT_{T_m} accessed by thread T_m, where MemT_{T_m} denotes the memory type accessed by the m-th thread T_m. If MemT_{T_m} differs from the memory type MemT_{T_{m+1}} accessed by thread T_{m+1}, judge that memory access divergence occurs between thread T_m and thread T_{m+1}, and add the diverging memory types to the set S_j, i.e. S_j = S_j ∪ {MemT_{T_m}} and S_j = S_j ∪ {MemT_{T_{m+1}}}; then perform 3.4. Otherwise, directly perform 3.4;
3.4 Query the memory access behavior feature table for the logical storage address Addr_{T_m} accessed by thread T_m, where Addr_{T_m} denotes the logical storage address accessed by the m-th thread T_m. If the difference between Addr_{T_m} and the logical storage address Addr_{T_{m+1}} accessed by the (m+1)-th thread is not equal to the size of the data accessed by thread T_m, judge that memory access divergence occurs between thread T_m and thread T_{m+1}, and merge the two-tuples of diverging memory type and thread address, (MemT_{T_m}, Addr_{T_m}) and (MemT_{T_{m+1}}, Addr_{T_{m+1}}), successively into the set A_j, i.e. A_j = A_j ∪ {(MemT_{T_m}, Addr_{T_m})} and A_j = A_j ∪ {(MemT_{T_{m+1}}, Addr_{T_{m+1}})}; then perform 3.5. Otherwise, directly perform 3.5;
3.5 Update m = m + 1; if m < T_j − 1, go to 3.3; otherwise the comparison of thread memory types for this Kernel function has ended, and 3.6 is performed;
3.6 Update j = j + 1;
3.7 If j ≤ M − 1, go to 3.2; otherwise no Kernel function remains to be judged for memory access divergence, and 3.8 is performed.
3.8 The divergence judgment of every Kernel function is complete. If S_j = ∅ and A_j = ∅ for every j, no memory access divergence exists; the optimization ends directly, and the fifth step is taken. Otherwise, GPU Kernel recombination optimization is performed: go to the fourth step.
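The third step compares each pair of neighbouring threads: a memory-type mismatch feeds the set S_j (step 3.3), and an address gap different from the access size feeds the pair set A_j (step 3.4). A sketch over the illustrative table representation used above:

```python
def judge_divergence(table):
    """Return (S, A) for one Kernel's feature table.

    S: memory types involved in type divergence (step 3.3).
    A: (MemT, Addr) two-tuples at address divergence points (step 3.4).
    """
    S, A = set(), set()
    n = len(table["Tid"])
    for m in range(n - 1):                                   # 3.5: m runs to T_j - 1
        if table["MemT"][m] != table["MemT"][m + 1]:         # 3.3: type mismatch
            S |= {table["MemT"][m], table["MemT"][m + 1]}
        if table["Addr"][m + 1] - table["Addr"][m] != table["Size"][m]:  # 3.4: gap
            A.add((table["MemT"][m], table["Addr"][m]))
            A.add((table["MemT"][m + 1], table["Addr"][m + 1]))
    return S, A

# Hypothetical Kernel: threads 0-1 contiguous in Global, thread 2 jumps to Shared.
table = {"Tid": [0, 1, 2],
         "MemT": ["Global", "Global", "Shared"],
         "Size": [4, 4, 4],
         "Addr": [0x1000, 0x1004, 0x2000]}
S, A = judge_divergence(table)
print(sorted(S))                 # ['Global', 'Shared']
print(("Shared", 0x2000) in A)   # True: the jump point enters A_j
```

An empty S and A for every Kernel corresponds to the early exit in step 3.8.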
Fourth step: GPU Kernel recombination optimization based on memory access divergence. Recombining the GPU Kernels in which memory access divergence occurs mainly comprises two steps: splitting each divergent GPU Kernel into several divergence-free sub-Kernels, and fusing the sub-Kernels with contiguous memory accesses into new GPU Kernels.
4.1 GPU Kernel splitting based on memory access divergence. Each GPU Kernel in which memory access divergence occurs is split into several divergence-free sub-Kernels, as follows:
4.1.1 Initialize j = 0;
4.1.2 Obtain the number of elements |S_j| of the set S_j; that is, the Kernel numbered j is to be split into |S_j| sub-Kernels;
4.1.3 Query the memory access behavior feature table of the Kernel numbered j and collect the thread groups that access the same memory type, obtaining |S_j| thread groups; organize each thread group as the thread block of one sub-Kernel;
4.1.4 Update j = j + 1;
4.1.5 If j ≤ M − 1, go to 4.1.2; otherwise perform 4.1.6;
4.1.6 The GPU Kernel splitting based on memory type is complete; go to 4.1.7.
4.1.7 Initialize j = 0;
4.1.8 If S_j is empty, i.e. S_j = ∅, go to 4.1.9; otherwise build a memory access behavior feature table for each Kernel function of the GPU program obtained after splitting, and return to the first step;
4.1.9 Update j = j + 1; if j ≤ M − 1, go to 4.1.8; otherwise go to 4.1.10;
4.1.10 Initialize j = 0;
4.1.11 Obtain the number of elements |A_j| of the set A_j; the Kernel numbered j is to be split into |A_j| sub-Kernels;
4.1.12 Query the memory access behavior feature table of the Kernel numbered j and, taking the elements of A_j as boundaries, split the Kernel numbered j into |A_j| sub-Kernels;
4.1.13 Update j = j + 1;
4.1.14 If j ≤ M − 1, go to 4.1.11; otherwise go to 4.1.15;
4.1.15 Initialize j = 0;
4.1.16 If A_j is empty, i.e. A_j = ∅, go to 4.1.17; otherwise go to 4.1.11;
4.1.17 Update j = j + 1; if j ≤ M − 1, go to 4.1.16; otherwise go to 4.2.
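The type-based split of step 4.1 groups a divergent Kernel's threads by accessed memory type, and each group becomes the thread block of one sub-Kernel. A sketch of the grouping alone (a real implementation would also have to clone the Kernel code for each group):

```python
from collections import defaultdict

def split_by_memory_type(table):
    """Split one divergent Kernel's threads into |S_j| groups by memory type.

    Returns {mem_type: [thread ids]}; each group would be organized as the
    thread block of one divergence-free sub-Kernel (step 4.1.3).
    """
    groups = defaultdict(list)
    for tid, mem_type in zip(table["Tid"], table["MemT"]):
        groups[mem_type].append(tid)
    return dict(groups)

# Hypothetical divergent Kernel: threads alternate between Global and Shared.
table = {"Tid": [0, 1, 2, 3],
         "MemT": ["Global", "Shared", "Global", "Shared"],
         "Size": [4] * 4,
         "Addr": [0x1000, 0x2000, 0x1004, 0x2004]}
subkernels = split_by_memory_type(table)
print(len(subkernels))       # 2 sub-Kernels, one per memory type in S_j
print(subkernels["Global"])  # [0, 2]
```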
4.2 Fusion of GPU Kernels with contiguous memory accesses. After Kernel splitting, the GPU kernel program may contain Kernel fragments that hurt GPU Kernel execution efficiency. Fusing threads with contiguous accesses to the same memory into a new, larger GPU Kernel can significantly improve GPU system efficiency and application performance. The Kernels considered for fusion include both the Kernels produced by splitting and the Kernels that were never split: for example, if Kernel1 has been split into Kernel11 and Kernel12 while Kernel2 was not split, the fusion must consider merging Kernel11, Kernel12, and Kernel2 into new Kernels. The concrete Kernel fusion method is as follows:
4.2.1 Use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program obtained after step 4.1, then use the Record method to record the memory access trace of each thread in each Kernel function;
4.2.2 Initialize j = 0;
4.2.3 Query the feature tables built in 4.2.1 for the memory types accessed by the threads of the Kernels numbered j and j+1. If the memory type accessed by the threads of the Kernel numbered j (denoted Kernel j) is identical to that of the Kernel numbered j+1 (denoted Kernel j+1), go to 4.2.4; otherwise go to 4.2.6;
4.2.4 Query the feature tables for the start and end addresses of the contiguous regions accessed by Kernel j and Kernel j+1, and for the data size accessed by the last thread. If the start address of Kernel j+1 differs from the end address of the contiguous region accessed by Kernel j by exactly the data size accessed by the last thread of Kernel j, or the start address of Kernel j likewise differs from the end address of the region accessed by Kernel j+1 by the data size accessed by its last thread, judge that Kernel j and Kernel j+1 access one contiguous storage region and go to 4.2.5; otherwise go to 4.2.6;
4.2.5 Fuse Kernel j and Kernel j+1. The concrete operation is: use the concurrency-oriented multi-Kernel merging method described in the background to re-merge the threads of Kernel j and Kernel j+1 and organize them into one larger new GPU Kernel;
4.2.6 Update j = j + 1;
4.2.7 If j < M − 1, go to 4.2.3; otherwise proceed to the fifth step.
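The fusion test of steps 4.2.3–4.2.4 checks that two Kernels use the same memory type and that one accessed region begins exactly where the other ends. A sketch, assuming each candidate Kernel is summarized by its memory type, region start/end, and last-thread access size (all field names here are illustrative):

```python
def can_fuse(k1, k2):
    """Steps 4.2.3-4.2.4: may Kernel j and Kernel j+1 be fused?

    k1, k2: dicts with keys 'mem_type', 'start', 'end', 'last_size'
    (end = address of the region's last access; last_size = its size in bytes).
    """
    if k1["mem_type"] != k2["mem_type"]:     # 4.2.3: memory types must match
        return False
    # 4.2.4: the two regions are contiguous, in either order.
    return (k2["start"] - k1["end"] == k1["last_size"] or
            k1["start"] - k2["end"] == k2["last_size"])

a = {"mem_type": "Global", "start": 0x1000, "end": 0x10FC, "last_size": 4}
b = {"mem_type": "Global", "start": 0x1100, "end": 0x11FC, "last_size": 4}
c = {"mem_type": "Global", "start": 0x3000, "end": 0x30FC, "last_size": 4}
print(can_fuse(a, b))  # True: b starts exactly where a ends
print(can_fuse(a, c))  # False: gap between the regions
```

Kernels that pass this test would then be merged by the concurrency-oriented multi-Kernel merging method from the background section.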
Fifth step: end.
Adopting the present invention can achieve the following technical effects:
1. It relieves the memory access pressure of GPU programs and improves the memory access efficiency of the GPU system;
2. It accelerates GPU application programs and improves the resource utilization of the GPU system.
Description of drawings
Fig. 1 shows the structure of the memory access behavior feature table.
Fig. 2 is the overall flow chart of GPU Kernel recombination optimization based on memory access divergence.
Embodiment
Fig. 1 shows the structure of the memory access behavior feature table, which is established as follows:
For each Kernel function of the GPU program, a memory access behavior feature table is established with four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical storage address Addr accessed. The thread number Tid is the unique number of the thread within the scope of this Kernel function; the memory type MemT denotes the kind of memory accessed by the thread, which may be global memory (Global Memory), shared memory (Shared Memory), texture memory (Texture Memory), or constant memory (Constant Memory); the data size Size is the number of bytes of storage occupied by the data the thread accesses; the logical storage address Addr identifies the address space holding the data the thread's computation needs.
Fig. 2 is the overall flow chart of the present invention; the concrete implementation steps are as follows:
First step: build the memory access behavior feature tables.
Second step: record the memory access trace of each thread in each Kernel function.
Third step: memory access divergence judgment.
Fourth step: GPU Kernel recombination optimization based on memory access divergence.
4.1 GPU Kernel splitting based on memory access divergence.
4.2 GPU Kernel fusion based on contiguous memory accesses.
Fifth step: end.

Claims (1)

1. A GPU kernel program recombination optimization method based on memory access divergence, characterized by comprising the following steps:
First step: use the Create method to build memory access behavior feature tables. The concrete steps are: for each Kernel function of the GPU program, establish a memory access behavior feature table containing four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical storage address Addr accessed; the thread number Tid is the unique number of the thread within the scope of this Kernel function; the memory type MemT denotes the kind of memory accessed by the thread; the data size Size is the number of bytes of storage occupied by the data the thread accesses; the logical storage address Addr identifies the address space holding the data the thread's computation needs;
Second step: use the Record method to record the memory access trace of each thread in each Kernel function. The concrete steps are:
2.1 Scan the GPU kernel program. Number the Kernel functions Kid = 0, 1, …, i, …, M−1, where 0 ≤ i < M and M is the number of Kernel functions in the GPU kernel program; let T_i be the number of threads launched by the Kernel function numbered i. There are M memory access behavior feature tables in total, and the table corresponding to the Kernel function numbered i has T_i entries. The memory access trace information of each Kernel thread in the GPU program is written into the corresponding entry and fields of its feature table;
2.2 Initialize j = 0;
2.3 Obtain the number of threads T_j launched by the Kernel function numbered j, and initialize k = 0;
2.4 Write the memory access trace information of the k-th thread of the Kernel function numbered j into the fields Tid, MemT, Size, and Addr of the feature table; update k = k + 1;
2.5 If k ≤ T_j − 1, go to 2.4; otherwise go to 2.6;
2.6 Update j = j + 1;
2.7 If j ≤ M − 1, go to 2.3; otherwise go to 2.8;
2.8 The memory access trace of every thread in every Kernel function has been recorded; proceed to the third step;
Third step: according to the memory access addresses of the GPU threads in each Kernel function, judge whether memory access divergence occurs among the threads of the same Kernel function, as follows:
3.1 Initialize j = 0;
3.2 Obtain the number of threads T_j launched by the Kernel function numbered j; define the set S_j of memory types accessed divergently by this Kernel function as empty, i.e. S_j = ∅, and the divergence address set A_j as empty, i.e. A_j = ∅; initialize m = 0;
3.3 Query the memory access behavior feature table for the memory type MemT_{T_m} accessed by thread T_m, where MemT_{T_m} denotes the memory type accessed by the m-th thread T_m. If MemT_{T_m} differs from the memory type MemT_{T_{m+1}} accessed by thread T_{m+1}, judge that memory access divergence occurs between thread T_m and thread T_{m+1}, and add the diverging memory types to the set S_j, i.e. S_j = S_j ∪ {MemT_{T_m}} and S_j = S_j ∪ {MemT_{T_{m+1}}}; then perform 3.4. Otherwise, directly perform 3.4;
3.4 Query the memory access behavior feature table for the logical storage address Addr_{T_m} accessed by thread T_m, where Addr_{T_m} denotes the logical storage address accessed by the m-th thread T_m. If the difference between Addr_{T_m} and the logical storage address Addr_{T_{m+1}} accessed by the (m+1)-th thread is not equal to the size of the data accessed by thread T_m, judge that memory access divergence occurs between thread T_m and thread T_{m+1}, and merge the two-tuples of diverging memory type and thread address, (MemT_{T_m}, Addr_{T_m}) and (MemT_{T_{m+1}}, Addr_{T_{m+1}}), successively into the set A_j, i.e. A_j = A_j ∪ {(MemT_{T_m}, Addr_{T_m})} and A_j = A_j ∪ {(MemT_{T_{m+1}}, Addr_{T_{m+1}})}; then perform 3.5. Otherwise, directly perform 3.5;
3.5 Update m = m + 1; if m < T_j − 1, go to 3.3; otherwise the comparison of thread memory types for this Kernel function has ended, and 3.6 is performed;
3.6 Update j = j + 1;
3.7 If j ≤ M − 1, go to 3.2; otherwise no Kernel function remains to be judged for memory access divergence, and 3.8 is performed;
3.8 The divergence judgment of every Kernel function is complete. If S_j = ∅ and A_j = ∅ for every j, the fifth step is taken; otherwise the fourth step is carried out;
Fourth step: GPU Kernel recombination optimization based on memory access divergence:
4.1 Split each GPU Kernel in which memory access divergence occurs into several divergence-free sub-Kernels, as follows:
4.1.1 Initialize j = 0;
4.1.2 Obtain the number of elements |S_j| of the set S_j; that is, the Kernel numbered j is to be split into |S_j| sub-Kernels;
4.1.3 Query the memory access behavior feature table of the Kernel numbered j and collect the thread groups that access the same memory type, obtaining |S_j| thread groups; organize each thread group as the thread block of one sub-Kernel;
4.1.4 Update j = j + 1;
4.1.5 If j ≤ M − 1, go to 4.1.2; otherwise perform 4.1.6;
4.1.6 The GPU Kernel splitting based on memory type is complete; go to 4.1.7;
4.1.7 Initialize j = 0;
4.1.8 If S_j is empty, i.e. S_j = ∅, go to 4.1.9; otherwise return to the first step;
4.1.9 Update j = j + 1; if j ≤ M − 1, go to 4.1.8; otherwise go to 4.1.10;
4.1.10 Initialize j = 0;
4.1.11 Obtain the number of elements |A_j| of the set A_j; the Kernel numbered j is to be split into |A_j| sub-Kernels;
4.1.12 Query the memory access behavior feature table of the Kernel numbered j and, taking the elements of A_j as boundaries, split the Kernel numbered j into |A_j| sub-Kernels;
4.1.13 Update j = j + 1;
4.1.14 If j ≤ M − 1, go to 4.1.11; otherwise go to 4.1.15;
4.1.15 Initialize j = 0;
4.1.16 If A_j is empty, i.e. A_j = ∅, go to 4.1.17; otherwise go to 4.1.11;
4.1.17 Update j = j + 1; if j ≤ M − 1, go to 4.1.16; otherwise go to 4.2.
4.2 Fuse the GPU Kernels with contiguous memory accesses; the concrete fusion method is as follows:
4.2.1 Use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program obtained after step 4.1, then use the Record method to record the memory access trace of each thread in each Kernel function;
4.2.2 Initialize j = 0;
4.2.3 Query the feature tables built in 4.2.1 for the memory types accessed by the threads of the Kernels numbered j and j+1. If the memory type accessed by the threads of the Kernel numbered j (denoted Kernel j) is identical to that of the Kernel numbered j+1 (denoted Kernel j+1), go to 4.2.4; otherwise go to 4.2.6;
4.2.4 Query the feature tables for the start and end addresses of the contiguous regions accessed by Kernel j and Kernel j+1, and for the data size accessed by the last thread. If the start address of Kernel j+1 differs from the end address of the contiguous region accessed by Kernel j by exactly the data size accessed by the last thread of Kernel j, or the start address of Kernel j likewise differs from the end address of the region accessed by Kernel j+1 by the data size accessed by its last thread, judge that Kernel j and Kernel j+1 access one contiguous storage region and go to 4.2.5; otherwise go to 4.2.6;
4.2.5 Fuse Kernel j and Kernel j+1. The method is: use the concurrency-oriented multi-Kernel merging method to re-merge the threads of Kernel j and Kernel j+1 and organize them into one new GPU Kernel;
4.2.6 Update j = j + 1;
4.2.7 If j < M − 1, go to 4.2.3; otherwise proceed to the fifth step;
Fifth step: end.
CN201310000459.7A 2013-01-03 2013-01-03 GPU kernel program recombination optimization method based on memory access divergence Expired - Fee Related CN103150157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310000459.7A CN103150157B (en) 2013-01-03 2013-01-03 GPU kernel program recombination optimization method based on memory access divergence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310000459.7A CN103150157B (en) 2013-01-03 2013-01-03 GPU kernel program recombination optimization method based on memory access divergence

Publications (2)

Publication Number Publication Date
CN103150157A true CN103150157A (en) 2013-06-12
CN103150157B CN103150157B (en) 2015-11-25

Family

ID=48548259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310000459.7A Expired - Fee Related CN103150157B (en) 2013-01-03 2013-01-03 GPU kernel program recombination optimization method based on memory access divergence

Country Status (1)

Country Link
CN (1) CN103150157B (en)


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAN Xinbiao et al.: "Elliptic curve cryptography streaming technology for many-core GPU architectures", Journal of Sichuan University (Engineering Science Edition) *
MA Anguo: "Research on key technologies of high-performance GPGPU architecture", China Doctoral Dissertations Full-text Database *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199782A (en) * 2014-08-25 2014-12-10 浙江大学城市学院 GPU memory access method
CN104199782B (en) * 2014-08-25 2017-04-26 浙江大学城市学院 GPU memory access method
CN107291537A (en) * 2017-06-07 2017-10-24 Optimization method for on-chip GPU memory usage
CN109725903A (en) * 2017-10-30 2019-05-07 华为技术有限公司 Program code transform process method, device and compiling system
CN109783222A (en) * 2017-11-15 2019-05-21 Method and apparatus for eliminating branch divergence


Similar Documents

Publication Publication Date Title
US11436400B2 (en) Optimization method for graph processing based on heterogeneous FPGA data streams
CN106991011B (en) CPU multithreading and GPU (graphics processing unit) multi-granularity parallel and cooperative optimization based method
CN1983196B (en) System and method for grouping execution threads
CN109002659B (en) Fluid machinery simulation program optimization method based on super computer
CN103150265B (en) The fine-grained data distribution method of isomery storer on Embedded sheet
CN101814039B (en) GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
CN106055311B (en) MapReduce tasks in parallel methods based on assembly line multithreading
JP2017091589A (en) Processor core and processor system
CN101799750B (en) Data processing method and device
Harris Optimizing cuda
US20140143570A1 (en) Thread consolidation in processor cores
CN102193830B (en) Many-core environment-oriented division mapping/reduction parallel programming model
DE102012221502A1 (en) A system and method for performing crafted memory access operations
CN102193811B (en) Compiling device for eliminating memory access conflict and realizing method thereof
CN103150157B (en) Based on the GPU kernel program restructuring optimization method of memory access difference
CN110750265B (en) High-level synthesis method and system for graph calculation
US20210334234A1 (en) Distributed graphics processor unit architecture
CN103699656A (en) GPU-based mass-multimedia-data-oriented MapReduce platform
CN103678571A (en) Multithreaded web crawler execution method applied to single host with multi-core processor
CN114970294B (en) Three-dimensional strain simulation PCG parallel optimization method and system based on Shenwei architecture
CN104317770A (en) Data storage structure and data access method for multiple core processing system
Bhatotia Incremental parallel and distributed systems
CN100489830C (en) 64 bit stream processor chip system structure oriented to scientific computing
CN109491934B (en) Storage management system control method integrating computing function

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151125

Termination date: 20220103