CN103150157A - Memory access bifurcation-based GPU (Graphics Processing Unit) kernel program recombination optimization method - Google Patents


Info

Publication number
CN103150157A
CN103150157A · CN2013100004597A · CN201310000459A
Authority
CN
China
Prior art keywords
kernel
thread
memory access
gpu
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100004597A
Other languages
Chinese (zh)
Other versions
CN103150157B (en
Inventor
甘新标
刘杰
迟利华
晏益慧
徐涵
胡庆丰
王志英
苏博
朱琪
刘聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201310000459.7A priority Critical patent/CN103150157B/en
Publication of CN103150157A publication Critical patent/CN103150157A/en
Application granted granted Critical
Publication of CN103150157B publication Critical patent/CN103150157B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a memory-access-divergence-based GPU (Graphics Processing Unit) kernel program recombination optimization method, which aims to improve the execution efficiency of large-scale GPU Kernels and overall application performance. The technical scheme is as follows: a memory access behavior feature table is constructed with a Create method; the memory access trace of each thread in each Kernel function is recorded with a Record method; next, the memory access addresses of the GPU threads in each Kernel function are examined to judge whether memory access divergence occurs among the threads of the same Kernel function; Kernel recombination optimization is then performed on the divergent GPU Kernels in two steps: splitting divergent GPU Kernels into divergence-free sub-Kernels, and fusing sub-Kernels with contiguous memory accesses into new GPU Kernels. The method addresses the low execution efficiency of large-scale GPU Kernel applications and improves both Kernel execution efficiency and application performance.

Description

A GPU kernel program recombination optimization method based on memory access divergence
Technical field
The present invention relates to GPU kernel program (GPU Kernel) recombination optimization methods, and in particular to a GPU Kernel recombination optimization method based on memory access divergence.
Background technology
In recent years, the powerful computing capability, massively threaded execution model, and flexible programming model of the GPU (Graphics Processing Unit) have led to its wide adoption in high-performance computing fields such as molecular dynamics simulation, bioinformatics analysis, and weather forecasting. When mapping large-scale GPGPU (General-Purpose computing on Graphics Processing Units) applications, however, the standard core program (Kernel) pattern cannot satisfy the demands of large-scale application programs.
A GPU kernel program (GPU Kernel) is a program segment that runs on the GPU. Programmers typically port the compute-intensive, time-consuming core subroutines of a program to the GPU for acceleration; such core subroutines running on the GPU are commonly called GPU Kernels.
A large-scale GPGPU application may contain tens or even hundreds of GPU Kernels. To improve the scheduling efficiency of so many GPU Kernels and exploit program concurrency to the greatest extent, multi-Kernel recombination optimization has become an effective way to improve program execution efficiency. Current GPU Kernel recombination optimization methods mainly fall into the following categories:
(1) Multi-Kernel merging for Kernel-level concurrency. GPU architectures before NVIDIA's second-generation unified shader architecture do not support concurrent execution of multiple Kernels, so merging multiple independent GPU Kernels is an effective way to let them execute concurrently. Although second-generation unified shader GPUs can execute multiple Kernels concurrently, the number of concurrent Kernels is still very limited. Multi-Kernel merging can therefore improve concurrency among Kernels, relieve the pressure of sequential Kernel execution, reduce Kernel launch overhead, and improve the execution efficiency of GPU programs.
(2) Multi-Kernel merging based on GPU shared memory. If there is a data dependence between GPU Kernels, i.e. the output of one Kernel is the input of another, the long-latency global memory accesses of each Kernel can be avoided by merging the dependent Kernels into one independent GPU Kernel and explicitly managing GPU shared memory as the data-transfer medium inside it. This avoids global memory accesses and improves memory access efficiency, while also reducing the number of GPU Kernels and their launch overhead, benefiting Kernel scheduling, and improving GPU program execution efficiency.
(3) Multi-Kernel recombination based on program branches. Unlike traditional CPU architectures, the GPU devotes most of its on-chip resources to computation, so its control logic and branch-prediction hardware are comparatively weak. Avoiding thread branches inside a GPU Kernel is therefore crucial for its execution efficiency. Researchers have proposed separating the distinct program branches within one GPU Kernel into different GPU Sub-Kernels and then merging the Sub-Kernels with identical execution paths into new GPU Kernels. Experiments show that branch-based GPU Kernel recombination effectively avoids the thread waiting and synchronization caused by GPU thread branches and significantly improves the execution efficiency of GPU Kernels.
The three classes of GPU Kernel recombination optimization above improve, to some extent, the execution efficiency of GPU Kernels and the performance of specific large-scale GPU applications. However, they all ignore the influence of GPU thread memory access patterns on the execution efficiency of GPU Kernels. In practice, the memory access behavior of GPU threads strongly affects Kernel execution efficiency. How to solve, for practical applications, the low GPU system efficiency and poor application performance caused by the memory access behavior of GPU threads is therefore an important technical problem of concern to those skilled in the art.
The memory access behavior of GPU threads can be divided into the following two classes:
(1) The threads of a Kernel access one contiguous region of storage. This is the idealized memory access behavior of GPU threads, under which a GPU Kernel executes most efficiently;
(2) Jumps appear in the memory accesses of the threads of a Kernel, i.e. different threads access different storage spaces, or the threads access discontinuous storage units within the same storage space.
The behavior in which jumps appear in the memory accesses of Kernel threads is referred to as GPU thread memory access divergence.
SIMD (Single Instruction Multiple Data) acceleration has become an effective way to improve architectural efficiency, and the GPU is a typical SIMD-accelerated architecture. The SIMD execution model relies on GPU threads accessing one contiguous region of storage. If jumps appear in the memory accesses of GPU threads, the system not only fails to accelerate GPU applications but also suffers seriously degraded GPU system efficiency and application execution efficiency. Recombining and optimizing the GPU threads that exhibit memory access divergence is a key technique for improving GPU system performance and application execution efficiency. Eliminating GPU thread memory access divergence through GPU Kernel recombination can therefore maximize GPU system efficiency and application execution efficiency. At present, no published literature addresses such a technical scheme.
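As a concrete illustration of the two behavior classes (with hypothetical addresses, not taken from the patent), one can list the byte address each thread touches: in the contiguous class, neighbouring threads are exactly one element apart, while in the divergent class the addresses jump.

```python
# Hypothetical illustration of the two memory access behavior classes.
# Each thread t reads one 4-byte element; addr(t) is its byte address.

ELEM = 4  # bytes per element (assumed)

# Class (1): contiguous -- thread t accesses base + t*ELEM.
contiguous = [0x1000 + t * ELEM for t in range(8)]

# Class (2): divergent -- threads jump with a stride of 8 elements.
divergent = [0x1000 + t * 8 * ELEM for t in range(8)]

def has_jump(addrs, size=ELEM):
    """True if any two neighbouring threads are not exactly `size` bytes apart."""
    return any(b - a != size for a, b in zip(addrs, addrs[1:]))

print(has_jump(contiguous))  # False: idealized pattern, Kernel runs best
print(has_jump(divergent))   # True: GPU thread memory access divergence
```

The contiguous pattern is exactly the one the SIMD execution model rewards; the divergent one is what the method below detects and eliminates.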
Summary of the invention
The technical problem to be solved by the present invention is the low execution efficiency of large-scale multi-Kernel GPU applications. A GPU Kernel recombination optimization method based on memory access divergence is proposed to improve the execution efficiency of large-scale GPU Kernels and overall application performance.
In order to solve the above technical problem, the concrete technical scheme of the present invention is as follows:
First step: use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program. The concrete steps of the Create method are: for each Kernel function of the GPU program, establish a memory access behavior feature table containing four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical storage address Addr accessed. The thread number Tid is the unique number of the thread within the scope of this Kernel function; the memory type MemT denotes the kind of memory accessed by the thread, which may be global memory (Global Memory), shared memory (Shared Memory), texture memory (Texture Memory), or constant memory (Constant Memory); the data size Size is the number of bytes of storage occupied by the data the thread accesses; the logical storage address Addr identifies the address space holding the data the thread's computation needs.
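A minimal sketch of the Create step, assuming a Python dict-of-lists representation of one table (the field names Tid, MemT, Size, and Addr come from the patent; everything else here is illustrative):

```python
# One memory access behavior feature table per Kernel function.
# Each entry records one thread's access: (Tid, MemT, Size, Addr).

# The four memory types named in the description (informational constant).
MEM_TYPES = {"Global", "Shared", "Texture", "Constant"}

def create_feature_table():
    """Create an empty memory access behavior feature table with the four fields."""
    return {"Tid": [], "MemT": [], "Size": [], "Addr": []}

def create_tables(num_kernels):
    """Build one feature table for each of the M Kernel functions."""
    return [create_feature_table() for _ in range(num_kernels)]

tables = create_tables(3)    # M = 3 hypothetical Kernel functions
print(len(tables))           # 3
print(sorted(tables[0]))     # ['Addr', 'MemT', 'Size', 'Tid']
```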
Second step: use the Record method to record the memory access trace of each thread in each Kernel function. The concrete steps of the Record method are:
2.1 Scan the GPU kernel program. Number the Kernel functions Kid = 0, 1, …, i, …, M−1, where 0 ≤ i < M and M is the number of Kernel functions in the GPU kernel program; let T_i be the number of threads launched by the Kernel function numbered i. There are M memory access behavior feature tables in total, and the table corresponding to the Kernel function numbered i has T_i entries. The memory access trace information of each Kernel thread in the GPU program is written into the corresponding entry and fields of its feature table;
2.2 Initialize j = 0;
2.3 Obtain the number of threads T_j launched by the Kernel function numbered j, and initialize k = 0;
2.4 Write the memory access trace information of the k-th thread of the Kernel function numbered j into the fields Tid, MemT, Size, and Addr of the feature table; update k = k + 1;
2.5 If k ≤ T_j − 1, go to 2.4; otherwise go to 2.6;
2.6 Update j = j + 1;
2.7 If j ≤ M − 1, go to 2.3; otherwise go to 2.8;
2.8 The memory access trace of every thread in every Kernel function has been recorded; proceed to the third step.
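Steps 2.1–2.8 amount to a double loop over Kernels and threads. A sketch under the same illustrative dict-of-lists representation (the per-thread trace function `trace_of` is a hypothetical stand-in for whatever instrumentation supplies the real access traces):

```python
def record(tables, thread_counts, trace_of):
    """Record the memory access trace of every thread of every Kernel.

    tables:        list of M feature tables (dicts with Tid/MemT/Size/Addr lists)
    thread_counts: T_j, the number of threads launched by Kernel j
    trace_of:      trace_of(j, k) -> (mem_type, size, addr) for thread k of Kernel j
    """
    for j, table in enumerate(tables):        # 2.2/2.6/2.7: loop over Kernel functions
        for k in range(thread_counts[j]):     # 2.3/2.4/2.5: loop over threads
            mem_type, size, addr = trace_of(j, k)
            table["Tid"].append(k)
            table["MemT"].append(mem_type)
            table["Size"].append(size)
            table["Addr"].append(addr)

# Hypothetical program: one Kernel, 4 threads, contiguous 4-byte accesses.
tables = [{"Tid": [], "MemT": [], "Size": [], "Addr": []}]
record(tables, [4], lambda j, k: ("Global", 4, 0x1000 + 4 * k))
print(tables[0]["Addr"])  # [4096, 4100, 4104, 4108]
```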
Third step: memory access divergence judgment. According to the memory access addresses of the GPU threads in each Kernel function, judge whether memory access divergence occurs among the threads of the same Kernel function. The judgment method is as follows:
3.1 Initialize j = 0;
3.2 Obtain the number of threads T_j launched by the Kernel function numbered j; define the set S_j of memory types accessed divergently by this Kernel function as empty, i.e. S_j = ∅, and the divergence address set A_j as empty, i.e. A_j = ∅; initialize m = 0;
3.3 Query the memory access behavior feature table for the memory type MemT_{T_m} accessed by thread T_m, where MemT_{T_m} denotes the memory type accessed by the m-th thread T_m. If MemT_{T_m} differs from the memory type MemT_{T_{m+1}} accessed by thread T_{m+1}, judge that memory access divergence occurs between thread T_m and thread T_{m+1}, and add the diverging memory types to the set S_j, i.e. S_j = S_j ∪ {MemT_{T_m}} and S_j = S_j ∪ {MemT_{T_{m+1}}}; then perform 3.4. Otherwise, directly perform 3.4;
3.4 Query the memory access behavior feature table for the logical storage address Addr_{T_m} accessed by thread T_m, where Addr_{T_m} denotes the logical storage address accessed by the m-th thread T_m. If the difference between Addr_{T_m} and the logical storage address Addr_{T_{m+1}} accessed by the (m+1)-th thread is not equal to the size of the data accessed by thread T_m, judge that memory access divergence occurs between thread T_m and thread T_{m+1}, and merge the two-tuples of diverging memory type and thread address, (MemT_{T_m}, Addr_{T_m}) and (MemT_{T_{m+1}}, Addr_{T_{m+1}}), successively into the set A_j, i.e. A_j = A_j ∪ {(MemT_{T_m}, Addr_{T_m})} and A_j = A_j ∪ {(MemT_{T_{m+1}}, Addr_{T_{m+1}})}; then perform 3.5. Otherwise, directly perform 3.5;
3.5 Update m = m + 1; if m < T_j − 1, go to 3.3; otherwise the comparison of thread memory types for this Kernel function has ended, and 3.6 is performed;
3.6 Update j = j + 1;
3.7 If j ≤ M − 1, go to 3.2; otherwise no Kernel function remains to be judged for memory access divergence, and 3.8 is performed.
3.8 The divergence judgment of every Kernel function is complete. If S_j = ∅ and A_j = ∅ for every j, no memory access divergence exists; the optimization ends directly, and the fifth step is taken. Otherwise, GPU Kernel recombination optimization is performed: go to the fourth step.
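The third step compares each pair of neighbouring threads: a memory-type mismatch feeds the set S_j (step 3.3), and an address gap different from the access size feeds the pair set A_j (step 3.4). A sketch over the illustrative table representation used above:

```python
def judge_divergence(table):
    """Return (S, A) for one Kernel's feature table.

    S: memory types involved in type divergence (step 3.3).
    A: (MemT, Addr) two-tuples at address divergence points (step 3.4).
    """
    S, A = set(), set()
    n = len(table["Tid"])
    for m in range(n - 1):                                   # 3.5: m runs to T_j - 1
        if table["MemT"][m] != table["MemT"][m + 1]:         # 3.3: type mismatch
            S |= {table["MemT"][m], table["MemT"][m + 1]}
        if table["Addr"][m + 1] - table["Addr"][m] != table["Size"][m]:  # 3.4: gap
            A.add((table["MemT"][m], table["Addr"][m]))
            A.add((table["MemT"][m + 1], table["Addr"][m + 1]))
    return S, A

# Hypothetical Kernel: threads 0-1 contiguous in Global, thread 2 jumps to Shared.
table = {"Tid": [0, 1, 2],
         "MemT": ["Global", "Global", "Shared"],
         "Size": [4, 4, 4],
         "Addr": [0x1000, 0x1004, 0x2000]}
S, A = judge_divergence(table)
print(sorted(S))                 # ['Global', 'Shared']
print(("Shared", 0x2000) in A)   # True: the jump point enters A_j
```

An empty S and A for every Kernel corresponds to the early exit in step 3.8.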
Fourth step: GPU Kernel recombination optimization based on memory access divergence. Recombining the GPU Kernels in which memory access divergence occurs mainly comprises two steps: splitting each divergent GPU Kernel into several divergence-free sub-Kernels, and fusing the sub-Kernels with contiguous memory accesses into new GPU Kernels.
4.1 GPU Kernel splitting based on memory access divergence. Each GPU Kernel in which memory access divergence occurs is split into several divergence-free sub-Kernels, as follows:
4.1.1 Initialize j = 0;
4.1.2 Obtain the number of elements |S_j| of the set S_j; that is, the Kernel numbered j is to be split into |S_j| sub-Kernels;
4.1.3 Query the memory access behavior feature table of the Kernel numbered j and collect the thread groups that access the same memory type, obtaining |S_j| thread groups; organize each thread group as the thread block of one sub-Kernel;
4.1.4 Update j = j + 1;
4.1.5 If j ≤ M − 1, go to 4.1.2; otherwise perform 4.1.6;
4.1.6 The GPU Kernel splitting based on memory type is complete; go to 4.1.7.
4.1.7 Initialize j = 0;
4.1.8 If S_j is empty, i.e. S_j = ∅, go to 4.1.9; otherwise build a memory access behavior feature table for each Kernel function of the GPU program obtained after splitting, and return to the first step;
4.1.9 Update j = j + 1; if j ≤ M − 1, go to 4.1.8; otherwise go to 4.1.10;
4.1.10 Initialize j = 0;
4.1.11 Obtain the number of elements |A_j| of the set A_j; the Kernel numbered j is to be split into |A_j| sub-Kernels;
4.1.12 Query the memory access behavior feature table of the Kernel numbered j and, taking the elements of A_j as boundaries, split the Kernel numbered j into |A_j| sub-Kernels;
4.1.13 Update j = j + 1;
4.1.14 If j ≤ M − 1, go to 4.1.11; otherwise go to 4.1.15;
4.1.15 Initialize j = 0;
4.1.16 If A_j is empty, i.e. A_j = ∅, go to 4.1.17; otherwise go to 4.1.11;
4.1.17 Update j = j + 1; if j ≤ M − 1, go to 4.1.16; otherwise go to 4.2.
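The type-based split of step 4.1 groups a divergent Kernel's threads by accessed memory type, and each group becomes the thread block of one sub-Kernel. A sketch of the grouping alone (a real implementation would also have to clone the Kernel code for each group):

```python
from collections import defaultdict

def split_by_memory_type(table):
    """Split one divergent Kernel's threads into |S_j| groups by memory type.

    Returns {mem_type: [thread ids]}; each group would be organized as the
    thread block of one divergence-free sub-Kernel (step 4.1.3).
    """
    groups = defaultdict(list)
    for tid, mem_type in zip(table["Tid"], table["MemT"]):
        groups[mem_type].append(tid)
    return dict(groups)

# Hypothetical divergent Kernel: threads alternate between Global and Shared.
table = {"Tid": [0, 1, 2, 3],
         "MemT": ["Global", "Shared", "Global", "Shared"],
         "Size": [4] * 4,
         "Addr": [0x1000, 0x2000, 0x1004, 0x2004]}
subkernels = split_by_memory_type(table)
print(len(subkernels))       # 2 sub-Kernels, one per memory type in S_j
print(subkernels["Global"])  # [0, 2]
```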
4.2 Fusion of GPU Kernels with contiguous memory accesses. After Kernel splitting, the GPU kernel program may contain Kernel fragments that hurt GPU Kernel execution efficiency. Fusing threads with contiguous accesses to the same memory into a new, larger GPU Kernel can significantly improve GPU system efficiency and application performance. The Kernels considered for fusion include both the Kernels produced by splitting and the Kernels that were never split: for example, if Kernel1 has been split into Kernel11 and Kernel12 while Kernel2 was not split, the fusion must consider merging Kernel11, Kernel12, and Kernel2 into new Kernels. The concrete Kernel fusion method is as follows:
4.2.1 Use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program obtained after step 4.1, then use the Record method to record the memory access trace of each thread in each Kernel function;
4.2.2 Initialize j = 0;
4.2.3 Query the feature tables built in 4.2.1 for the memory types accessed by the threads of the Kernels numbered j and j+1. If the memory type accessed by the threads of the Kernel numbered j (denoted Kernel j) is identical to that of the Kernel numbered j+1 (denoted Kernel j+1), go to 4.2.4; otherwise go to 4.2.6;
4.2.4 Query the feature tables for the start and end addresses of the contiguous regions accessed by Kernel j and Kernel j+1, and for the data size accessed by the last thread. If the start address of Kernel j+1 differs from the end address of the contiguous region accessed by Kernel j by exactly the data size accessed by the last thread of Kernel j, or the start address of Kernel j likewise differs from the end address of the region accessed by Kernel j+1 by the data size accessed by its last thread, judge that Kernel j and Kernel j+1 access one contiguous storage region and go to 4.2.5; otherwise go to 4.2.6;
4.2.5 Fuse Kernel j and Kernel j+1. The concrete operation is: use the concurrency-oriented multi-Kernel merging method described in the background to re-merge the threads of Kernel j and Kernel j+1 and organize them into one larger new GPU Kernel;
4.2.6 Update j = j + 1;
4.2.7 If j < M − 1, go to 4.2.3; otherwise proceed to the fifth step.
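The fusion test of steps 4.2.3–4.2.4 checks that two Kernels use the same memory type and that one accessed region begins exactly where the other ends. A sketch, assuming each candidate Kernel is summarized by its memory type, region start/end, and last-thread access size (all field names here are illustrative):

```python
def can_fuse(k1, k2):
    """Steps 4.2.3-4.2.4: may Kernel j and Kernel j+1 be fused?

    k1, k2: dicts with keys 'mem_type', 'start', 'end', 'last_size'
    (end = address of the region's last access; last_size = its size in bytes).
    """
    if k1["mem_type"] != k2["mem_type"]:     # 4.2.3: memory types must match
        return False
    # 4.2.4: the two regions are contiguous, in either order.
    return (k2["start"] - k1["end"] == k1["last_size"] or
            k1["start"] - k2["end"] == k2["last_size"])

a = {"mem_type": "Global", "start": 0x1000, "end": 0x10FC, "last_size": 4}
b = {"mem_type": "Global", "start": 0x1100, "end": 0x11FC, "last_size": 4}
c = {"mem_type": "Global", "start": 0x3000, "end": 0x30FC, "last_size": 4}
print(can_fuse(a, b))  # True: b starts exactly where a ends
print(can_fuse(a, c))  # False: gap between the regions
```

Kernels that pass this test would then be merged by the concurrency-oriented multi-Kernel merging method from the background section.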
Fifth step: end.
Adopting the present invention can achieve the following technical effects:
1. It relieves the memory access pressure of GPU programs and improves the memory access efficiency of the GPU system;
2. It accelerates GPU application programs and improves the resource utilization of the GPU system.
Description of drawings
Fig. 1 shows the structure of the memory access behavior feature table.
Fig. 2 is the overall flow chart of GPU Kernel recombination optimization based on memory access divergence.
Embodiment
Fig. 1 shows the structure of the memory access behavior feature table, which is established as follows:
For each Kernel function of the GPU program, a memory access behavior feature table is established with four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical storage address Addr accessed. The thread number Tid is the unique number of the thread within the scope of this Kernel function; the memory type MemT denotes the kind of memory accessed by the thread, which may be global memory (Global Memory), shared memory (Shared Memory), texture memory (Texture Memory), or constant memory (Constant Memory); the data size Size is the number of bytes of storage occupied by the data the thread accesses; the logical storage address Addr identifies the address space holding the data the thread's computation needs.
Fig. 2 is the overall flow chart of the present invention; the concrete implementation steps are as follows:
First step: build the memory access behavior feature tables.
Second step: record the memory access trace of each thread in each Kernel function.
Third step: memory access divergence judgment.
Fourth step: GPU Kernel recombination optimization based on memory access divergence.
4.1 GPU Kernel splitting based on memory access divergence.
4.2 GPU Kernel fusion based on contiguous memory accesses.
Fifth step: end.

Claims (1)

1. A GPU kernel program recombination optimization method based on memory access divergence, characterized by comprising the following steps:
First step: use the Create method to build memory access behavior feature tables. The concrete steps are: for each Kernel function of the GPU program, establish a memory access behavior feature table containing four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical storage address Addr accessed; the thread number Tid is the unique number of the thread within the scope of this Kernel function; the memory type MemT denotes the kind of memory accessed by the thread; the data size Size is the number of bytes of storage occupied by the data the thread accesses; the logical storage address Addr identifies the address space holding the data the thread's computation needs;
Second step: use the Record method to record the memory access trace of each thread in each Kernel function. The concrete steps are:
2.1 Scan the GPU kernel program. Number the Kernel functions Kid = 0, 1, …, i, …, M−1, where 0 ≤ i < M and M is the number of Kernel functions in the GPU kernel program; let T_i be the number of threads launched by the Kernel function numbered i. There are M memory access behavior feature tables in total, and the table corresponding to the Kernel function numbered i has T_i entries. The memory access trace information of each Kernel thread in the GPU program is written into the corresponding entry and fields of its feature table;
2.2 Initialize j = 0;
2.3 Obtain the number of threads T_j launched by the Kernel function numbered j, and initialize k = 0;
2.4 Write the memory access trace information of the k-th thread of the Kernel function numbered j into the fields Tid, MemT, Size, and Addr of the feature table; update k = k + 1;
2.5 If k ≤ T_j − 1, go to 2.4; otherwise go to 2.6;
2.6 Update j = j + 1;
2.7 If j ≤ M − 1, go to 2.3; otherwise go to 2.8;
2.8 The memory access trace of every thread in every Kernel function has been recorded; proceed to the third step;
Third step: according to the memory access addresses of the GPU threads in each Kernel function, judge whether memory access divergence occurs among the threads of the same Kernel function, as follows:
3.1 Initialize j = 0;
3.2 Obtain the number of threads T_j launched by the Kernel function numbered j; define the set S_j of memory types accessed divergently by this Kernel function as empty, i.e. S_j = ∅, and the divergence address set A_j as empty, i.e. A_j = ∅; initialize m = 0;
3.3 Query the memory access behavior feature table for the memory type MemT_{T_m} accessed by thread T_m, where MemT_{T_m} denotes the memory type accessed by the m-th thread T_m. If MemT_{T_m} differs from the memory type MemT_{T_{m+1}} accessed by thread T_{m+1}, judge that memory access divergence occurs between thread T_m and thread T_{m+1}, and add the diverging memory types to the set S_j, i.e. S_j = S_j ∪ {MemT_{T_m}} and S_j = S_j ∪ {MemT_{T_{m+1}}}; then perform 3.4. Otherwise, directly perform 3.4;
3.4 Query the memory access behavior feature table for the logical storage address Addr_{T_m} accessed by thread T_m, where Addr_{T_m} denotes the logical storage address accessed by the m-th thread T_m. If the difference between Addr_{T_m} and the logical storage address Addr_{T_{m+1}} accessed by the (m+1)-th thread is not equal to the size of the data accessed by thread T_m, judge that memory access divergence occurs between thread T_m and thread T_{m+1}, and merge the two-tuples of diverging memory type and thread address, (MemT_{T_m}, Addr_{T_m}) and (MemT_{T_{m+1}}, Addr_{T_{m+1}}), successively into the set A_j, i.e. A_j = A_j ∪ {(MemT_{T_m}, Addr_{T_m})} and A_j = A_j ∪ {(MemT_{T_{m+1}}, Addr_{T_{m+1}})}; then perform 3.5. Otherwise, directly perform 3.5;
3.5 Update m = m + 1; if m < T_j − 1, go to 3.3; otherwise the comparison of thread memory types for this Kernel function has ended, and 3.6 is performed;
3.6 Update j = j + 1;
3.7 If j ≤ M − 1, go to 3.2; otherwise no Kernel function remains to be judged for memory access divergence, and 3.8 is performed;
3.8 The divergence judgment of every Kernel function is complete. If S_j = ∅ and A_j = ∅ for every j, the fifth step is taken; otherwise the fourth step is carried out;
Fourth step: GPU Kernel recombination optimization based on memory access divergence:
4.1 Split each GPU Kernel in which memory access divergence occurs into several divergence-free sub-Kernels, as follows:
4.1.1 Initialize j = 0;
4.1.2 Obtain the number of elements |S_j| of the set S_j; that is, the Kernel numbered j is to be split into |S_j| sub-Kernels;
4.1.3 Query the memory access behavior feature table of the Kernel numbered j and collect the thread groups that access the same memory type, obtaining |S_j| thread groups; organize each thread group as the thread block of one sub-Kernel;
4.1.4 Update j = j + 1;
4.1.5 If j ≤ M − 1, go to 4.1.2; otherwise perform 4.1.6;
4.1.6 The GPU Kernel splitting based on memory type is complete; go to 4.1.7;
4.1.7 Initialize j = 0;
4.1.8 If S_j is empty, i.e. S_j = ∅, go to 4.1.9; otherwise return to the first step;
4.1.9 Update j = j + 1; if j ≤ M − 1, go to 4.1.8; otherwise go to 4.1.10;
4.1.10 Initialize j = 0;
4.1.11 Obtain the number of elements |A_j| of the set A_j; the Kernel numbered j is to be split into |A_j| sub-Kernels;
4.1.12 Query the memory access behavior feature table of the Kernel numbered j and, taking the elements of A_j as boundaries, split the Kernel numbered j into |A_j| sub-Kernels;
4.1.13 Update j = j + 1;
4.1.14 If j ≤ M − 1, go to 4.1.11; otherwise go to 4.1.15;
4.1.15 Initialize j = 0;
4.1.16 If A_j is empty, i.e. A_j = ∅, go to 4.1.17; otherwise go to 4.1.11;
4.1.17 Update j = j + 1; if j ≤ M − 1, go to 4.1.16; otherwise go to 4.2.
4.2 Fuse the GPU Kernels with contiguous memory accesses; the concrete fusion method is as follows:
4.2.1 Use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program obtained after step 4.1, then use the Record method to record the memory access trace of each thread in each Kernel function;
4.2.2 Initialize j = 0;
4.2.3 Query the feature tables built in 4.2.1 for the memory types accessed by the threads of the Kernels numbered j and j+1. If the memory type accessed by the threads of the Kernel numbered j (denoted Kernel j) is identical to that of the Kernel numbered j+1 (denoted Kernel j+1), go to 4.2.4; otherwise go to 4.2.6;
4.2.4 Query the feature tables for the start and end addresses of the contiguous regions accessed by Kernel j and Kernel j+1, and for the data size accessed by the last thread. If the start address of Kernel j+1 differs from the end address of the contiguous region accessed by Kernel j by exactly the data size accessed by the last thread of Kernel j, or the start address of Kernel j likewise differs from the end address of the region accessed by Kernel j+1 by the data size accessed by its last thread, judge that Kernel j and Kernel j+1 access one contiguous storage region and go to 4.2.5; otherwise go to 4.2.6;
4.2.5 Fuse Kernel j and Kernel j+1. The method is: use the concurrency-oriented multi-Kernel merging method to re-merge the threads of Kernel j and Kernel j+1 and organize them into one new GPU Kernel;
4.2.6 Update j = j + 1;
4.2.7 If j < M − 1, go to 4.2.3; otherwise proceed to the fifth step;
Fifth step: end.
CN201310000459.7A 2013-01-03 2013-01-03 GPU kernel program recombination optimization method based on memory access divergence Expired - Fee Related CN103150157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310000459.7A CN103150157B (en) 2013-01-03 2013-01-03 GPU kernel program recombination optimization method based on memory access divergence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310000459.7A CN103150157B (en) 2013-01-03 2013-01-03 GPU kernel program recombination optimization method based on memory access divergence

Publications (2)

Publication Number Publication Date
CN103150157A true CN103150157A (en) 2013-06-12
CN103150157B CN103150157B (en) 2015-11-25

Family

ID=48548259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310000459.7A Expired - Fee Related CN103150157B (en) 2013-01-03 2013-01-03 GPU kernel program recombination optimization method based on memory access divergence

Country Status (1)

Country Link
CN (1) CN103150157B (en)


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAN Xinbiao et al.: "Elliptic curve cryptography streaming technology for many-core GPU architectures", Journal of Sichuan University (Engineering Science Edition) *
MA Anguo: "Research on key technologies of high-performance GPGPU architecture", China Doctoral Dissertations Full-text Database *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199782A (en) * 2014-08-25 2014-12-10 浙江大学城市学院 GPU memory access method
CN104199782B (en) * 2014-08-25 2017-04-26 浙江大学城市学院 GPU memory access method
CN107291537A (en) * 2017-06-07 2017-10-24 Optimization method for on-chip GPU memory usage
CN109725903A (en) * 2017-10-30 2019-05-07 华为技术有限公司 Program code transform process method, device and compiling system
CN109783222A (en) * 2017-11-15 2019-05-21 Method and apparatus for eliminating branch divergence


Similar Documents

Publication Publication Date Title
US11436400B2 (en) Optimization method for graph processing based on heterogeneous FPGA data streams
CN106991011B (en) CPU multithreading and GPU (graphics processing unit) multi-granularity parallel and cooperative optimization based method
CN1983196B (en) System and method for grouping execution threads
CN109002659B (en) Fluid machinery simulation program optimization method based on super computer
CN103150265B (en) The fine-grained data distribution method of isomery storer on Embedded sheet
CN101814039B (en) GPU-based Cache simulator and spatial parallel acceleration simulation method thereof
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
CN106055311B (en) MapReduce tasks in parallel methods based on assembly line multithreading
JP2017091589A (en) Processor core and processor system
CN101799750B (en) Data processing method and device
Harris Optimizing cuda
US20140143570A1 (en) Thread consolidation in processor cores
CN102193830B (en) Many-core environment-oriented division mapping/reduction parallel programming model
DE102012221502A1 (en) A system and method for performing crafted memory access operations
CN102193811B (en) Compiling device for eliminating memory access conflict and realizing method thereof
CN103150157B (en) Based on the GPU kernel program restructuring optimization method of memory access difference
CN110750265B (en) High-level synthesis method and system for graph calculation
US20210334234A1 (en) Distributed graphics processor unit architecture
CN103699656A (en) GPU-based mass-multimedia-data-oriented MapReduce platform
CN103678571A (en) Multithreaded web crawler execution method applied to single host with multi-core processor
CN114970294B (en) Three-dimensional strain simulation PCG parallel optimization method and system based on Shenwei architecture
CN104317770A (en) Data storage structure and data access method for multiple core processing system
Bhatotia Incremental parallel and distributed systems
CN100489830C (en) 64 bit stream processor chip system structure oriented to scientific computing
CN109491934B (en) Storage management system control method integrating computing function

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151125

Termination date: 20220103