CN103150157A - Memory-access-divergence-based GPU (Graphics Processing Unit) kernel program recombination optimization method
- Publication number: CN103150157A
- Authority: CN (China)
- Prior art keywords: kernel, thread, memory access, GPU, function
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a memory-access-divergence-based GPU (Graphics Processing Unit) kernel program recombination optimization method that aims to improve the execution efficiency of large-scale GPU Kernels and the performance of the applications that use them. The technical scheme is as follows: a memory access behavior feature table is built with a Create method, and the memory access trace of every thread in every Kernel function is recorded with a Record method; the memory access addresses of the GPU threads in each Kernel function are then compared to judge whether memory access divergence occurs among the threads of the same Kernel function; finally, Kernel recombination optimization is applied to the divergent Kernels in two steps: splitting GPU Kernels at the points of memory access divergence, and fusing the GPU Kernels whose memory accesses are contiguous. The method addresses the low execution efficiency of applications with many large-scale GPU Kernels and improves both their execution efficiency and overall application performance.
Description
Technical field
The present invention relates to GPU kernel program (i.e., GPU Kernel) recombination optimization methods, and especially to a GPU Kernel recombination optimization method based on memory access divergence.
Background technology
In recent years, the powerful computing capability, massively threaded execution model, and flexible programming model of the GPU (Graphics Processing Unit) have made GPUs widely used in many high-performance computing fields such as molecular dynamics simulation, biological analysis, and weather forecasting. When mapping large-scale GPGPU (General-Purpose computing on Graphics Processing Units) applications, however, the standard single core program (Kernel) pattern cannot meet the demands of large-scale application programs.
A GPU kernel program (GPU Kernel) is a program segment that runs on the GPU. Programmers typically port the computation-intensive, time-consuming core subroutines of a program to the GPU for acceleration; such a core subroutine running on the GPU is commonly called a GPU Kernel.
A large-scale GPGPU application may contain tens to hundreds of GPU Kernels. To improve the scheduling efficiency of so many GPU Kernels and exploit program concurrency to the greatest extent, multi-Kernel recombination optimization has become an effective way to raise program execution efficiency. The existing GPU Kernel recombination optimization methods mainly fall into the following categories:
(1) Multi-Kernel merging for Kernel-level concurrency. GPU architectures before NVIDIA's second-generation unified shader architecture did not support concurrent execution of multiple Kernels, so merging independent Kernels was an effective way to let them execute concurrently. Although second-generation unified-shader NVIDIA GPUs can run several Kernels concurrently, the number of concurrent Kernels remains very limited. Multi-Kernel merging therefore still improves inter-Kernel concurrency, relieves the pressure of sequential Kernel execution, reduces Kernel launch overhead, and improves the execution efficiency of GPU programs.
(2) Multi-Kernel merging based on GPU shared memory. If there is a data dependence between GPU Kernels, i.e., the output of one Kernel is exactly the input of another, then to avoid each Kernel's long-latency sequential accesses to global memory, the Kernels with input/output dependences can be merged into a single GPU Kernel that explicitly manages GPU shared memory as the data-transfer intermediary. This avoids global memory accesses and improves memory access efficiency; at the same time it reduces the number of GPU Kernels and their launch overhead, eases Kernel scheduling, and improves GPU program execution efficiency.
(3) Multi-Kernel recombination based on program branches. Unlike traditional CPU architectures, a GPU devotes most of its on-chip resources to computation, so its control logic and branch prediction hardware are comparatively weak. Avoiding thread branches inside a GPU Kernel is therefore crucial to its execution efficiency. Researchers have proposed separating the different program branches of one GPU Kernel into different GPU sub-Kernels and then merging the sub-Kernels that share the same execution path into a new GPU Kernel. Experiments show that branch-based GPU Kernel recombination effectively avoids the thread waiting and synchronization caused by GPU thread branches and significantly improves the execution efficiency of GPU Kernels.
The three classes of GPU Kernel recombination optimization above, each targeting a concrete large-scale GPU application, improve the execution efficiency of GPU Kernels and the performance of GPU applications to some extent. However, they all ignore the impact of the GPU threads' memory access patterns on Kernel execution efficiency. In practice, the memory access behavior of GPU threads strongly affects the execution efficiency of a GPU Kernel. How to overcome the low GPU system efficiency and poor application performance caused by thread memory access behavior is therefore an important technical problem for those skilled in the art.
The memory access behavior of GPU threads can be divided into the following two classes:
(1) The threads in a Kernel access one contiguous region of storage. This is the idealized memory access behavior of GPU threads, under which a GPU Kernel executes most efficiently;
(2) The memory accesses of the threads in a Kernel exhibit jumps, i.e., different threads access different storage spaces, or they access discontinuous storage units within the same storage space.
The second class of behavior, in which the accesses of a Kernel's threads jump, is referred to as GPU thread memory access divergence.
SIMD (Single Instruction Multiple Data) acceleration has become an effective way to raise architectural efficiency, and the GPU is a typical SIMD-accelerated architecture. SIMD operation is predicated on the GPU threads accessing one contiguous region of storage. If the memory accesses of the GPU threads jump, SIMD not only fails to raise system efficiency and accelerate the GPU application, it also seriously hurts GPU system efficiency and application execution efficiency. Recombining and optimizing the GPU threads that exhibit memory access divergence is therefore a key technique for improving GPU system performance and application execution efficiency: eliminating GPU thread memory access divergence through GPU Kernel recombination can raise GPU system efficiency and application execution efficiency to the greatest extent. To date, no published literature addresses such a technical scheme.
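For illustration only (not part of the patent), the two behavior classes above can be told apart by checking whether each thread's address continues the previous thread's access by exactly the element size. A minimal Python sketch with a hypothetical `is_coalesced` helper:

```python
def is_coalesced(addrs, size):
    """A thread group's accesses are contiguous (class 1 above) when each
    thread's address follows the previous one by exactly the element size;
    any other pattern exhibits the access 'jump' called divergence here."""
    return all(b - a == size for a, b in zip(addrs, addrs[1:]))

print(is_coalesced([0, 4, 8, 12], 4))   # True  -- ideal contiguous pattern
print(is_coalesced([0, 4, 64, 68], 4))  # False -- a jump at thread 2
```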
Summary of the invention
The technical problem to be solved by the present invention is the low execution efficiency of applications with many large-scale GPU Kernels. A GPU Kernel recombination optimization method based on memory access divergence is proposed to improve the execution efficiency of large-scale GPU Kernels and the performance of their applications.
In order to solve the above technical problem, the concrete technical scheme of the present invention is as follows:
The first step: use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program. The concrete steps of the Create method are: establish one memory access behavior feature table per Kernel function of the GPU program; the table comprises four fields, namely the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical storage address Addr of the access. Tid is the unique number of the thread within the scope of this Kernel function; MemT is the type of memory the thread accesses, one of global memory (Global Memory), shared memory (Shared Memory), texture memory (Texture Memory), and constant memory (Constant Memory); Size is the number of bytes of storage occupied by the data this thread accesses; Addr is the address space holding the data that the thread's computation needs.
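To make the table layout concrete, here is a minimal Python sketch of the Create step. The patent gives no code; `create`, `add_row`, and the dict-based rows are hypothetical stand-ins for the four-field feature table:

```python
# Hypothetical sketch of the Create step: one empty memory access behavior
# feature table per Kernel function; each row later holds the four fields
# Tid (thread number), MemT (memory type), Size (bytes), Addr (logical address).
MEMORY_TYPES = ("Global", "Shared", "Texture", "Constant")

def create(num_kernels):
    """Return one empty feature table (list of row dicts) per Kernel function."""
    return {kid: [] for kid in range(num_kernels)}

def add_row(tables, kid, tid, mem_t, size, addr):
    """Append one thread's trace record to kernel kid's feature table."""
    assert mem_t in MEMORY_TYPES
    tables[kid].append({"Tid": tid, "MemT": mem_t, "Size": size, "Addr": addr})

tables = create(2)
add_row(tables, 0, 0, "Global", 4, 0)
print(tables[0])  # [{'Tid': 0, 'MemT': 'Global', 'Size': 4, 'Addr': 0}]
```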
The second step: use the Record method to record the memory access trace of every thread in every Kernel function. The concrete steps of the Record method are:
2.1 Scan the GPU kernel program. Number the Kernel functions Kid = 0, 1, ..., i, ..., M-1, where 0 ≤ i < M and M is the number of Kernel functions in the GPU kernel program; let T_i denote the number of threads launched by the Kernel function numbered i. There are M memory access behavior feature tables in total, and the table corresponding to the Kernel function numbered i has T_i entries, one per launched thread. The memory access trace information of the Kernel threads in the GPU program is written into the corresponding entries and fields of the feature tables;
2.2 Initialize j = 0;
2.3 Obtain the number of threads T_j launched by the Kernel function numbered j, and initialize k = 0;
2.4 Write the memory access trace information of the k-th thread of the Kernel function numbered j into the fields Tid, MemT, Size, and Addr of the feature table; update k = k + 1;
2.5 If k ≤ T_j - 1, go to 2.4; otherwise go to 2.6;
2.6 Update j = j + 1;
2.7 If j ≤ M - 1, go to 2.3; otherwise go to 2.8;
2.8 The memory access trace of every thread in every Kernel function has been recorded; proceed to the third step.
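The nested Record loop (2.1-2.8) can be sketched in Python as follows. This is a hypothetical stand-in in which each thread's trace is already available as a `(MemT, Size, Addr)` tuple; the real method would obtain these from instrumentation of the GPU program:

```python
def record(traces):
    """traces[j][k] = (mem_t, size, addr) for thread k of kernel j.
    Returns one feature table per kernel: a list of rows holding the
    four fields Tid, MemT, Size, Addr (loops 2.2-2.7 of the Record step)."""
    tables = []
    for kernel_trace in traces:                       # outer loop over kernels j
        rows = [{"Tid": k, "MemT": m, "Size": s, "Addr": a}
                for k, (m, s, a) in enumerate(kernel_trace)]  # inner loop over threads k
        tables.append(rows)
    return tables

tables = record([[("Global", 4, 0), ("Global", 4, 4)]])
print(tables[0][1])  # {'Tid': 1, 'MemT': 'Global', 'Size': 4, 'Addr': 4}
```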
The third step: memory access divergence judgment. According to the memory access addresses of the GPU threads in each Kernel function, judge whether memory access divergence occurs among the threads of the same Kernel function. The judgment method is as follows:
3.1 Initialize j = 0;
3.2 Obtain the number of threads T_j launched by the Kernel function numbered j; define the set S_j of divergent memory types accessed by this Kernel function and initialize it to empty, i.e., S_j = ∅; define the divergent address set A_j and initialize it to empty, i.e., A_j = ∅; initialize m = 0;
3.3 Query the memory access behavior feature table to obtain the memory type MemT_m accessed by thread T_m, where MemT_m denotes the memory type accessed by the m-th thread T_m. If MemT_m differs from the memory type MemT_{m+1} accessed by thread T_{m+1}, judge that memory access divergence occurs between thread T_m and thread T_{m+1}, and add the divergent memory types to S_j, i.e., S_j = S_j ∪ {MemT_m} and S_j = S_j ∪ {MemT_{m+1}}; then carry out 3.4. Otherwise, directly carry out 3.4;
3.4 Query the memory access behavior feature table to obtain the logical storage address Addr_m accessed by thread T_m, where Addr_m denotes the logical storage address accessed by the m-th thread T_m. If the difference between Addr_m and the logical storage address Addr_{m+1} accessed by the (m+1)-th thread is not equal to the size Size_m of the data accessed by thread T_m, judge that memory access divergence occurs between thread T_m and thread T_{m+1}, and incorporate the two-tuples formed by the divergent memory type and thread address, (MemT_m, Addr_m) and (MemT_{m+1}, Addr_{m+1}), into the set A_j in turn, i.e., A_j = A_j ∪ {(MemT_m, Addr_m)} and A_j = A_j ∪ {(MemT_{m+1}, Addr_{m+1})}; then carry out 3.5. Otherwise, directly carry out 3.5;
3.5 Update m = m + 1; if m < T_j - 1, go to 3.3; otherwise the pairwise comparison of this Kernel function's thread memory accesses is finished, carry out 3.6;
3.6 Update j = j + 1;
3.7 If j ≤ M - 1, go to 3.2; otherwise no Kernel function remains to be judged for memory access divergence, carry out 3.8;
3.8 The memory access divergence judgment of every Kernel function is complete. If S_j = ∅ and A_j = ∅ for every j, no memory access divergence exists; end the optimization directly and go to the fifth step. Otherwise carry out GPU Kernel recombination optimization and go to the fourth step.
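Assuming a feature table like the one above, the adjacent-thread comparison of steps 3.3-3.5 can be sketched as follows. This is hypothetical Python, not the patent's implementation; `judge_divergence` runs both the type check (3.3) and the address check (3.4) on each adjacent pair:

```python
def judge_divergence(table):
    """Step-3 sketch for one kernel's feature table (rows ordered by Tid):
    S collects memory types involved in type divergence (step 3.3);
    A collects (MemT, Addr) two-tuples whose consecutive addresses are not
    contiguous, i.e. Addr_{m+1} - Addr_m != Size_m (step 3.4)."""
    S, A = set(), set()
    for m in range(len(table) - 1):
        cur, nxt = table[m], table[m + 1]
        if cur["MemT"] != nxt["MemT"]:                  # step 3.3: type divergence
            S |= {cur["MemT"], nxt["MemT"]}
        if nxt["Addr"] - cur["Addr"] != cur["Size"]:    # step 3.4: address divergence
            A |= {(cur["MemT"], cur["Addr"]), (nxt["MemT"], nxt["Addr"])}
    return S, A

# Threads 0 and 1 access contiguous Global words; thread 2 jumps to Shared.
table = [{"Tid": 0, "MemT": "Global", "Size": 4, "Addr": 0},
         {"Tid": 1, "MemT": "Global", "Size": 4, "Addr": 4},
         {"Tid": 2, "MemT": "Shared", "Size": 4, "Addr": 100}]
S, A = judge_divergence(table)
print(sorted(S))  # ['Global', 'Shared']
print(sorted(A))  # [('Global', 4), ('Shared', 100)]
```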
The fourth step: GPU Kernel recombination optimization based on memory access divergence. Recombining the GPU Kernels in which memory access divergence occurs mainly comprises two steps: splitting a GPU Kernel into several sub-Kernels free of memory access divergence, and fusing the sub-Kernels whose memory accesses are contiguous into a new GPU Kernel.
4.1 GPU Kernel splitting based on memory access divergence. Split each GPU Kernel in which memory access divergence occurs into several sub-Kernels free of memory access divergence, as follows:
4.1.1 Initialize j = 0;
4.1.2 Obtain the number of elements |S_j| of the set S_j; that is, the Kernel numbered j will be split into |S_j| sub-Kernels;
4.1.3 Query the feature table corresponding to the Kernel numbered j and obtain the thread groups with identical accessed memory types, yielding |S_j| thread groups; organize each thread group into the thread blocks of one sub-Kernel;
4.1.4 Update j = j + 1;
4.1.5 If j ≤ M - 1, go to 4.1.2; otherwise carry out 4.1.6;
4.1.6 The GPU Kernel splitting based on memory type is complete; go to 4.1.7;
4.1.7 Initialize j = 0;
4.1.8 If S_j is empty, i.e., S_j = ∅, go to 4.1.9; otherwise rebuild the memory access behavior feature tables for each Kernel function of the split GPU program by returning to the first step;
4.1.9 Update j = j + 1; if j ≤ M - 1, go to 4.1.8; otherwise go to 4.1.10;
4.1.10 Initialize j = 0;
4.1.11 Obtain the number of elements |A_j| of the set A_j; that is, the Kernel numbered j will be split into |A_j| sub-Kernels;
4.1.12 Query the feature table corresponding to the Kernel numbered j and split the Kernel numbered j into |A_j| sub-Kernels with the elements of A_j as the boundaries;
4.1.13 Update j = j + 1;
4.1.14 If j ≤ M - 1, go to 4.1.11; otherwise go to 4.1.15;
4.1.15 Initialize j = 0;
4.1.16 If A_j is empty, i.e., A_j = ∅, go to 4.1.17; otherwise go to 4.1.11;
4.1.17 Update j = j + 1; if j ≤ M - 1, go to 4.1.16; otherwise go to 4.2.
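The type-based split of steps 4.1.2-4.1.3 amounts to grouping a kernel's threads by accessed memory type. A hypothetical Python sketch (the address-based split over A_j would partition the rows at the divergent addresses in the same way):

```python
from itertools import groupby

def split_kernel(table):
    """Sketch of step 4.1: partition one kernel's feature-table rows into
    sub-kernels such that each sub-kernel touches a single memory type."""
    key = lambda row: row["MemT"]
    ordered = sorted(table, key=key)   # gather threads of the same memory type
    return [list(group) for _, group in groupby(ordered, key=key)]

table = [{"Tid": 0, "MemT": "Global", "Size": 4, "Addr": 0},
         {"Tid": 1, "MemT": "Shared", "Size": 4, "Addr": 0},
         {"Tid": 2, "MemT": "Global", "Size": 4, "Addr": 4}]
subs = split_kernel(table)
print(len(subs))                        # 2 sub-kernels: one Global, one Shared
print([row["Tid"] for row in subs[0]])  # [0, 2]
```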
4.2 Fuse the GPU Kernels with contiguous memory accesses. After splitting, the GPU kernel program may contain Kernel fragments that hurt GPU Kernel execution efficiency. Fusing the threads with contiguous accesses to the same memory into one new, larger GPU Kernel can significantly improve GPU system efficiency and application performance. The Kernels considered for fusion include both the Kernels produced by splitting and the Kernels that were not split: for example, if Kernel1 was split into Kernel11 and Kernel12 while Kernel2 was not split, the fusion must still consider forming a new Kernel from Kernel11, Kernel12, and Kernel2. The concrete Kernel fusion method is as follows:
4.2.1 Use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program processed in step 4.1, then use the Record method to record the memory access trace of every thread in every Kernel function;
4.2.2 Initialize j = 0;
4.2.3 Query the feature tables built in 4.2.1 to obtain the memory types accessed by the threads of the Kernel functions numbered j and j+1 (denoted Kernel_j and Kernel_{j+1}). If the memory type accessed by the threads of Kernel_j is identical to that of Kernel_{j+1}, go to 4.2.4; otherwise go to 4.2.6;
4.2.4 Query the feature tables to obtain the start and end addresses of the contiguous regions accessed by Kernel_j and Kernel_{j+1}, and the data size accessed by the last thread. If the start address of Kernel_{j+1}'s region differs from the end address of Kernel_j's region by exactly the size of the data accessed by the last thread of Kernel_j, judge that Kernel_j and Kernel_{j+1} access one contiguous region of storage and go to 4.2.5; otherwise go to 4.2.6;
4.2.5 Fuse Kernel_j and Kernel_{j+1}. The concrete operation is: using the concurrent multi-Kernel merging method described in the background art, re-merge the threads of Kernel_j and Kernel_{j+1} and organize them into one larger, new GPU Kernel;
4.2.6 Update j = j + 1;
4.2.7 If j < M - 1, go to 4.2.3; otherwise proceed to the fifth step.
The fifth step: end.
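The contiguity test of step 4.2.4 and the merge of step 4.2.5 can be sketched at the bookkeeping level as follows. This is hypothetical Python; real fusion would also rewrite and relaunch the kernel code, which this sketch omits:

```python
def can_fuse(table_j, table_j1):
    """Step-4.2.4 sketch: kernels j and j+1 may be fused when all their
    threads access the same memory type and kernel j+1's start address
    continues exactly where kernel j's last access ended."""
    same_type = all(r["MemT"] == table_j[0]["MemT"] for r in table_j + table_j1)
    last = table_j[-1]
    contiguous = table_j1[0]["Addr"] - last["Addr"] == last["Size"]
    return same_type and contiguous

def fuse(table_j, table_j1):
    """Step-4.2.5 sketch: merge the two feature tables into one new kernel,
    renumbering the threads of the fused kernel from 0."""
    merged = table_j + table_j1
    return [dict(row, Tid=t) for t, row in enumerate(merged)]

a = [{"Tid": 0, "MemT": "Global", "Size": 4, "Addr": 0},
     {"Tid": 1, "MemT": "Global", "Size": 4, "Addr": 4}]
b = [{"Tid": 0, "MemT": "Global", "Size": 4, "Addr": 8}]
print(can_fuse(a, b))                  # True
print([r["Tid"] for r in fuse(a, b)])  # [0, 1, 2]
```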
The present invention can achieve the following technical effects:
1. It relieves the memory access pressure of GPU programs and improves the memory access efficiency of the GPU system;
2. It accelerates GPU application programs and improves the resource utilization of the GPU system.
Description of drawings
Fig. 1 shows the structure of the memory access behavior feature table.
Fig. 2 is the overall flow chart of the GPU Kernel recombination optimization based on memory access divergence.
Embodiment
Fig. 1 shows the structure of the memory access behavior feature table, which is established as follows: one table is created for each Kernel function of the GPU program, comprising four fields, namely the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical storage address Addr of the access. Tid is the unique number of the thread within the scope of this Kernel function; MemT is the type of memory the thread accesses, one of global memory (Global Memory), shared memory (Shared Memory), texture memory (Texture Memory), and constant memory (Constant Memory); Size is the number of bytes of storage occupied by the data this thread accesses; Addr is the address space holding the data that the thread's computation needs.
Fig. 2 is the overall flow chart of the present invention. The concrete implementation steps are as follows:
The first step: build the memory access behavior feature tables.
The second step: record the memory access trace of every thread in every Kernel function.
The third step: memory access divergence judgment.
The fourth step: GPU Kernel recombination optimization based on memory access divergence.
4.1 GPU Kernel splitting based on memory access divergence.
4.2 GPU Kernel fusion based on contiguous memory accesses.
The fifth step: end.
Claims (1)
1. A GPU kernel program recombination optimization method based on memory access divergence, characterized in that it comprises the following steps:
The first step: use the Create method to build the memory access behavior feature tables. The concrete steps are: establish one memory access behavior feature table for each Kernel function of the GPU program; the table comprises four fields, namely the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical storage address Addr of the access; Tid is the unique number of the thread within the scope of this Kernel function; MemT is the type of memory the thread accesses; Size is the number of bytes of storage occupied by the data this thread accesses; Addr is the address space holding the data that the thread's computation needs;
The second step: use the Record method to record the memory access trace of every thread in every Kernel function. The concrete steps are:
2.1 Scan the GPU kernel program; number the Kernel functions Kid = 0, 1, ..., i, ..., M-1, where 0 ≤ i < M and M is the number of Kernel functions in the GPU kernel program; let T_i be the number of threads launched by the Kernel function numbered i; there are M feature tables in total, and the table corresponding to the Kernel function numbered i has T_i entries; write the memory access trace information of the Kernel threads in the GPU program into the corresponding entries and fields of the feature tables;
2.2 Initialize j = 0;
2.3 Obtain the number of threads T_j launched by the Kernel function numbered j, and initialize k = 0;
2.4 Write the memory access trace information of the k-th thread of the Kernel function numbered j into the fields Tid, MemT, Size, and Addr of the feature table; update k = k + 1;
2.5 If k ≤ T_j - 1, go to 2.4; otherwise go to 2.6;
2.6 Update j = j + 1;
2.7 If j ≤ M - 1, go to 2.3; otherwise go to 2.8;
2.8 The memory access trace of every thread in every Kernel function has been recorded; proceed to the third step;
The third step: according to the memory access addresses of the GPU threads in each Kernel function, judge whether memory access divergence occurs among the threads of the same Kernel function, as follows:
3.1 Initialize j = 0;
3.2 Obtain the number of threads T_j launched by the Kernel function numbered j; define the set S_j of divergent memory types accessed by this Kernel function and initialize it to empty, i.e., S_j = ∅; define the divergent address set A_j and initialize it to empty, i.e., A_j = ∅; initialize m = 0;
3.3 Query the feature table to obtain the memory type MemT_m accessed by the m-th thread T_m; if MemT_m differs from the memory type MemT_{m+1} accessed by thread T_{m+1}, judge that memory access divergence occurs between thread T_m and thread T_{m+1} and add the divergent memory types to S_j, i.e., S_j = S_j ∪ {MemT_m} and S_j = S_j ∪ {MemT_{m+1}}, then carry out 3.4; otherwise directly carry out 3.4;
3.4 Query the feature table to obtain the logical storage address Addr_m accessed by the m-th thread T_m; if the difference between Addr_m and the address Addr_{m+1} accessed by the (m+1)-th thread is not equal to the size Size_m of the data accessed by thread T_m, judge that memory access divergence occurs between thread T_m and thread T_{m+1} and incorporate the two-tuples of divergent memory type and thread address, (MemT_m, Addr_m) and (MemT_{m+1}, Addr_{m+1}), into A_j in turn, i.e., A_j = A_j ∪ {(MemT_m, Addr_m)} and A_j = A_j ∪ {(MemT_{m+1}, Addr_{m+1})}, then carry out 3.5; otherwise directly carry out 3.5;
3.5 Update m = m + 1; if m < T_j - 1, go to 3.3; otherwise the pairwise comparison of this Kernel function's thread memory accesses is finished, carry out 3.6;
3.6 Update j = j + 1;
3.7 If j ≤ M - 1, go to 3.2; otherwise no Kernel function remains to be judged for memory access divergence, carry out 3.8;
3.8 The memory access divergence judgment of every Kernel function is complete; if S_j = ∅ and A_j = ∅ for every j, go to the fifth step; otherwise carry out the fourth step;
The fourth step: GPU Kernel recombination optimization based on memory access divergence:
4.1 Split each GPU Kernel in which memory access divergence occurs into several sub-Kernels free of memory access divergence, as follows:
4.1.1 Initialize j = 0;
4.1.2 Obtain the number of elements |S_j| of the set S_j; that is, the Kernel numbered j will be split into |S_j| sub-Kernels;
4.1.3 Query the feature table corresponding to the Kernel numbered j and obtain the thread groups with identical accessed memory types, yielding |S_j| thread groups; organize each thread group into the thread blocks of one sub-Kernel;
4.1.4 Update j = j + 1;
4.1.5 If j ≤ M - 1, go to 4.1.2; otherwise carry out 4.1.6;
4.1.6 The GPU Kernel splitting based on memory type is complete; go to 4.1.7;
4.1.7 Initialize j = 0;
4.1.8 If S_j is empty, i.e., S_j = ∅, go to 4.1.9; otherwise rebuild the memory access behavior feature tables for each Kernel function of the split GPU program by returning to the first step;
4.1.9 Update j = j + 1; if j ≤ M - 1, go to 4.1.8; otherwise go to 4.1.10;
4.1.10 Initialize j = 0;
4.1.11 Obtain the number of elements |A_j| of the set A_j; that is, the Kernel numbered j will be split into |A_j| sub-Kernels;
4.1.12 Query the feature table corresponding to the Kernel numbered j and split the Kernel numbered j into |A_j| sub-Kernels with the elements of A_j as the boundaries;
4.1.13 Update j = j + 1;
4.1.14 If j ≤ M - 1, go to 4.1.11; otherwise go to 4.1.15;
4.1.15 Initialize j = 0;
4.1.16 If A_j is empty, i.e., A_j = ∅, go to 4.1.17; otherwise go to 4.1.11;
4.1.17 Update j = j + 1; if j ≤ M - 1, go to 4.1.16; otherwise go to 4.2;
4.2 Fuse the GPU Kernels with contiguous memory accesses. The concrete fusion method is as follows:
4.2.1 Use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program processed in step 4.1, then use the Record method to record the memory access trace of every thread in every Kernel function;
4.2.2 Initialize j = 0;
4.2.3 Query the feature tables built in 4.2.1 to obtain the memory types accessed by the threads of the Kernel functions numbered j and j+1 (denoted Kernel_j and Kernel_{j+1}); if the memory type accessed by the threads of Kernel_j is identical to that of Kernel_{j+1}, go to 4.2.4; otherwise go to 4.2.6;
4.2.4 Query the feature tables to obtain the start and end addresses of the contiguous regions accessed by Kernel_j and Kernel_{j+1}, and the data size accessed by the last thread; if the start address of Kernel_{j+1}'s region differs from the end address of Kernel_j's region by exactly the size of the data accessed by the last thread of Kernel_j, judge that Kernel_j and Kernel_{j+1} access one contiguous region of storage and go to 4.2.5; otherwise go to 4.2.6;
4.2.5 Fuse Kernel_j and Kernel_{j+1}; the method is: using the concurrent multi-Kernel merging method, re-merge the threads of Kernel_j and Kernel_{j+1} and organize them into one new GPU Kernel;
4.2.6 Update j = j + 1;
4.2.7 If j < M - 1, go to 4.2.3; otherwise proceed to the fifth step;
The fifth step: end.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201310000459.7A | 2013-01-03 | 2013-01-03 | Memory-access-divergence-based GPU kernel program recombination optimization method
Publications (2)
Publication Number | Publication Date
---|---
CN103150157A | 2013-06-12
CN103150157B | 2015-11-25
Family
ID=48548259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201310000459.7A (Expired - Fee Related) | Memory-access-divergence-based GPU kernel program recombination optimization method | 2013-01-03 | 2013-01-03
Country Status (1)
Country | Link
---|---
CN | CN103150157B (en)
Cited By (5)
Publication number | Priority date | Publication date | Title
---|---|---|---
CN104199782A | 2014-08-25 | 2014-12-10 | GPU memory access method
CN104199782B | 2014-08-25 | 2017-04-26 | GPU memory access method (granted)
CN107291537A | 2017-06-07 | 2017-10-24 | Optimization method for the use of on-chip GPU storage space
CN109725903A | 2017-10-30 | 2019-05-07 | Program code transformation method, device, and compiling system
CN109783222A | 2017-11-15 | 2019-05-21 | Method and apparatus for eliminating branch divergence
Non-Patent Citations (2)
Title
---
Gan Xinbiao et al., "Elliptic curve encryption streaming technique for many-core GPU architectures", Journal of Sichuan University (Engineering Science Edition)
Ma Anguo, "Research on key technologies of high-efficiency GPGPU architectures", China Doctoral Dissertations Full-text Database
Also Published As
Publication number | Publication date
---|---
CN103150157B | 2015-11-25
Legal Events
Code | Title | Description
---|---|---
C06 / PB01 | Publication |
C10 / SE01 | Entry into substantive examination |
C14 / GR01 | Grant of patent or utility model |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2015-11-25; Termination date: 2022-01-03