CN103150157B - GPU kernel program restructuring optimization method based on memory access divergence - Google Patents


Info

Publication number
CN103150157B
CN103150157B (application CN201310000459.7A; also published as CN103150157A)
Authority
CN
China
Prior art keywords
kernel
thread
memory access
memory
function
Prior art date
Legal status
Expired - Fee Related
Application number
CN201310000459.7A
Other languages
Chinese (zh)
Other versions
CN103150157A
Inventor
甘新标
刘杰
迟利华
晏益慧
徐涵
胡庆丰
王志英
苏博
朱琪
刘聪
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201310000459.7A
Publication of application CN103150157A
Application granted; publication of CN103150157B
Legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a GPU kernel program restructuring optimization method based on memory access divergence, whose object is to raise the execution efficiency of large-scale GPU Kernels and overall application performance. The technical scheme first uses the Create method to build a memory access behavior feature table, then uses the Record method to record the memory access trace of each thread in each Kernel function, then judges from the memory access addresses of the GPU threads in each Kernel function whether memory access divergence occurs among the threads of the same Kernel function, and finally performs GPU Kernel restructuring optimization based on memory access divergence, comprising two steps: splitting GPU Kernels based on memory access divergence, and fusing GPU Kernels with contiguous memory accesses. The invention solves the problem of low execution efficiency of large-scale multi-Kernel GPU applications, raising the execution efficiency of large-scale GPU Kernels and application performance.

Description

GPU kernel program restructuring optimization method based on memory access divergence
Technical field
The present invention relates to restructuring optimization methods for GPU kernel programs (GPU Kernels), and especially to a GPU Kernel restructuring optimization method based on memory access divergence.
Background technology
In recent years, the powerful computing capability, massively threaded concurrent execution model, and flexible programming model of GPUs (Graphics Processing Units) have made them widely used in many high-performance computing fields such as molecular dynamics simulation, biological gene analysis, and weather forecasting. Facing large-scale GPGPU (General-Purpose computing on Graphics Processing Units) applications, however, the standard kernel-program (Kernel) pattern cannot meet the demands of large-scale application programs.
A GPU kernel program (GPU Kernel) is a program segment that runs on the GPU. Programmers usually port the computation-intensive, time-consuming core subroutines of a program onto the GPU for acceleration; such core subroutines running on the GPU are commonly called GPU Kernels.
A large-scale GPGPU application may contain tens or even hundreds of GPU Kernels. To improve the scheduling efficiency of so many Kernels and exploit program parallelism to the greatest extent, multi-Kernel restructuring optimization has become an effective way to raise program efficiency. Current GPU Kernel restructuring optimization methods mainly include the following:
(1) Multi-Kernel merging based on Kernel concurrency. GPU architectures before NVIDIA's second-generation unified architecture did not support concurrent execution of multiple Kernels, so merging independent GPU Kernels was an effective way to let them execute concurrently. Although second-generation unified-architecture GPUs can execute multiple Kernels concurrently, the number of concurrent Kernels is still very limited. Multi-Kernel merging can therefore improve inter-Kernel concurrency, relieve the pressure of sequential Kernel execution, reduce Kernel launch overhead, and improve the operating efficiency of GPU programs.
(2) Multi-Kernel merging based on GPU shared memory. If there is a data dependence between GPU Kernels, i.e. the output of one Kernel is exactly the input of another, the Kernels with input/output data dependences can be merged into a single GPU Kernel that explicitly manages GPU shared memory as the intermediate data store, avoiding each Kernel's long-latency sequential accesses to global memory. This improves memory access efficiency and at the same time reduces the number of GPU Kernels, cutting launch overhead, easing Kernel scheduling, and improving GPU program efficiency.
(3) Multi-Kernel restructuring based on program branches. Unlike traditional CPU architectures, GPUs devote most on-chip resources to computation, leaving control and branch-prediction units relatively scarce. Avoiding thread branches inside a GPU Kernel is therefore vital for execution efficiency. Researchers have accordingly proposed separating the different program branches within one GPU Kernel into different GPU Sub-Kernels, and then merging the Sub-Kernels that share an execution path into a new GPU Kernel. Experiments show that branch-based Kernel restructuring effectively avoids the thread waiting and synchronization caused by GPU thread branches and significantly improves Kernel execution efficiency.
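Method (1) above, merging independent Kernels so that one launch covers both workloads, can be sketched in plain Python. This is a simulation under our own assumptions: real GPU Kernels would be device functions, and all names here are illustrative, not the patent's.

```python
# Hypothetical sketch of concurrency-based multi-Kernel merging: two independent
# "kernels", modelled as per-thread Python functions, are combined into one
# merged kernel that dispatches on the global thread id.

def kernel_a(tid, out):
    out[tid] = tid * 2          # first independent workload

def kernel_b(tid, out):
    out[tid] = tid + 100        # second independent workload

def merge(k1, n1, k2, n2):
    """Build one kernel covering k1's n1 threads followed by k2's n2 threads."""
    def merged(tid, out1, out2):
        if tid < n1:
            k1(tid, out1)       # first n1 threads run kernel_a's work
        elif tid < n1 + n2:
            k2(tid - n1, out2)  # remaining threads run kernel_b's work
    return merged, n1 + n2

merged, total = merge(kernel_a, 4, kernel_b, 4)
out1, out2 = [0] * 4, [0] * 4
for tid in range(total):        # one "launch" instead of two
    merged(tid, out1, out2)
print(out1, out2)  # [0, 2, 4, 6] [100, 101, 102, 103]
```

The design point is simply that a single launch amortizes startup cost over both workloads, which is the benefit the text attributes to multi-Kernel merging.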
The above three classes of GPU Kernel restructuring optimization methods, aimed at concrete large-scale GPU applications, can to some extent improve Kernel execution efficiency and application performance. However, they all ignore the influence of GPU thread memory access patterns on Kernel execution efficiency. In practice, the memory access behavior of GPU threads strongly affects Kernel execution efficiency. How to solve, in practical applications, the low GPU system efficiency and poor GPU application performance caused by thread memory access behavior is therefore an important technical problem of concern to those skilled in the art.
The memory access behavior of GPU threads falls into two classes:
(1) The threads in a Kernel access one contiguous region of memory. This is the idealized memory access behavior of GPU threads, under which GPU Kernel execution efficiency is highest;
(2) The memory accesses of the threads in a Kernel jump: the threads access disjoint memory regions, or access the same region non-contiguously.
Memory access behavior in which the threads of a Kernel jump in this way is referred to as GPU thread memory access divergence.
SIMD (Single Instruction Multiple Data) acceleration has become an effective way of raising architectural efficiency, and the GPU is a typical SIMD-accelerated architecture. The SIMD mode of operation assumes that GPU threads access one contiguous region of memory. If GPU thread accesses jump, SIMD not only fails to raise system efficiency and accelerate the application, but can seriously harm GPU system efficiency and application performance. Restructuring the GPU Kernels whose threads exhibit memory access divergence is thus a key technique for raising GPU system performance and application efficiency: eliminating thread memory access divergence through GPU Kernel restructuring can raise both to the greatest extent. No published literature yet covers a related technical scheme.
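The two behavior classes just described can be made concrete with a small sketch. The trace layout and function name are our own assumptions; the patent only specifies the four fields Tid, MemT, Size, and Addr.

```python
# Hypothetical sketch: deciding whether a kernel's per-thread accesses form one
# contiguous region (the ideal SIMD-friendly pattern, class 1) or exhibit
# divergence (jumps in address or mixed memory types, class 2).

def has_divergence(trace):
    """trace: list of (tid, mem_type, size, addr) tuples, ordered by tid."""
    for (t0, m0, s0, a0), (t1, m1, s1, a1) in zip(trace, trace[1:]):
        if m0 != m1:          # adjacent threads touch different memory types
            return True
        if a1 - a0 != s0:     # gap or overlap between consecutive accesses
            return True
    return False

contiguous = [(t, "Global", 4, 0x1000 + 4 * t) for t in range(8)]  # class 1
strided    = [(t, "Global", 4, 0x1000 + 8 * t) for t in range(8)]  # class 2
print(has_divergence(contiguous))  # False
print(has_divergence(strided))     # True
```

The strided trace leaves a 4-byte hole between neighbours, which is exactly the "jump" the patent classifies as divergence.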
Summary of the invention
The technical problem to be solved by the present invention is the low execution efficiency of large-scale multi-Kernel GPU applications. A GPU Kernel restructuring optimization method based on memory access divergence is proposed to raise the execution efficiency of large-scale GPU Kernels and application performance.
To solve the above technical problem, the concrete technical scheme of the present invention is:
First step: use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program. The concrete steps of the Create method are: establish one memory access behavior feature table per Kernel function of the GPU program; each table contains four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical address Addr of the memory accessed. Tid is the unique number of the thread within the scope of its Kernel function; MemT is the type of memory the thread accesses, one of global memory (Global), shared memory (SharedMemory), texture memory (TextureMemory), or constant memory (ConstantMemory); Size is the number of bytes of memory occupied by the data the thread accesses; Addr is the address of the data the thread's computation needs.
Second step: use the Record method to record the memory access trace of each thread in each Kernel function. The concrete steps of the Record method are:
2.1 Scan the GPU kernel program. Number the Kernel functions Kid = 0, 1, …, i, …, M−1, where 0 ≤ i < M and M is the number of Kernel functions in the GPU kernel program; let T_i be the number of threads launched by the Kernel function numbered i. There are then M memory access behavior feature tables in total, and the table corresponding to the Kernel function numbered i has T_i entries. Write the memory access trace information of the Kernel threads in the GPU program into the corresponding entries and fields of the tables as follows;
2.2 Initialize j = 0;
2.3 Obtain the number of threads T_j launched by the Kernel function numbered j, and initialize k = 0;
2.4 Write the memory access trace information of the k-th thread of the Kernel function numbered j into the fields Tid, MemT, Size and Addr of its feature table; update k = k + 1;
2.5 If k ≤ T_j − 1, go to 2.4; otherwise go to 2.6;
2.6 Update j = j + 1;
2.7 If j ≤ M − 1, go to 2.3; otherwise go to 2.8;
2.8 The memory access trace of every thread in every Kernel function has been recorded; execute the third step.
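The Record loops (2.2–2.8) can be sketched as nested iteration over Kernels and their threads. `get_trace` stands in for the real instrumentation, which the patent does not specify; the plain-list tables are likewise our own simplification.

```python
# Sketch of the Record method: for each Kernel j with T_j threads, append each
# thread's (Tid, MemT, Size, Addr) trace into the j-th feature table.

def record(tables, thread_counts, get_trace):
    j = 0
    while j <= len(tables) - 1:          # 2.3-2.7: loop over Kernel functions
        k = 0
        while k <= thread_counts[j] - 1: # 2.4-2.5: loop over this Kernel's threads
            tid, memt, size, addr = get_trace(j, k)
            tables[j].append({"Tid": tid, "MemT": memt, "Size": size, "Addr": addr})
            k += 1
        j += 1
    return tables

# Fake instrumentation: thread k of kernel j reads 4 bytes at 0x1000*j + 4*k.
fake = lambda j, k: (k, "Global", 4, 0x1000 * j + 4 * k)
tables = record([[], []], [4, 2], fake)
print([len(t) for t in tables])  # [4, 2]
```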
Third step: memory access divergence judgement. According to the memory access addresses of the GPU threads in each Kernel function, judge whether memory access divergence occurs among the threads of the same Kernel function. The judgement method is as follows:
3.1 Initialize j = 0;
3.2 Obtain the number of threads T_j launched by the Kernel function numbered j; define the set S_j of memory types accessed by this Kernel function as empty, i.e. S_j = ∅; define the divergence address set A_j as empty, i.e. A_j = ∅; initialize m = 0;
3.3 Query the feature table to obtain MemT_m, the memory type accessed by the m-th thread T_m. If MemT_m differs from MemT_{m+1}, the type accessed by thread T_{m+1}, judge that memory access divergence occurs between threads T_m and T_{m+1}, and add the diverging memory types to S_j, i.e. S_j = S_j ∪ {MemT_m, MemT_{m+1}}; execute 3.4. Otherwise execute 3.4 directly;
3.4 Query the feature table to obtain Addr_m, the logical memory address accessed by the m-th thread T_m. If the difference between Addr_{m+1}, the address accessed by the (m+1)-th thread, and Addr_m is not equal to Size_m, the size of the data accessed by thread T_m, judge that memory access divergence occurs between threads T_m and T_{m+1}, and add the two tuples formed by the diverging memory types and thread addresses, (MemT_m, Addr_m) and (MemT_{m+1}, Addr_{m+1}), to the set A_j; execute 3.5. Otherwise execute 3.5 directly;
3.5 Update m = m + 1. If m < T_j − 1, go to 3.3; otherwise the pairwise comparison of this Kernel's thread accesses is complete; execute 3.6;
3.6 Update j = j + 1;
3.7 If j ≤ M − 1, go to 3.2; otherwise no Kernel function remains to be judged for divergence; execute 3.8;
3.8 The divergence judgement of every Kernel function is complete. If S_j = ∅ and A_j = ∅ for every j, no memory access divergence exists; end the optimization directly and go to the fifth step. Otherwise perform GPU Kernel restructuring optimization: go to the fourth step.
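The step-3 judgement for one Kernel can be sketched as a walk over adjacent thread pairs, collecting diverging memory types into S_j and diverging (MemT, Addr) pairs into A_j. The tuple-based row layout is our own assumption.

```python
# Sketch of the divergence judgement (3.3-3.4) for a single feature table whose
# rows are (tid, memt, size, addr) tuples ordered by tid.

def judge(table):
    S, A = set(), set()
    for (t0, m0, s0, a0), (t1, m1, s1, a1) in zip(table, table[1:]):
        if m0 != m1:                 # 3.3: memory-type divergence
            S.update({m0, m1})
        if a1 - a0 != s0:            # 3.4: address-gap divergence
            A.update({(m0, a0), (m1, a1)})
    return S, A

table = [(0, "Global", 4, 0x100), (1, "Global", 4, 0x104),
         (2, "SharedMemory", 4, 0x200)]
S, A = judge(table)
print(sorted(S))   # ['Global', 'SharedMemory']
print(len(A))      # 2
```

Here threads 0 and 1 are contiguous, while thread 2 diverges in both memory type and address, so both sets become non-empty and step 4 would be triggered.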
Fourth step: GPU Kernel restructuring optimization based on memory access divergence. Restructuring the GPU Kernels in which divergence occurs mainly comprises two steps: splitting each diverging GPU Kernel into multiple sub-Kernels free of divergence, and fusing sub-Kernels with contiguous memory accesses into a new GPU Kernel.
4.1 Split GPU Kernels based on memory access divergence. Each GPU Kernel in which divergence occurs is split into multiple sub-Kernels free of divergence, as follows:
4.1.1 Initialize j = 0;
4.1.2 Obtain the number of elements Num_Sj of the set S_j; the Kernel function numbered j is to be split into Num_Sj sub-Kernels;
4.1.3 Query the feature table corresponding to the Kernel numbered j and collect the thread groups that access the same memory type, obtaining Num_Sj thread groups; organize each thread group as the thread block of one sub-Kernel;
4.1.4 Update j = j + 1;
4.1.5 If j ≤ M − 1, go to 4.1.2; otherwise execute 4.1.6;
4.1.6 The GPU Kernel splitting based on memory type is complete; go to 4.1.7;
4.1.7 Initialize j = 0;
4.1.8 If S_j is empty, i.e. S_j = ∅, go to 4.1.9; otherwise build memory access behavior feature tables for each Kernel function of the GPU program after splitting: go to the first step;
4.1.9 Update j = j + 1; if j ≤ M − 1, go to 4.1.8; otherwise go to 4.1.10;
4.1.10 Initialize j = 0;
4.1.11 Obtain the number of elements Num_Aj of the set A_j; the Kernel numbered j is to be split into Num_Aj sub-Kernels;
4.1.12 Query the feature table corresponding to the Kernel numbered j and, taking the elements of A_j as boundaries, split the Kernel numbered j into Num_Aj sub-Kernels;
4.1.13 Update j = j + 1;
4.1.14 If j ≤ M − 1, go to 4.1.11; otherwise go to 4.1.15;
4.1.15 Initialize j = 0;
4.1.16 If A_j is empty, i.e. A_j = ∅, go to 4.1.17; otherwise go to 4.1.11;
4.1.17 Update j = j + 1; if j ≤ M − 1, go to 4.1.16; otherwise go to 4.2.
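The memory-type split of 4.1.2–4.1.3 amounts to grouping a diverging Kernel's threads by the memory type they access, each group becoming one sub-Kernel. A sketch under the same tuple-row assumption as before:

```python
# Sketch of splitting by memory type: group rows (tid, memt, size, addr) by
# MemT; the number of groups is Num_Sj and each group seeds one sub-Kernel.
from itertools import groupby

def split_by_mem_type(table):
    key = lambda row: row[1]
    ordered = sorted(table, key=key)           # bring equal types together
    return {memt: list(rows) for memt, rows in groupby(ordered, key=key)}

table = [(0, "Global", 4, 0x100), (1, "SharedMemory", 4, 0x200),
         (2, "Global", 4, 0x104)]
subs = split_by_mem_type(table)
print(sorted(subs))         # ['Global', 'SharedMemory']
print(len(subs["Global"]))  # 2
```

Within each resulting group the memory type is uniform, so only address-gap divergence (handled by the A_j-based split of 4.1.11–4.1.12) can remain.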
4.2 Fuse the GPU Kernels with contiguous memory accesses. After Kernel splitting, the GPU kernel program may contain Kernel fragments that hurt GPU Kernel execution efficiency. Fusing threads that contiguously access the same memory into one larger GPU Kernel can significantly raise GPU system efficiency and application performance. The Kernels considered for fusion include both Kernels produced by splitting and Kernels that were not split: for example, if Kernel1 was split into Kernel11 and Kernel12 while Kernel2 was not split, fusion must still consider merging Kernel11, Kernel12 and Kernel2 into new Kernels. The concrete Kernel fusion method is as follows:
4.2.1 Use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program obtained from step 4.1, then use the Record method to record the memory access trace of each thread in each Kernel function;
4.2.2 Initialize j = 0;
4.2.3 Query the feature tables built in 4.2.1 to obtain the memory types accessed by the threads of the Kernels numbered j and j+1. If the memory type accessed by the threads of the Kernel numbered j (denoted Kernel_j) is identical to that of the Kernel numbered j+1 (denoted Kernel_{j+1}), go to 4.2.4; otherwise go to 4.2.6;
4.2.4 Query the feature tables to obtain the start and end addresses of the contiguous regions accessed by Kernel_j and Kernel_{j+1}, and the data size of the last thread access in each. If the end address of Kernel_j's region differs from the start address of Kernel_{j+1}'s region by the size of the last access in Kernel_j, or the end address of Kernel_{j+1}'s region differs from the start address of Kernel_j's region by the size of the last access in Kernel_{j+1}, judge that Kernel_j and Kernel_{j+1} access one contiguous region of memory; go to 4.2.5. Otherwise go to 4.2.6;
4.2.5 Fuse Kernel_j and Kernel_{j+1}: using the concurrency-based multi-Kernel merging method described in the background art, recombine the threads of Kernel_j and Kernel_{j+1} and organize them into one larger new GPU Kernel;
4.2.6 Update j = j + 1;
4.2.7 If j < M − 1, go to 4.2.3; otherwise execute the fifth step.
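The fusion test of 4.2.3–4.2.4 can be sketched as follows: two kernels qualify when they touch the same memory type and their address ranges abut, i.e. one region starts exactly one last-access-size past the other's end. Helper names and row layout are ours, not the patent's.

```python
# Sketch of the contiguity check gating fusion (4.2.3-4.2.4).

def can_fuse(a, b):
    """a, b: rows of (tid, memt, size, addr), each ordered by address."""
    if a[0][1] != b[0][1]:                    # 4.2.3: memory types must match
        return False
    end_a, size_a = a[-1][3], a[-1][2]        # end of a's region, last size
    end_b, size_b = b[-1][3], b[-1][2]
    # 4.2.4: regions abut in either order
    return (b[0][3] - end_a == size_a) or (a[0][3] - end_b == size_b)

k1 = [(t, "Global", 4, 0x100 + 4 * t) for t in range(4)]  # 0x100..0x10C
k2 = [(t, "Global", 4, 0x110 + 4 * t) for t in range(4)]  # starts right after k1
print(can_fuse(k1, k2))  # True
```

A kernel whose region starts at, say, 0x200 would fail the check against k1, since the gap exceeds the last access size, and would be left unfused per 4.2.6.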
Fifth step: end.
Adopting the present invention achieves the following technical effects:
1. GPU program memory access pressure is relieved, improving the memory access efficiency of the GPU system;
2. GPU applications run faster and the resource utilization of the GPU system is improved.
Brief description of the drawings
Fig. 1 shows the structure of the memory access behavior feature table.
Fig. 2 is the overall flow chart of GPU Kernel restructuring optimization based on memory access divergence.
Detailed description
Fig. 1 shows the structure of the memory access behavior feature table, which is established as follows:
Establish one memory access behavior feature table for each Kernel function of the GPU program; each table contains four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical address Addr of the memory accessed. Tid is the unique number of the thread within the scope of its Kernel function; MemT is the type of memory the thread accesses, one of global memory (Global), shared memory (SharedMemory), texture memory (TextureMemory), or constant memory (ConstantMemory); Size is the number of bytes of memory occupied by the data the thread accesses; Addr is the address of the data the thread's computation needs.
Fig. 2 is the overall flow chart of the present invention; the concrete implementation steps are as follows:
First step: build the memory access behavior feature tables.
Second step: record the memory access trace of each thread in each Kernel function.
Third step: memory access divergence judgement.
Fourth step: GPU Kernel restructuring optimization based on memory access divergence.
4.1 Split GPU Kernels based on memory access divergence.
4.2 Fuse GPU Kernels based on contiguous memory accesses.
Fifth step: end.

Claims (1)

1. A GPU kernel program restructuring optimization method based on memory access divergence, characterized by comprising the following steps:
First step: use the Create method to build memory access behavior feature tables. The concrete steps are: establish one memory access behavior feature table for each Kernel function of the GPU program; each table contains four fields: the thread number Tid, the memory type MemT accessed by the thread, the data size Size accessed by the thread, and the logical address Addr of the memory accessed; Tid is the unique number of the thread within the scope of its Kernel function; MemT is the type of memory the thread accesses; Size is the number of bytes of memory occupied by the data the thread accesses; Addr is the address of the data the thread's computation needs;
Second step: use the Record method to record the memory access trace of each thread in each Kernel function. The concrete steps are:
2.1 Scan the GPU kernel program; number the Kernel functions Kid = 0, 1, …, i, …, M−1, where 0 ≤ i < M and M is the number of Kernel functions in the GPU kernel program; let T_i be the number of threads launched by the Kernel function numbered i; there are then M memory access behavior feature tables in total, and the table corresponding to the Kernel function numbered i has T_i entries; write the memory access trace information of the Kernel threads in the GPU program into the corresponding entries and fields of the tables;
2.2 Initialize j = 0;
2.3 Obtain the number of threads T_j launched by the Kernel function numbered j, and initialize k = 0;
2.4 Write the memory access trace information of the k-th thread of the Kernel function numbered j into the fields Tid, MemT, Size and Addr of its feature table; update k = k + 1;
2.5 If k ≤ T_j − 1, go to 2.4; otherwise go to 2.6;
2.6 Update j = j + 1;
2.7 If j ≤ M − 1, go to 2.3; otherwise go to 2.8;
2.8 The memory access trace of every thread in every Kernel function has been recorded; execute the third step;
Third step: according to the memory access addresses of the GPU threads in each Kernel function, judge whether memory access divergence occurs among the threads of the same Kernel function, as follows:
3.1 Initialize j = 0;
3.2 Obtain the number of threads T_j launched by the Kernel function numbered j; define the set S_j of memory types accessed by this Kernel function as empty, i.e. S_j = ∅; define the divergence address set A_j as empty, i.e. A_j = ∅; initialize m = 0;
3.3 Query the feature table to obtain MemT_m, the memory type accessed by the m-th thread T_m; if MemT_m differs from MemT_{m+1}, the type accessed by thread T_{m+1}, judge that memory access divergence occurs between threads T_m and T_{m+1}, and add the diverging memory types to S_j, i.e. S_j = S_j ∪ {MemT_m, MemT_{m+1}}; execute 3.4; otherwise execute 3.4 directly;
3.4 Query the feature table to obtain Addr_m, the logical memory address accessed by the m-th thread T_m; if the difference between Addr_{m+1} and Addr_m is not equal to Size_m, the size of the data accessed by thread T_m, judge that memory access divergence occurs between threads T_m and T_{m+1}, and add the tuples (MemT_m, Addr_m) and (MemT_{m+1}, Addr_{m+1}) to the set A_j; execute 3.5; otherwise execute 3.5 directly;
3.5 Update m = m + 1; if m < T_j − 1, go to 3.3; otherwise the pairwise comparison of this Kernel's thread accesses is complete; execute 3.6;
3.6 Update j = j + 1;
3.7 If j ≤ M − 1, go to 3.2; otherwise no Kernel function remains to be judged for divergence; execute 3.8;
3.8 The divergence judgement of every Kernel function is complete; if S_j = ∅ and A_j = ∅ for every j, go to the fifth step; otherwise execute the fourth step;
Fourth step: GPU Kernel restructuring optimization based on memory access divergence:
4.1 Split each GPU Kernel in which divergence occurs into multiple sub-Kernels free of divergence, as follows:
4.1.1 Initialize j = 0;
4.1.2 Obtain the number of elements Num_Sj of the set S_j; the Kernel function numbered j is to be split into Num_Sj sub-Kernels;
4.1.3 Query the feature table corresponding to the Kernel numbered j and collect the thread groups that access the same memory type, obtaining Num_Sj thread groups; organize each thread group as the thread block of one sub-Kernel;
4.1.4 Update j = j + 1;
4.1.5 If j ≤ M − 1, go to 4.1.2; otherwise execute 4.1.6;
4.1.6 The GPU Kernel splitting based on memory type is complete; go to 4.1.7;
4.1.7 Initialize j = 0;
4.1.8 If S_j is empty, i.e. S_j = ∅, go to 4.1.9; otherwise go to the first step;
4.1.9 Update j = j + 1; if j ≤ M − 1, go to 4.1.8; otherwise go to 4.1.10;
4.1.10 Initialize j = 0;
4.1.11 Obtain the number of elements Num_Aj of the set A_j; the Kernel numbered j is to be split into Num_Aj sub-Kernels;
4.1.12 Query the feature table corresponding to the Kernel numbered j and, taking the elements of A_j as boundaries, split the Kernel numbered j into Num_Aj sub-Kernels;
4.1.13 Update j = j + 1;
4.1.14 If j ≤ M − 1, go to 4.1.11; otherwise go to 4.1.15;
4.1.15 Initialize j = 0;
4.1.16 If A_j is empty, i.e. A_j = ∅, go to 4.1.17; otherwise go to 4.1.11;
4.1.17 Update j = j + 1; if j ≤ M − 1, go to 4.1.16; otherwise go to 4.2;
4.2 Fuse the GPU Kernels with contiguous memory accesses, as follows:
4.2.1 Use the Create method to build a memory access behavior feature table for each Kernel function of the GPU program obtained from step 4.1, then use the Record method to record the memory access trace of each thread in each Kernel function;
4.2.2 Initialize j = 0;
4.2.3 Query the feature tables built in 4.2.1 to obtain the memory types accessed by the threads of the Kernels numbered j and j+1; if the memory type accessed by the threads of the Kernel numbered j (denoted Kernel_j) is identical to that of the Kernel numbered j+1 (denoted Kernel_{j+1}), go to 4.2.4; otherwise go to 4.2.6;
4.2.4 Query the feature tables to obtain the start and end addresses of the contiguous regions accessed by Kernel_j and Kernel_{j+1}, and the data size of the last thread access in each; if the end address of Kernel_j's region differs from the start address of Kernel_{j+1}'s region by the size of the last access in Kernel_j, or the end address of Kernel_{j+1}'s region differs from the start address of Kernel_j's region by the size of the last access in Kernel_{j+1}, judge that Kernel_j and Kernel_{j+1} access one contiguous region of memory; go to 4.2.5; otherwise go to 4.2.6;
4.2.5 Fuse Kernel_j and Kernel_{j+1}: using the concurrency-based multi-Kernel merging method, recombine the threads of Kernel_j and Kernel_{j+1} and organize them into one new GPU Kernel;
4.2.6 Update j = j + 1;
4.2.7 If j < M − 1, go to 4.2.3; otherwise execute the fifth step;
Fifth step: end.
CN201310000459.7A 2013-01-03 2013-01-03 GPU kernel program restructuring optimization method based on memory access divergence Expired - Fee Related CN103150157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310000459.7A CN103150157B (en) 2013-01-03 2013-01-03 GPU kernel program restructuring optimization method based on memory access divergence


Publications (2)

Publication Number Publication Date
CN103150157A CN103150157A (en) 2013-06-12
CN103150157B true CN103150157B (en) 2015-11-25

Family

ID=48548259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310000459.7A Expired - Fee Related CN103150157B (en) 2013-01-03 2013-01-03 GPU kernel program restructuring optimization method based on memory access divergence

Country Status (1)

Country Link
CN (1) CN103150157B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199782B (en) * 2014-08-25 2017-04-26 浙江大学城市学院 GPU memory access method
CN107291537A (en) * 2017-06-07 2017-10-24 江苏海平面数据科技有限公司 The optimization method that memory space is used on a kind of GPU pieces
CN109725903A (en) * 2017-10-30 2019-05-07 华为技术有限公司 Program code transform process method, device and compiling system
CN109783222A (en) * 2017-11-15 2019-05-21 杭州华为数字技术有限公司 A kind of method and apparatus for eliminating branch's disagreement

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
甘新标 et al., "Elliptic curve cryptography streaming technique for many-core GPU architectures," Journal of Sichuan University (Engineering Science Edition), 2011, vol. 43, no. 2, pp. 98-102. *
马安国, "Research on key techniques of high-efficiency GPGPU architectures," China Doctoral Dissertations Full-text Database, 2012, no. 3, pp. I138-18. *

Also Published As

Publication number Publication date
CN103150157A (en) 2013-06-12

Similar Documents

Publication Publication Date Title
US11436400B2 (en) Optimization method for graph processing based on heterogeneous FPGA data streams
CN103150265B Fine-grained data distribution method for heterogeneous on-chip memories in embedded systems
CN109002659B (en) Fluid machinery simulation program optimization method based on super computer
CN102193830B (en) Many-core environment-oriented division mapping/reduction parallel programming model
CN103150157B (en) Based on the GPU kernel program restructuring optimization method of memory access difference
CN103761215B (en) Matrix transpose optimization method based on graphic process unit
US20140143570A1 (en) Thread consolidation in processor cores
CN105487838A (en) Task-level parallel scheduling method and system for dynamically reconfigurable processor
JP2017091589A (en) Processor core and processor system
CN102253921B (en) Dynamic reconfigurable processor
CN110750265B (en) High-level synthesis method and system for graph calculation
US20210373799A1 (en) Method for storing data and method for reading data
CN104915213A (en) Partial reconfiguration controller of reconfigurable system
CN103699656A (en) GPU-based mass-multimedia-data-oriented MapReduce platform
CN102253919A (en) Parallel numerical simulation method and system based on GPU and CPU cooperative operation
CN104317770A (en) Data storage structure and data access method for multiple core processing system
CN114970294B (en) Three-dimensional strain simulation PCG parallel optimization method and system based on Shenwei architecture
CN112947870A (en) G-code parallel generation method of 3D printing model
CN102591787A (en) Method and device for data processing of JAVA card
CN112463739A (en) Data processing method and system based on ocean mode ROMS
CN106971369B (en) Data scheduling and distributing method based on GPU (graphics processing Unit) for terrain visual field analysis
CN102236632B (en) Method for hierarchically describing configuration information of dynamic reconfigurable processor
WO2021070303A1 (en) Computation processing device
CN101923386B (en) Method and device for reducing CPU power consumption and low power consumption CPU
CN102200961B (en) Expansion method of sub-units in dynamically reconfigurable processor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20151125

Termination date: 20220103