CN103049304B - Method for accelerating GPU execution speed by removing dead code - Google Patents

Info

Publication number
CN103049304B
Authority
CN
China
Prior art keywords
function
code
state
gpu
true
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310020549.2A
Other languages
Chinese (zh)
Other versions
CN103049304A (en
Inventor
迟利华
刘杰
胡庆丰
晏益慧
龚春叶
甘新标
徐涵
蒋杰
杨博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201310020549.2A priority Critical patent/CN103049304B/en
Publication of CN103049304A publication Critical patent/CN103049304A/en
Application granted granted Critical
Publication of CN103049304B publication Critical patent/CN103049304B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Devices For Executing Special Programs (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a method for accelerating GPU execution speed by removing dead code, with the aim of improving the execution and compilation efficiency of large-scale GPU kernel programs. The technical scheme is as follows: first, build a state-detection table for all functions in the large-scale GPU kernel program; record basic function information and initialize the state-detection table; statically analyze the GPU program; then run the GPU kernel program, record its run-time information, and update the fields of each function's entry in the state-detection table; then mark the dead code, confirm it, and delete the dead code according to the final dead-code set D. By removing dead code that is not executed at run time, the invention reduces the code size of the GPU kernel program and the size of the generated assembly code, thereby improving the hit rate of SIMD instruction scheduling in the GPU and significantly improving the running efficiency of large-scale GPU kernel programs.

Description

Method for accelerating GPU execution speed by removing dead code
Technical field
The present invention relates to methods for accelerating the execution of large-scale GPU kernel programs, and in particular to a method for accelerating GPU execution speed by removing dead code.
Background art
The GPU (Graphics Processing Unit) was formerly used mainly for graphics and image applications and is now also widely used to accelerate general-purpose parallel algorithms and applications. The kernel programs that these algorithms and applications run on the GPU are usually fairly simple, typically on the order of a hundred lines of code. However, for some large-scale applications of practical value, such as the non-deterministic particle transport program MCNP (Monte Carlo N-Particle), the kernel code of the GPU implementation typically exceeds ten thousand lines, and a large amount of it is dead code in any particular run. Compared with a CPU, a GPU has a smaller instruction cache and is therefore very sensitive to the size of the generated assembly code. In addition, GPU code generally uses the __inline__ directive to inline sub-functions, so the whole kernel program must be globally optimized at compile time. Dead code both increases the size of the generated assembly code and reduces the effectiveness of global optimization, which seriously degrades GPU execution speed. Current methods for accelerating GPU execution mainly include the following:
(1) Place read-only data in GPU constant memory to improve memory access speed.
(2) Place frequently accessed data in on-chip GPU shared memory to improve memory access speed.
(3) Improve memory access speed through coalesced access to GPU global memory.
(4) Adjust the thread block size of the GPU kernel program to improve register utilization and execution efficiency.
All four kinds of methods have limitations. GPU constant memory is limited in capacity and can hold only read-only data; on-chip shared memory is limited in capacity and subject to bank conflicts; GPU global memory has high access latency; and the optimal thread block size for a large-scale GPU kernel program can usually be found only by trying candidate sizes one by one, which is inefficient.
Dead code is code that could be executed during a program run but is not actually executed at run time. Dead code increases the size of the assembly code, seriously interferes with program optimization and scheduling, and lowers running efficiency. For some large-scale applications of practical value, the assembly code of the GPU kernel program is considerable in size while the GPU instruction cache is quite limited. Dead code in the GPU kernel program increases the pressure on the instruction cache, wastes scarce instruction cache space, seriously interferes with the scheduling and optimization of the GPU kernel code, and lengthens its running time. Removing dead code is therefore essential for improving the execution efficiency and speed of large-scale GPU programs.
BODIK proposed a method for identifying dead assignment statements during compiler optimization; SWEENEY proposed a method for detecting unreachable subroutines in object-oriented programs; XI proposed a method for detecting useless function parameters; Zhang Guangmei detected invalid program branches on the basis of the program's control structure. These dead-code detection techniques are all compiler-based and concentrate on deriving and proving the underlying principles. The theoretical proofs are thorough and credible, but the techniques are complex to apply, hard to operate in practice, and therefore difficult to popularize.
For large-scale GPU kernel programs of practical value, removing dead code can significantly improve execution efficiency. If dead code can be detected and deleted, GPU execution will therefore be accelerated, but no published literature has yet studied a technical scheme for doing so.
Summary of the invention
The technical problem to be solved by the invention is the low execution efficiency of large-scale GPU kernel programs. On the premise that program correctness is preserved, the invention proposes a method for accelerating GPU execution speed by removing dead code, improving the execution and compilation efficiency of large-scale GPU kernel programs.
To solve this technical problem, the technical scheme of the invention is as follows:
Step 1: Build the function state-detection table. A state-detection table is built for all functions in the large-scale GPU kernel program; the number of entries in the table equals the number of functions in the GPU kernel program. The table has six fields: function number ID, function name Name, call flag Callee, static-analysis flag Static, dynamic-execution flag Dynamic, and delete flag Del. ID is the globally unique identifier of a function, and Name is the function's name. Callee indicates whether the function is called by the program: true means it is called, false means it is not. Static records whether static analysis of the function module finds that the function can execute: true means the function may run when the program executes, false means it cannot. Dynamic records whether the function is executed when the program runs: true means the function ran, false means it did not. Del indicates whether the function's code segment can be deleted: true means the code segment is dead code and should be deleted, false means it is not dead code and should not be deleted.
Step 2: Record basic function information. Scan the program code and, starting from 0, assign a unique function number to each function code segment in the GPU kernel program, writing the assigned number into the ID field of the state-detection table; record the name of the function numbered q in the Name field of the entry whose ID field is q. For a GPU kernel program with N functions, the state-detection table has N entries whose ID fields are 0 to N-1, with 0≤q≤N-1.
Step 3: Initialize the state-detection table. For every entry, initialize the call flag Callee to true, the static-analysis flag Static to true, the dynamic-execution flag Dynamic to true, and the delete flag Del to false.
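Steps 1 to 3 can be sketched in Python as follows. This is only an illustrative sketch: the class name FunctionEntry, the helper build_table and the three function names are hypothetical and do not appear in the patent; only the six field names and their initial values come from the text.

```python
from dataclasses import dataclass


@dataclass
class FunctionEntry:
    """One entry of the state-detection table (field names follow the text)."""
    ID: int                # globally unique function number, 0..N-1
    Name: str              # function name
    Callee: bool = True    # step 3: initialized to true
    Static: bool = True    # step 3: initialized to true
    Dynamic: bool = True   # step 3: initialized to true
    Del: bool = False      # step 3: initialized to false


def build_table(function_names):
    """Steps 1-3: one entry per function, numbered from 0 in scan order."""
    return [FunctionEntry(ID=q, Name=name) for q, name in enumerate(function_names)]


# Hypothetical function names, for illustration only.
table = build_table(["kernel_main", "track_particle", "debug_dump"])
```

With N functions the table has N entries whose ID fields run from 0 to N-1, as step 2 requires.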
Step 4: Perform static analysis on the GPU program source code and use the results to update the fields of each function's entry in the state-detection table, as follows:
4.1 Obtain the total number of entries N of the state-detection table and initialize q=0;
4.2 If the function corresponding to entry q cannot be called, set the Callee field of the entry to false; otherwise set it to true;
4.3 If the Callee field of entry q is true, set the Static field of the entry to true; otherwise set Static to false;
4.4 Update q=q+1; if q<N, go to 4.2; otherwise, go to Step 5.
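A minimal sketch of Step 4 is given below. The text does not say how "cannot be called" is decided; the sketch assumes it means "not reachable from any kernel entry point in a call graph". The call graph, the entry-point list and all function names are hypothetical, and table entries are represented as plain dicts keyed by the field names.

```python
def static_analysis(table, call_graph, entry_points):
    """Step 4: set Callee and Static for every entry.

    Assumption: a function "cannot be called" when it is not reachable
    from any entry point in the call graph.
    """
    reachable = set()
    stack = list(entry_points)
    while stack:                                  # depth-first reachability
        name = stack.pop()
        if name in reachable:
            continue
        reachable.add(name)
        stack.extend(call_graph.get(name, ()))
    for entry in table:                           # q = 0 .. N-1
        entry["Callee"] = entry["Name"] in reachable   # step 4.2
        entry["Static"] = entry["Callee"]              # step 4.3
    return table


# Hypothetical program: legacy_io is never called from the kernel entry point.
names = ["kernel_main", "track_particle", "tally", "legacy_io"]
table = [{"ID": q, "Name": n, "Callee": True, "Static": True,
          "Dynamic": True, "Del": False} for q, n in enumerate(names)]
call_graph = {"kernel_main": ["track_particle"], "track_particle": ["tally"]}
static_analysis(table, call_graph, ["kernel_main"])
```

After the call, only the entry for legacy_io has Callee and Static set to false.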
Step 5: Run the GPU kernel program, record its run-time information, and update the fields of each function's entry in the state-detection table, as follows:
5.1 Obtain the total number of entries N of the state-detection table and initialize q=0;
5.2 If the function corresponding to entry q was not executed, set the Dynamic field of the entry to false; otherwise set Dynamic to true;
5.3 If the Static and Dynamic fields of entry q are false, set the Del field of the entry to true; otherwise set Del to false;
5.4 Update q=q+1; if q<N, go to 5.2; otherwise, go to Step 6.
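Step 5 can be sketched as follows. Two assumptions are made and should be checked against the original: the run-time record is modeled as a set of names of functions that executed (the text does not describe how it is collected), and the condition in 5.3 is read as "Static and Dynamic are both false". All names are hypothetical.

```python
def dynamic_update(table, executed):
    """Step 5: set Dynamic from a run-time record, then derive Del.

    Assumption: `executed` is a set of names of functions that ran.
    Assumption: step 5.3 is read as "Static and Dynamic are both false".
    """
    for entry in table:                                    # q = 0 .. N-1
        entry["Dynamic"] = entry["Name"] in executed       # step 5.2
        entry["Del"] = (not entry["Static"]) and (not entry["Dynamic"])  # step 5.3
    return table


# Hypothetical table after static analysis: legacy_io was found uncallable.
table = [
    {"ID": 0, "Name": "kernel_main", "Callee": True, "Static": True, "Dynamic": True, "Del": False},
    {"ID": 1, "Name": "track_particle", "Callee": True, "Static": True, "Dynamic": True, "Del": False},
    {"ID": 2, "Name": "legacy_io", "Callee": False, "Static": False, "Dynamic": True, "Del": False},
]
dynamic_update(table, executed={"kernel_main"})
```

Under this reading, track_particle (callable but not executed in this run) keeps Del set to false, and only legacy_io is flagged.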
Step 6: Mark the dead code. Define sets S_0, S_1, ..., S_{M-1} as empty sets. According to the field definitions of the state-detection table, a function whose entry has Del set to true is a dead-code function. Add the first function whose Del field is true to set S_0, the second such function to set S_1, ..., and the M-th such function to set S_{M-1}, so that S_i holds the (i+1)-th such function, where 0≤i<M≤N and M is the number of functions marked as dead code.
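The marking step amounts to collecting the flagged functions into singleton sets in table order. A short sketch, with hypothetical function names and dict-based entries:

```python
def mark_dead_code(table):
    """Step 6: S[i] holds the (i+1)-th function whose Del field is true."""
    return [{entry["Name"]} for entry in table if entry["Del"]]


# Hypothetical table in which two of four functions are flagged.
table = [
    {"ID": 0, "Name": "a", "Del": False},
    {"ID": 1, "Name": "b", "Del": True},
    {"ID": 2, "Name": "c", "Del": False},
    {"ID": 3, "Name": "d", "Del": True},
]
S = mark_dead_code(table)
M = len(S)   # number of functions marked as dead code
```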
Step 7: Confirm the dead code, as follows:
7.1 Initialize the dead-code function set D to empty and define the variable j=0;
7.2 Using the comment syntax of the language in which the GPU kernel program is written, comment out the function code segment corresponding to S_j so that it cannot run;
7.3 Run the program with S_j commented out and compare the results; if the result is correct, update the dead-code set D = D ∪ S_j; otherwise, go to 7.4;
7.4 Remove the comment markers from the function code segment corresponding to the current S_j;
7.5 j=j+1;
7.6 If j<M, go to 7.2; otherwise, go to Step 8.
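The confirmation loop can be sketched as below. The callback run_with_disabled is hypothetical: it stands in for commenting out the given functions, rebuilding, running the GPU program and returning its result, none of which the text specifies in detail. The sketch also assumes that functions already confirmed into D stay commented out during later trials.

```python
def confirm_dead_code(candidates, run_with_disabled, expected):
    """Step 7: keep S_j in D only if the program still gives the expected result."""
    D = set()                                        # step 7.1
    for S_j in candidates:                           # steps 7.5 / 7.6: j = 0 .. M-1
        disabled = D | S_j                           # step 7.2: comment out S_j
        if run_with_disabled(disabled) == expected:  # step 7.3: compare results
            D |= S_j
        # step 7.4: otherwise S_j is left out of D, i.e. its code is restored
    return D


def fake_run(disabled):
    """Stand-in for a real build-and-run; pretends `helper` is actually needed."""
    return "wrong" if "helper" in disabled else "ok"


D = confirm_dead_code([{"legacy_io"}, {"helper"}, {"debug_dump"}], fake_run, "ok")
```

In this toy run, helper is rejected because disabling it changes the result, so D ends up containing legacy_io and debug_dump.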
Step 8: Eliminate the dead code. Dead code is removed according to the final dead-code set D, as follows:
8.1 Obtain the number of elements P of the dead-code set D, with 0≤P≤M≤N, and define the variable k=0;
8.2 Delete the function code segment corresponding to the k-th element of D. The deletion is carried out as follows:
To preserve the readability and maintainability of the program, dead code is not physically deleted. Instead, each function code segment whose Del field in the state-detection table is true is handled by the preprocessor: preprocessor directives are added before and after the function code segment so that the function is excluded from compilation. The source code therefore does not need to be deleted directly.
8.3 Update k=k+1; if k<P, go to 8.2; otherwise, go to Step 9.
Step 9: End.
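Step 8.2 can be illustrated with the sketch below. The text only says that preprocessor directives are placed before and after the function; the use of `#if 0` and `#endif` here is an assumption suited to C-like GPU source. The `spans` mapping from function name to line range is also hypothetical, since the text does not describe how function boundaries are located.

```python
def disable_functions(source_lines, spans, dead):
    """Wrap each dead function's line span in #if 0 / #endif.

    `spans` maps a function name to (first, last) 0-based line indices,
    inclusive. The original lines are kept, so nothing is physically deleted.
    """
    starts = {spans[name][0]: name for name in dead}
    ends = {spans[name][1] for name in dead}
    out = []
    for i, line in enumerate(source_lines):
        if i in starts:
            out.append(f"#if 0  /* dead code: {starts[i]} */")
        out.append(line)
        if i in ends:
            out.append("#endif")
    return out


# Hypothetical CUDA-style source with one dead function.
src = [
    "__device__ int used(int x) { return x + 1; }",
    "__device__ int unused(int x) {",
    "    return x * 2;",
    "}",
]
result = disable_functions(src, {"used": (0, 0), "unused": (1, 3)}, {"unused"})
```

Because the guarded text stays in the file, re-enabling a function only requires removing the two directive lines.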
Compared with the prior art, the invention achieves the following technical effect:
1. The invention is applicable to a wide range of large-scale GPU kernel programs. By removing dead code that is not executed at run time, it reduces the code size of the GPU kernel program, shortens compilation and optimization time, and reduces the size of the generated assembly code, thereby improving the hit rate of SIMD (Single Instruction Multiple Data) instruction scheduling in the GPU and significantly improving the running efficiency of large-scale GPU kernel programs.
Brief description of the drawings
Fig. 1 shows the structure of the dead-code state-detection table.
Fig. 2 is the overall flow chart of the invention.
Detailed description of the embodiments
Fig. 1 shows the structure of the dead-code state-detection table. The table is structured as follows:
The number of entries in the state-detection table equals the number of functions in the GPU kernel program. The table has six fields: function number ID, function name Name, call flag Callee, static-analysis flag Static, dynamic-execution flag Dynamic, and delete flag Del. ID is the globally unique identifier of a function, and Name is the function's name. Callee indicates whether the function is called by the program: true means it is called, false means it is not. Static records whether static analysis of the function module finds that the function can execute: true means the function may run when the program executes, false means it cannot. Dynamic records whether the function is executed when the program runs: true means the function ran, false means it did not. Del indicates whether the function's code segment can be deleted: true means the code segment is dead code and should be deleted, false means it is not dead code and should not be deleted.
Fig. 2 is the overall flow chart of the invention. The implementation steps are as follows:
Step 1: Build the function state-detection table.
Step 2: Record basic function information.
Step 3: Initialize the state-detection table.
Step 4: Perform static analysis on the GPU program source code.
Step 5: Update the state of the state-detection table.
Step 6: Mark the dead code.
Step 7: Confirm the dead code.
Step 8: Eliminate the dead code.
Step 9: End.

Claims (2)

1. A method for accelerating GPU execution speed by removing dead code, characterized by comprising the following steps:
Step 1: build a state-detection table for all functions in the large-scale GPU kernel program, the number of entries in the state-detection table being the number of functions in the GPU kernel program; the state-detection table has six fields: function number ID, function name Name, call flag Callee, static-analysis flag Static, dynamic-execution flag Dynamic, and delete flag Del; ID is the globally unique identifier of a function, and Name is the function's name; Callee indicates whether the function is called by the program, true meaning that it is called and false meaning that it is not; Static records whether static analysis of the function module finds that the function can execute, true meaning that the function may run when the program executes and false meaning that it cannot; Dynamic records whether the function is executed when the program runs, true meaning that the function ran and false meaning that it did not; Del indicates whether the function's code segment can be deleted, true meaning that the code segment is dead code and should be deleted and false meaning that it is not dead code and should not be deleted;
Step 2: record basic function information: scan the program code and, starting from 0, assign a unique function number to each function code segment in the GPU kernel program, writing the assigned number into the ID field of the state-detection table, and record the name of the function numbered q in the Name field of the entry whose ID field is q; for a GPU kernel program with N functions, the state-detection table has N entries whose ID fields are 0 to N-1, with 0≤q≤N-1;
Step 3: initialize the state-detection table: for every entry, initialize the call flag Callee to true, the static-analysis flag Static to true, the dynamic-execution flag Dynamic to true, and the delete flag Del to true;
Step 4: perform static analysis on the GPU program source code and use the results to update the fields of each function's entry in the state-detection table, as follows:
4.1 obtain the total number of entries N of the state-detection table and initialize q=0;
4.2 if the function corresponding to entry q cannot be called, set the Callee field of the entry to false; otherwise set it to true;
4.3 if the Callee field of entry q is true, set the Static field of the entry to true; otherwise set Static to false;
4.4 update q=q+1; if q<N, go to 4.2; otherwise, go to Step 5;
Step 5: run the GPU kernel program, record its run-time information, and update the fields of each function's entry in the state-detection table, as follows:
5.1 obtain the total number of entries N of the state-detection table and initialize q=0;
5.2 if the function corresponding to entry q cannot be called, set the Callee field of the entry to false; otherwise set it to true;
5.3 if the Static field and the Dynamic field of entry q are false, set the Del field of the entry to true; otherwise set Del to false;
5.4 update q=q+1; if q<N, go to 5.2; otherwise, go to Step 6;
Step 6: mark the dead code: define sets S_0, S_1, ..., S_{M-1} as empty sets; add the first function whose Del field in the state-detection table is true to set S_0, the second such function to set S_1, ..., and the M-th such function to set S_{M-1}, where 0≤i<M≤N and M is the number of functions marked as dead code;
Step 7: confirm the dead code, as follows:
7.1 initialize the dead-code function set D to empty and define the variable j=0;
7.2 using the comment syntax of the language in which the GPU kernel program is written, comment out the function code segment corresponding to S_j so that it cannot run;
7.3 run the program with S_j commented out and compare the results; if the result is correct, update the dead-code set D = D ∪ S_j; otherwise, go to 7.4;
7.4 remove the comment markers from the function code segment corresponding to the current S_j;
7.5 j=j+1;
7.6 if j<M, go to 7.2; otherwise, go to Step 8;
Step 8: eliminate the dead code according to the final dead-code set D, as follows:
8.1 obtain the number of elements P of the dead-code set D, with 0≤P≤M≤N, and define the variable k=0;
8.2 delete the function code segment corresponding to the k-th element of the dead-code set D;
8.3 update k=k+1; if k<P, go to 8.2; otherwise, go to Step 9;
Step 9: the optimization ends.
2. The method for accelerating GPU execution speed by removing dead code as claimed in claim 1, characterized in that the function code segment corresponding to the k-th element of the dead-code set D is deleted as follows: each function code segment whose Del field in the state-detection table is true is handled by the preprocessor, that is, preprocessor directives are added before and after the function code segment so that the function is excluded from compilation.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310020549.2A CN103049304B (en) 2013-01-21 2013-01-21 Method for accelerating GPU execution speed by removing dead code

Publications (2)

Publication Number Publication Date
CN103049304A CN103049304A (en) 2013-04-17
CN103049304B true CN103049304B (en) 2015-09-16

Family

ID=48061955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310020549.2A Expired - Fee Related CN103049304B (en) 2013-01-21 2013-01-21 Method for accelerating GPU execution speed by removing dead code

Country Status (1)

Country Link
CN (1) CN103049304B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104881274B (en) * 2014-02-28 2018-02-13 上海斐讯数据通信技术有限公司 The method for identifying dead code
CN106843993B (en) * 2016-12-26 2019-07-30 中国科学院计算技术研究所 A kind of method and system of resolving inversely GPU instruction
CN110109657B (en) * 2019-03-29 2023-06-20 南京佑驾科技有限公司 GPU micro instruction detection method
CN111104289B (en) * 2019-12-25 2023-03-14 创新奇智(上海)科技有限公司 System and method for checking efficiency of GPU (graphics processing Unit) cluster

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2090983A1 (en) * 2008-02-15 2009-08-19 Siemens Aktiengesellschaft Determining an architecture for executing code in a multi architecture environment
CN102646060A (en) * 2012-02-23 2012-08-22 中国人民解放军国防科学技术大学 Method for detecting nodes not meeting requirement on computational accuracy in high-performance computer system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Transformation of Scientific Algorithms to Parallel Computing Code: Single GPU and MPI Multi GPU Backends with Subdomain Support;Meyer B等;《2011 Symposium on Application Accelerators in High-Performance Computing》;20110721;全文 *
基于安腾微处理器的程序性能优化与分析;迟利华等;《计算机工程与科学》;20110930;第33卷(第9期);全文 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150916

Termination date: 20210121