CN103049304A - Method for accelerating operating speed of graphics processing unit (GPU) through dead code removal - Google Patents

Info

Publication number
CN103049304A
Authority
CN
China
Prior art keywords
function
code
state
true
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013100205492A
Other languages
Chinese (zh)
Other versions
CN103049304B (en)
Inventor
迟利华
刘杰
胡庆丰
晏益慧
龚春叶
甘新标
徐涵
蒋杰
杨博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201310020549.2A
Publication of CN103049304A
Application granted
Publication of CN103049304B
Legal status: Expired - Fee Related

Landscapes

  • Stored Programmes (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a method for accelerating GPU running speed by removing dead code, improving both the execution and the compilation efficiency of large-scale GPU kernel programs. The technical scheme is as follows: first, a state detection table is built for all functions in the large-scale GPU kernel program, basic information about each function is recorded, and the table is initialized; next, the GPU program is analyzed statically; the GPU kernel program is then run, its runtime information is recorded, and the fields of every entry in the state detection table are updated; finally, dead code is marked and confirmed, and the code in the resulting dead code set D is deleted. Removing code that is never executed at run time reduces the code size of the GPU kernel program and the size of the generated assembly code, which can improve the hit rate of single instruction multiple data (SIMD) instruction scheduling on the GPU and greatly improve the running efficiency of large-scale GPU kernel programs.

Description

A method for accelerating GPU running speed through dead code removal
Technical field
The present invention relates to methods for accelerating large-scale GPU kernel programs, and in particular to a method for accelerating GPU running speed by removing dead code.
Background art
The GPU (Graphics Processing Unit) was formerly used mainly for graphics and image applications, but is now also widely used to accelerate general-purpose parallel algorithms and applications. The GPU kernel programs of these algorithms and applications are usually simple, typically only a few hundred lines of code. For some large applications of practical value, however, such as the non-deterministic particle transport program MCNP (Monte Carlo N-Particle), the GPU kernel code usually runs to tens of thousands of lines, and a large amount of it is dead code in any particular run. Compared with a CPU, a GPU has a smaller instruction cache and is therefore very sensitive to the size of the generated assembly code. In addition, GPU code generally inlines sub-functions using the _inline_ directive, so the whole kernel program must be globally optimized at compile time. The presence of dead code both enlarges the generated assembly code and weakens global optimization, seriously reducing GPU running speed. Current methods for accelerating GPU running speed mainly include the following:
(1) Placing read-only data in GPU constant memory to improve memory access speed.
(2) Placing frequently accessed data in on-chip GPU shared memory to improve memory access speed.
(3) Improving memory access speed through coalesced access to GPU global memory.
(4) Adjusting the thread block size of the GPU kernel program to improve register utilization and execution efficiency.
All four classes of methods have limitations. GPU constant memory is small and can hold only read-only data; on-chip shared memory is limited in capacity and suffers from bank conflicts; GPU global memory has high access latency; and the optimal thread block size of a large-scale GPU kernel program can usually be found only by trying candidates one at a time, which is inefficient.
Dead code is a code segment that could be executed during program execution but is not executed in an actual run. Dead code enlarges the assembly code, seriously hampers program optimization and scheduling, and lowers running efficiency. For large applications of practical value, the assembly code of the GPU kernel program is sizable while GPU instruction cache space is limited; dead code in the kernel program increases pressure on the instruction cache, wastes scarce cache space, seriously hampers scheduling and optimization of the kernel code, and lengthens its running time. Deleting dead code is therefore essential to improving the execution efficiency and running speed of large-scale GPU programs.
BODIK proposed a method for identifying dead assignment statements during compiler optimization; SWEENEY proposed a method for detecting unreachable procedures in object-oriented programs; XI proposed a method for detecting useless function parameters; and Zhang Guangmei detected invalid program branches from the control structure of the program. These dead code detection techniques are all compiler-based and focus on deriving and proving the underlying principles. The theory is thorough and credible, but the techniques are complex to apply, hard to operate, and difficult to adopt widely.
For large-scale GPU kernel programs of practical value, removing dead code can markedly improve execution efficiency. Detecting and deleting dead code would therefore accelerate GPU execution, yet no published work has so far studied such a technical scheme.
Summary of the invention
The technical problem to be solved by the present invention is the low execution efficiency of large-scale GPU kernel programs. While guaranteeing program correctness, the invention proposes a method for accelerating GPU running speed through dead code removal, improving the execution and compilation efficiency of large-scale GPU kernel programs.
To solve this technical problem, the technical scheme of the present invention is as follows:
Step 1: build the function state detection table. A state detection table is built for all functions in the large-scale GPU kernel program; the number of table entries equals the number of functions in the GPU kernel program. Each entry has six fields: function number ID, function name Name, call flag Callee, static analysis flag Static, dynamic execution flag Dynamic, and delete flag Del. ID is the globally unique identifier of the function, and Name is its name. Callee indicates whether the function is called by the program: true means it is called, false means it is not. Static records whether static analysis of the function module finds that the function can execute: true means the function may run when the program executes, false means it cannot. Dynamic indicates whether the function was executed while the program ran: true means it ran, false means it did not. Del indicates whether the function's code segment can be deleted: true means the code segment is dead code and should be deleted, false means it is not dead code and should not be deleted.
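As an illustration only, one entry of the state detection table could be modelled as a small record. The sketch below is written in Python for brevity; the class name `StateEntry`, the lower-case attribute names, and the function name `transport_kernel` are hypothetical, and only the six fields and their meanings come from the text.

```python
from dataclasses import dataclass


@dataclass
class StateEntry:
    """One entry of the state detection table (six fields, as in Step 1)."""
    id: int                # ID: globally unique function number
    name: str              # Name: function name
    callee: bool = True    # Callee: True if the function is called by the program
    static: bool = True    # Static: True if static analysis says it may run
    dynamic: bool = True   # Dynamic: True if it ran during the recorded run
    delete: bool = False   # Del: True if the code segment is dead code


# "transport_kernel" is a made-up function name used only for illustration.
entry = StateEntry(id=0, name="transport_kernel")
print(entry)
```

The default values are chosen to match the initial state that the description assigns to a freshly created entry.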
Step 2: record basic function information. Scan the program code, assign each function code segment in the GPU kernel program a unique function number starting from 0, and write the number into the ID field of the state detection table; record the name of the function numbered q in the Name field of the entry whose ID field is q. A GPU kernel program with N functions has a state detection table of N entries whose ID fields are 0 to N-1, with 0≤q≤N-1.
Step 3: initialize the state detection table. For every entry, initialize the call flag Callee to true, the static analysis flag Static to true, the dynamic execution flag Dynamic to true, and the delete flag Del to false.
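Steps 2 and 3 can be sketched together as follows. The text does not say how the source is scanned, so this sketch assumes CUDA-style source and uses a deliberately simple regular expression that looks for `__device__` or `__global__` qualifiers; a real tool would use a proper parser. The sample source and the helper name `build_state_table` are hypothetical.

```python
import re

# Assumption: functions are introduced by a CUDA __device__ or __global__
# qualifier. This pattern is a rough illustration, not a full C++ parser.
FUNC_PATTERN = re.compile(r"__(?:device|global)__\s+[\w\s*&:<>]*?\b(\w+)\s*\(")

SOURCE = """
__device__ float tally(float x) { return x * 2.0f; }
__device__ void debug_dump(float x) { }
__global__ void transport_kernel(float *out) { out[0] = tally(1.0f); }
"""


def build_state_table(source):
    """Number the functions from 0 (Step 2) and set the initial flags (Step 3)."""
    table = []
    for q, match in enumerate(FUNC_PATTERN.finditer(source)):
        table.append({
            "ID": q,                 # function numbers start at 0
            "Name": match.group(1),
            "Callee": True,
            "Static": True,
            "Dynamic": True,
            "Del": False,
        })
    return table


table = build_state_table(SOURCE)
for entry in table:
    print(entry["ID"], entry["Name"])
```

With N functions found, the table has N entries whose ID fields run from 0 to N-1, as the text requires.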
Step 4: perform static analysis on the GPU program source code and update the fields of each entry in the state detection table according to the static analysis results, as follows:
4.1 Obtain the total number of entries N in the state detection table and initialize q=0;
4.2 If the function corresponding to entry q cannot be called, set the Callee field of that entry to false; otherwise set it to true;
4.3 If the Callee field of entry q is true, set the Static field of that entry to true; otherwise set Static to false;
4.4 Update q=q+1; if q<N, go to 4.2; otherwise go to Step 5.
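The loop in 4.1 to 4.4 might look like the sketch below. The text does not state how "cannot be called" is decided; this sketch assumes a call graph is already available and treats a function as callable when it is reachable from a kernel entry point. The call graph, the entry point, and all function names are hypothetical.

```python
from collections import deque


def static_analysis(table, call_graph, entry_points):
    """Step 4: set Callee and Static for every entry of the table."""
    # Assumption: "can be called" means reachable from an entry point.
    reachable = set(entry_points)
    queue = deque(entry_points)
    while queue:
        caller = queue.popleft()
        for callee in call_graph.get(caller, ()):
            if callee not in reachable:
                reachable.add(callee)
                queue.append(callee)

    for entry in table:                               # q = 0 .. N-1 (4.1, 4.4)
        entry["Callee"] = entry["Name"] in reachable  # 4.2
        entry["Static"] = entry["Callee"]             # 4.3
    return table


names = ["transport_kernel", "track", "tally", "debug_dump"]
table = [{"ID": q, "Name": n, "Callee": True, "Static": True,
          "Dynamic": True, "Del": False} for q, n in enumerate(names)]
call_graph = {"transport_kernel": ["track"], "track": ["tally"]}

static_analysis(table, call_graph, entry_points=["transport_kernel"])
for entry in table:
    print(entry["Name"], entry["Static"])
```

In this toy input only `debug_dump` is never called, so it is the only entry whose Callee and Static fields become false.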
Step 5: run the GPU kernel program, record its runtime information, and update the fields of each entry in the state detection table, as follows:
5.1 Obtain the total number of entries N in the state detection table and initialize q=0;
5.2 If the function corresponding to entry q was not executed, set the Dynamic field of that entry to false; otherwise set Dynamic to true;
5.3 If the Static and Dynamic fields of entry q are false, set the Del field of that entry to true; otherwise set Del to false;
5.4 Update q=q+1; if q<N, go to 5.2; otherwise go to Step 6.
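A possible shape for 5.1 to 5.4 is sketched below. Two points are assumptions: the text does not say how runtime information is recorded, so the sketch simply takes a set of executed function names, and the condition in 5.3 is read by default as "Static and Dynamic are both false". Because the background defines dead code as code that may run but does not, a looser reading (Dynamic false alone) is offered through the hypothetical `require_both` parameter.

```python
def dynamic_update(table, executed, require_both=True):
    """Step 5: set Dynamic from the recorded run, then derive Del."""
    for entry in table:                                   # q = 0 .. N-1
        entry["Dynamic"] = entry["Name"] in executed      # 5.2
        if require_both:
            # Default reading of 5.3: both flags must be false.
            entry["Del"] = (not entry["Static"]) and (not entry["Dynamic"])
        else:
            # Alternative reading (an assumption): not executed is enough.
            entry["Del"] = not entry["Dynamic"]
    return table


# Hypothetical table state after static analysis.
table = [
    {"ID": 0, "Name": "transport_kernel", "Callee": True, "Static": True,
     "Dynamic": True, "Del": False},
    {"ID": 1, "Name": "track", "Callee": True, "Static": True,
     "Dynamic": True, "Del": False},
    {"ID": 2, "Name": "tally", "Callee": True, "Static": True,
     "Dynamic": True, "Del": False},
    {"ID": 3, "Name": "debug_dump", "Callee": False, "Static": False,
     "Dynamic": True, "Del": False},
]
executed = {"transport_kernel", "track"}   # made-up profiling result

dynamic_update(table, executed)
print([entry["Name"] for entry in table if entry["Del"]])
```

Whichever reading is used, the Del field is only a candidate mark at this stage; the later confirmation step decides what is actually removed.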
Step 6: mark dead code. Define sets S_0, S_1, ..., S_i, ..., S_{M-1} as empty sets. By the definition of the fields in the state detection table, a function whose entry has Del equal to true is a dead code function. Put the first function whose Del field is true into set S_0, the second into set S_1, ..., and the M-th into set S_{M-1}, where 0≤i<M≤N and M is the number of functions marked as dead code.
Step 7: confirm dead code. The procedure is as follows:
7.1 Initialize the dead code function set D to the empty set, D = ∅, and define variable j=0;
7.2 Using the comment syntax of the language in which the GPU kernel program is written, comment out the function code segment corresponding to S_j so that it is not executed;
7.3 Run the program with S_j commented out and compare the results; if the result is correct, update the dead code set D = D ∪ S_j; otherwise go to 7.4;
7.4 Remove the comment markers from the function code segment corresponding to the current S_j;
7.5 j=j+1;
7.6 If j<M, go to 7.2; otherwise go to Step 8.
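The marking and confirmation steps can be sketched as a trial loop. Building and running the real GPU program is replaced here by a hypothetical callback `run_with_disabled` that returns the program result for a given list of disabled functions. The sketch also assumes that a confirmed function stays disabled during later trials, which is one reading of 7.3; the names and the reference result 42 are made up.

```python
def confirm_dead_code(candidates, run_with_disabled, expected):
    """Step 7: keep only candidates whose removal leaves the result correct."""
    confirmed = []                                  # 7.1: D is empty, j = 0
    for name in candidates:                         # 7.5, 7.6: j = 0 .. M-1
        trial = confirmed + [name]                  # 7.2: comment out S_j
        if run_with_disabled(trial) == expected:    # 7.3: compare the results
            confirmed.append(name)                  # D = D ∪ S_j
        # 7.4: otherwise S_j is restored, i.e. it is not kept in the list
    return confirmed


def run_with_disabled(disabled):
    """Toy stand-in for rebuilding and running the GPU kernel program."""
    if "tally" in disabled:
        return None        # the result changes when tally is removed
    return 42


# Step 6: the candidates are the functions whose Del field is true.
table = [
    {"Name": "transport_kernel", "Del": False},
    {"Name": "tally", "Del": True},
    {"Name": "debug_dump", "Del": True},
]
candidates = [entry["Name"] for entry in table if entry["Del"]]

dead = confirm_dead_code(candidates, run_with_disabled, expected=42)
print(dead)
```

Each candidate costs one full build and run, so the loop performs M trials; the check guards against marking a function whose absence changes the result.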
Step 8: delete dead code. Delete dead code according to the final dead code set D, as follows:
8.1 Obtain the number of elements P of the dead code set D, 0≤P≤M≤N, and define variable k=0;
8.2 Delete the function code segment corresponding to the k-th element of D. The deletion is done as follows:
To preserve the readability and maintainability of the program, dead code is not physically deleted. Instead, each function code segment whose Del field in the state detection table is true is handled by preprocessing: preprocessor directives are added before and after the function code segment to comment the function out, so the source code itself need not be deleted.
8.3 Update k=k+1; if k<P, go to 8.2; otherwise go to Step 9.
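The non-destructive deletion in 8.2 could be done by wrapping each dead function in a preprocessor guard. The text only says that preprocessor directives are placed before and after the function; the choice of `#if 0` and `#endif` below is an assumption, as are the helper name `disable_function`, the sample source lines, and the hand-supplied line range.

```python
def disable_function(source_lines, start, end, name):
    """Wrap lines start..end (0-based, inclusive) in an #if 0 / #endif guard."""
    return (source_lines[:start]
            + [f"#if 0  /* dead code: {name} */"]   # assumed directive
            + source_lines[start:end + 1]
            + ["#endif"]
            + source_lines[end + 1:])


source = [
    "__device__ float tally(float x) { return 2.0f * x; }",
    "__device__ void debug_dump(float x) {",
    "}",
    "__global__ void transport_kernel(float *out) { out[0] = tally(1.0f); }",
]

# Lines 1..2 hold the hypothetical dead function debug_dump.
patched = disable_function(source, 1, 2, "debug_dump")
print("\n".join(patched))
```

The original text stays in the file, so the function can be restored by removing the two guard lines.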
Step 9: end.
Compared with the prior art, the present invention achieves the following technical effect:
The invention is applicable to all kinds of large-scale GPU kernel programs. Removing dead code that is not executed at run time reduces the code size of the GPU kernel program, which shortens compilation and optimization time and reduces the size of the generated assembly code. This improves the hit rate of SIMD (Single Instruction Multiple Data) instruction scheduling on the GPU and markedly improves the running efficiency of large-scale GPU kernel programs.
Description of drawings
Fig. 1 shows the structure of the dead code state detection table.
Fig. 2 is the overall flow chart of the present invention.
Detailed description of the embodiments
Fig. 1 shows the structure of the dead code state detection table. The table is built as follows:
The number of table entries equals the number of functions in the GPU kernel program. Each entry has six fields: function number ID, function name Name, call flag Callee, static analysis flag Static, dynamic execution flag Dynamic, and delete flag Del. ID is the globally unique identifier of the function, and Name is its name. Callee indicates whether the function is called by the program: true means it is called, false means it is not. Static records whether static analysis of the function module finds that the function can execute: true means the function may run when the program executes, false means it cannot. Dynamic indicates whether the function was executed while the program ran: true means it ran, false means it did not. Del indicates whether the function's code segment can be deleted: true means the code segment is dead code and should be deleted, false means it is not dead code and should not be deleted.
Fig. 2 is the overall flow chart of the present invention. The implementation steps are as follows:
Step 1: build the function state detection table.
Step 2: record basic function information.
Step 3: initialize the state detection table.
Step 4: perform static analysis on the GPU program source code.
Step 5: update the state of the state detection table.
Step 6: mark dead code.
Step 7: confirm dead code.
Step 8: delete dead code.
Step 9: end.

Claims (2)

1. A method for accelerating GPU running speed through dead code removal, characterized by comprising the following steps:
Step 1: build a state detection table for all functions in the large-scale GPU kernel program, the number of table entries being the number of functions in the GPU kernel program; each entry has six fields: function number ID, function name Name, call flag Callee, static analysis flag Static, dynamic execution flag Dynamic, and delete flag Del; ID is the globally unique identifier of the function, and Name is its name; Callee indicates whether the function is called by the program: true means it is called, false means it is not; Static records whether static analysis of the function module finds that the function can execute: true means the function may run when the program executes, false means it cannot; Dynamic indicates whether the function was executed while the program ran: true means it ran, false means it did not; Del indicates whether the function's code segment can be deleted: true means the code segment is dead code and should be deleted, false means it is not dead code and should not be deleted;
Step 2: record basic function information: scan the program code, assign each function code segment in the GPU kernel program a unique function number starting from 0, and write the number into the ID field of the state detection table; record the name of the function numbered q in the Name field of the entry whose ID field is q; a GPU kernel program with N functions has a state detection table of N entries whose ID fields are 0 to N-1, with 0≤q≤N-1;
Step 3: initialize the state detection table: for every entry, initialize the call flag Callee to true, the static analysis flag Static to true, the dynamic execution flag Dynamic to true, and the delete flag Del to true;
Step 4: perform static analysis on the GPU program source code and update the fields of each entry in the state detection table according to the static analysis results, as follows:
4.1 Obtain the total number of entries N in the state detection table and initialize q=0;
4.2 If the function corresponding to entry q cannot be called, set the Callee field of that entry to false; otherwise set it to true;
4.3 If the Callee field of entry q is true, set the Static field of that entry to true; otherwise set Static to false;
4.4 Update q=q+1; if q<N, go to 4.2; otherwise go to Step 5;
Step 5: run the GPU kernel program, record its runtime information, and update the fields of each entry in the state detection table, as follows:
5.1 Obtain the total number of entries N in the state detection table and initialize q=0;
5.2 If the function corresponding to entry q cannot be called, set the Callee field of that entry to false; otherwise set it to true;
5.3 If the Static and Dynamic fields of entry q are false, set the Del field of that entry to true; otherwise set Del to false;
5.4 Update q=q+1; if q<N, go to 5.2; otherwise go to Step 6;
Step 6: mark dead code: define sets S_0, S_1, ..., S_i, ..., S_{M-1} as empty sets; put the first function whose Del field in the state detection table is true into set S_0, the second into set S_1, ..., and the M-th into set S_{M-1}, where 0≤i<M≤N and M is the number of functions marked as dead code;
Step 7: confirm dead code, as follows:
7.1 Initialize the dead code function set D to the empty set, D = ∅, and define variable j=0;
7.2 Using the comment syntax of the language in which the GPU kernel program is written, comment out the function code segment corresponding to S_j so that it is not executed;
7.3 Run the program with S_j commented out and compare the results; if the result is correct, update the dead code set D = D ∪ S_j; otherwise go to 7.4;
7.4 Remove the comment markers from the function code segment corresponding to the current S_j;
7.5 j=j+1;
7.6 If j<M, go to 7.2; otherwise go to Step 8;
Step 8: delete dead code according to the final dead code set D, as follows:
8.1 Obtain the number of elements P of the dead code set D, 0≤P≤M≤N, and define variable k=0;
8.2 Delete the function code segment corresponding to the k-th element of the dead code set D;
8.3 Update k=k+1; if k<P, go to 8.2; otherwise go to Step 9;
Step 9: the optimization ends.
2. The method for accelerating GPU running speed through dead code removal according to claim 1, characterized in that the function code segment corresponding to the k-th element of the dead code set D is deleted as follows: the function code segment whose Del field in the state detection table is true is handled by preprocessing, that is, preprocessor directives are added before and after the function code segment to comment the function out.
CN201310020549.2A 2013-01-21 2013-01-21 Method for accelerating operating speed of graphics processing unit (GPU) through dead code removal Expired - Fee Related CN103049304B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310020549.2A CN103049304B (en) 2013-01-21 2013-01-21 Method for accelerating operating speed of graphics processing unit (GPU) through dead code removal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310020549.2A CN103049304B (en) 2013-01-21 2013-01-21 Method for accelerating operating speed of graphics processing unit (GPU) through dead code removal

Publications (2)

Publication Number Publication Date
CN103049304A (en) 2013-04-17
CN103049304B (en) 2015-09-16

Family

ID=48061955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310020549.2A Expired - Fee Related CN103049304B (en) 2013-01-21 2013-01-21 Method for accelerating operating speed of graphics processing unit (GPU) through dead code removal

Country Status (1)

Country Link
CN (1) CN103049304B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104881274A (en) * 2014-02-28 2015-09-02 上海斐讯数据通信技术有限公司 Method for identifying useless codes
CN106843993A (en) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 A kind of method and system of resolving inversely GPU instructions
CN110109657A (en) * 2019-03-29 2019-08-09 南京佑驾科技有限公司 A kind of GPU microcommand detection method
CN111104289A (en) * 2019-12-25 2020-05-05 创新奇智(上海)科技有限公司 System and method for checking efficiency of GPU (graphics processing Unit) cluster

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2090983A1 (en) * 2008-02-15 2009-08-19 Siemens Aktiengesellschaft Determining an architecture for executing code in a multi architecture environment
CN102646060A (en) * 2012-02-23 2012-08-22 中国人民解放军国防科学技术大学 Method for detecting nodes not meeting requirement on computational accuracy in high-performance computer system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MEYER B等: "Transformation of Scientific Algorithms to Parallel Computing Code: Single GPU and MPI Multi GPU Backends with Subdomain Support", 《2011 SYMPOSIUM ON APPLICATION ACCELERATORS IN HIGH-PERFORMANCE COMPUTING》 *
迟利华等: "基于安腾微处理器的程序性能优化与分析", 《计算机工程与科学》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104881274A (en) * 2014-02-28 2015-09-02 上海斐讯数据通信技术有限公司 Method for identifying useless codes
CN104881274B (en) * 2014-02-28 2018-02-13 上海斐讯数据通信技术有限公司 The method for identifying dead code
CN106843993A (en) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 A kind of method and system of resolving inversely GPU instructions
CN106843993B (en) * 2016-12-26 2019-07-30 中国科学院计算技术研究所 A kind of method and system of resolving inversely GPU instruction
CN110109657A (en) * 2019-03-29 2019-08-09 南京佑驾科技有限公司 A kind of GPU microcommand detection method
CN110109657B (en) * 2019-03-29 2023-06-20 南京佑驾科技有限公司 GPU micro instruction detection method
CN111104289A (en) * 2019-12-25 2020-05-05 创新奇智(上海)科技有限公司 System and method for checking efficiency of GPU (graphics processing Unit) cluster

Also Published As

Publication number Publication date
CN103049304B (en) 2015-09-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150916

Termination date: 20210121