CN103049304A - Method for accelerating operating speed of graphics processing unit (GPU) through dead code removal

Info

Publication number: CN103049304A
Application number: CN2013100205492A
Authority: CN
Inventors: 迟利华; 刘杰; 胡庆丰; 晏益慧; 龚春叶; 甘新标; 徐涵; 蒋杰; 杨博
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2013-01-21
Filing date: 2013-01-21
Publication date: 2013-04-17
Anticipated expiration: 2033-01-21
Also published as: CN103049304B

Abstract

The invention discloses a method for accelerating GPU execution speed through dead code removal, which improves the execution and compilation efficiency of large-scale GPU kernel programs. In the technical scheme, a state detection table is first built for all functions in the large-scale GPU kernel program, basic function information is recorded, and the table is initialized. The GPU program is then analyzed statically. Next, the GPU kernel program is run, its runtime information is recorded, and the fields of every function entry in the state detection table are updated. Dead code is then marked and finally confirmed, and it is deleted according to the resulting dead code set D. Because the method removes dead code that is not executed at run time, the code size of the GPU kernel program is reduced, and so is the size of the generated assembly code; this can improve the hit rate of single instruction multiple data (SIMD) instruction scheduling in the GPU and can greatly improve the running efficiency of large-scale GPU kernel programs.

Description

A method for accelerating GPU execution speed through dead code removal

Technical field

The present invention relates to methods for accelerating the execution of large-scale GPU kernel programs, and in particular to a method for accelerating GPU execution speed by removing dead code.

Background technology

GPUs (Graphics Processing Units) were formerly used mainly for graphics and image applications, and are now also widely used to accelerate general-purpose parallel algorithms and applications. The GPU kernel programs of these algorithms and applications are usually fairly simple, typically only on the order of a hundred lines of code. For some large-scale applications of real practical value, however, such as the non-deterministic particle transport program MCNP (Monte Carlo N-Particle), the core code of the GPU implementation usually runs to more than ten thousand lines, and a large amount of it is dead code in any particular run. Compared with a CPU, a GPU has a smaller instruction cache and is therefore very sensitive to the size of the generated assembly code. In addition, GPU code generally uses the _inline_ directive to inline sub-functions, so the whole kernel program must be optimized globally at compile time. Dead code both increases the size of the generated assembly code and weakens the effect of global optimization, which seriously degrades GPU execution speed. The main existing methods for accelerating GPU execution are the following:

(1) Placing read-only data in GPU constant memory to improve memory access speed.

(2) Placing frequently accessed data in on-chip GPU shared memory to improve memory access speed.

(3) Improving memory access speed through coalesced access to GPU global memory.

(4) Adjusting the thread block size of the GPU kernel program to improve register utilization and execution efficiency.

All four kinds of methods have limitations. GPU constant memory has limited capacity and can hold only read-only data; on-chip shared memory has limited capacity and suffers from bank conflicts; GPU global memory has high access latency; and for a large GPU kernel program the optimal thread block size can usually be found only by trying candidate sizes one by one, which is inefficient.

Dead code is a code segment that could be executed while the program runs but is not executed in the actual run. Dead code increases the size of the assembly code, seriously hampers program optimization and scheduling, and lowers running efficiency. For large-scale applications of practical value, the assembly code of the GPU kernel program is considerable in size while the GPU instruction cache is limited. Dead code in the kernel program therefore increases pressure on the instruction cache, wastes scarce cache space, seriously hampers the scheduling and optimization of the GPU core code, and lengthens its running time. Deleting dead code is thus essential for improving the execution efficiency of large-scale GPU programs and speeding them up.

Bodik proposed a method for identifying dead assignment statements during compiler optimization; Sweeney proposed a method for detecting unreachable procedures in object-oriented programs; Xi proposed a method for detecting useless function parameters; and Zhang Guangmei detected invalid program branches from the program's control structure. These dead code detection techniques are all compiler-based and emphasize the derivation and proof of the underlying principles. Their theoretical proofs are thorough and credible, but they are complicated to apply and hard to operate in practice, which makes them difficult to popularize.

For large-scale GPU kernel programs of practical value, removing dead code can significantly increase execution efficiency. If dead code can be detected and deleted, GPU execution will be accelerated, yet no published literature currently addresses a technical scheme for doing so.

Summary of the invention

The technical problem to be solved by the invention is the low execution efficiency of large-scale GPU kernel programs. Under the premise that program correctness is preserved, the invention proposes a method for accelerating GPU execution speed through dead code removal, which improves the execution and compilation efficiency of large-scale GPU kernel programs.

To solve the above technical problem, the technical scheme of the invention is as follows:

Step 1: construct the function state detection table. A state detection table is built for all functions in the large-scale GPU kernel program; the number of entries in the table equals the number of functions in the GPU kernel program. The table has six fields: the function number ID, the function name Name, the call flag Callee, the static analysis flag Static, the dynamic execution flag Dynamic, and the delete flag Del. ID is the globally unique identifier of the function, and Name is the function's name. Callee indicates whether the function is called by the program: true means it is called, false means it is not. Static records the verdict of static analysis on whether the function can execute: true means the function may run when the program executes, false means it cannot. Dynamic indicates whether the function was executed when the program ran: true means it ran, false means it did not. Del indicates whether the function's code segment can be deleted: true means the code segment is dead code and should be deleted, false means it is not dead code and should not be deleted.
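As a minimal sketch, one entry of the state detection table could be represented as follows. The field names come from the text; the class name `StateEntry`, the Python representation, and the example function name are assumptions made for illustration, and the default values mirror the initial values given in the third step.

```python
from dataclasses import dataclass


@dataclass
class StateEntry:
    """One row of the state detection table (field names follow the text)."""
    ID: int                 # globally unique function number
    Name: str               # function name
    Callee: bool = True     # True: the function is called by the program
    Static: bool = True     # True: static analysis says it may run
    Dynamic: bool = True    # True: it ran during the observed execution
    Del: bool = False       # True: dead code, should be deleted


# "transport_kernel" is a hypothetical function name.
entry = StateEntry(ID=0, Name="transport_kernel")
```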

Step 2: record basic function information. Scan the program code, assign each function code segment in the GPU kernel program a unique function number starting from 0, and write the number into the ID field of the state detection table; record the name of the function numbered q in the Name field of the entry whose ID field is q. For a GPU kernel program with N functions, the table has N entries whose ID fields run from 0 to N-1, with 0 ≤ q ≤ N-1.

Step 3: initialize the state detection table. For every entry in the table, initialize the call flag Callee to true, the static analysis flag Static to true, the dynamic execution flag Dynamic to true, and the delete flag Del to false.
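The numbering and initialization described above can be sketched as below. This assumes the function names have already been extracted from the source; the helper `build_table`, the dictionary layout, and the example names are hypothetical.

```python
def build_table(function_names):
    """Number the functions from 0 and set the initial flag values."""
    return [
        {"ID": q, "Name": name,
         "Callee": True, "Static": True, "Dynamic": True, "Del": False}
        for q, name in enumerate(function_names)
    ]


# Hypothetical function names for a kernel program with N = 3 functions.
table = build_table(["main_kernel", "scatter", "absorb"])
```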

Step 4: perform static analysis on the GPU program source code and update the fields of each function entry in the state detection table according to the static analysis results, as follows:

4.1 Obtain the total number N of entries in the state detection table and initialize q = 0;

4.2 If the function corresponding to entry q cannot be called, set the entry's Callee field to false; otherwise set it to true;

4.3 If the Callee field of entry q is true, set the entry's Static field to true; otherwise set it to false;

4.4 Update q = q + 1; if q < N, go to 4.2; otherwise go to Step 5.
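The text does not say how "cannot be called" is decided. One plausible reading, sketched here purely as an assumption, is reachability in a call graph starting from the kernel entry points. The `call_graph` input, the `entry_points` argument, and the example names are all hypothetical.

```python
def static_analysis(table, call_graph, entry_points):
    """Set Callee and Static for every entry of the table.

    call_graph maps a function name to the names it calls (assumed input).
    """
    reachable, stack = set(), list(entry_points)
    while stack:                                  # depth-first reachability
        name = stack.pop()
        if name not in reachable:
            reachable.add(name)
            stack.extend(call_graph.get(name, ()))
    for entry in table:                           # 4.1 / 4.4: q = 0 .. N-1
        entry["Callee"] = entry["Name"] in reachable   # 4.2
        entry["Static"] = entry["Callee"]              # 4.3
    return table


table = [
    {"Name": "main_kernel", "Callee": True, "Static": True},
    {"Name": "scatter", "Callee": True, "Static": True},
    {"Name": "absorb", "Callee": True, "Static": True},
]
call_graph = {"main_kernel": ["scatter"], "scatter": []}
static_analysis(table, call_graph, entry_points=["main_kernel"])
```

In this example `absorb` is never referenced from the entry point, so both of its flags become false.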

Step 5: run the GPU kernel program, record its runtime information, and update the fields of each function entry in the state detection table, as follows:

5.1 Obtain the total number N of entries in the state detection table and initialize q = 0;

5.2 If the function corresponding to entry q was not executed, set the entry's Dynamic field to false; otherwise set it to true;

5.3 If the Static and Dynamic fields of entry q are false, set the entry's Del field to true; otherwise set it to false;

5.4 Update q = q + 1; if q < N, go to 5.2; otherwise go to Step 6.
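A sketch of this update loop follows. Two assumptions are made: the set `executed` of function names that actually ran is supplied by some profiling mechanism the text does not specify, and the condition in 5.3 is read as "both Static and Dynamic are false". The helper name and example data are hypothetical.

```python
def dynamic_update(table, executed):
    """Set Dynamic from the observed run, then derive Del."""
    for entry in table:                                # 5.1 / 5.4: q = 0 .. N-1
        entry["Dynamic"] = entry["Name"] in executed   # 5.2
        # 5.3, read here as "both flags are false" (an assumption)
        entry["Del"] = not entry["Static"] and not entry["Dynamic"]
    return table


table = [
    {"Name": "main_kernel", "Static": True, "Dynamic": True, "Del": False},
    {"Name": "scatter", "Static": True, "Dynamic": True, "Del": False},
    {"Name": "absorb", "Static": False, "Dynamic": True, "Del": False},
]
dynamic_update(table, executed={"main_kernel"})
```

Under this reading an entry with Static true and Dynamic false, such as `scatter` here, is left unmarked; a variant that treats either flag being false as sufficient would replace `and` with `or`.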

Step 6: mark dead code. Define the sets S_0, S_1, ..., S_i, ..., S_{M-1} as empty sets. By the definition of the fields of the state detection table, a function whose entry has Del equal to true is a dead-code function. Put the first function whose Del field is true into S_0, the second such function into S_1, ..., and the M-th such function into S_{M-1}, where 0 ≤ i < M ≤ N and M is the number of functions marked as dead code.
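As a small illustration, the candidate sets can be collected in table order as a list of one-element sets. The helper name `mark_dead_candidates` and the example entries are hypothetical.

```python
def mark_dead_candidates(table):
    """Return [S_0, ..., S_{M-1}]: one set per entry whose Del flag is true."""
    return [{entry["Name"]} for entry in table if entry["Del"]]


table = [
    {"Name": "main_kernel", "Del": False},
    {"Name": "absorb", "Del": True},
    {"Name": "fission", "Del": True},
]
S = mark_dead_candidates(table)
M = len(S)   # number of functions marked as dead code
```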

Step 7: final confirmation of dead code, as follows:

7.1 Initialize the dead-code function set D to the empty set and define the variable j = 0;

7.2 Using the comment syntax of the language in which the GPU kernel program is written, comment out the function code segment corresponding to S_j so that it is not executed;

7.3 Run the program with S_j commented out and compare the results; if the result is correct, update the dead code set D = D ∪ S_j; otherwise go to 7.4;

7.4 Remove the comment from the function code segment corresponding to the current S_j;

7.5 Update j = j + 1;

7.6 If j < M, go to 7.2; otherwise go to Step 8.
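The confirmation loop can be sketched as follows. `run_with_disabled` is a hypothetical callback that builds and runs the program with the named functions commented out and returns its result. Treating a failed build or run as an incorrect result is an added assumption, as is starting every trial from the unmodified source, which stands in for removing the comment in 7.4.

```python
def confirm_dead_code(S, run_with_disabled, reference_result):
    """Keep only candidates whose removal leaves the result unchanged."""
    D = set()                                  # 7.1: D is empty, j = 0
    for S_j in S:                              # 7.5 / 7.6: j = 0 .. M-1
        try:
            result = run_with_disabled(S_j)    # 7.2 and 7.3: comment out, run
        except Exception:
            result = None                      # assumption: a failure counts as wrong
        if result == reference_result:
            D |= S_j                           # 7.3: D = D ∪ S_j
        # 7.4 is implicit: each call starts again from the unmodified source
    return D


def fake_run(disabled):
    """Stand-in for a real build-and-run step (illustration only)."""
    if "scatter" in disabled:
        raise RuntimeError("undefined reference to scatter")
    return 42


D = confirm_dead_code([{"absorb"}, {"scatter"}], fake_run, reference_result=42)
```

Here `absorb` is confirmed as dead code, while `scatter` is rejected because the program no longer produces a result without it.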

Step 8: dead code deletion. Dead code is deleted according to the final dead code set D, as follows:

8.1 Obtain the number P of elements in D, with 0 ≤ P ≤ M ≤ N, and define the variable k = 0;

8.2 Delete the function code segment corresponding to the k-th element of D. The deletion is carried out as follows:

To keep the program readable and easy to maintain, dead code is not physically deleted from the program. Instead, each function code segment whose Del field in the state detection table is true is handled at the precompilation stage: precompilation (preprocessor) directives are added before and after the function code segment so that the function is commented out, and the source code need not be deleted directly.

8.3 Update k = k + 1; if k < P, go to 8.2; otherwise go to Step 9.
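The text does not name the directives it adds around a dead function. The sketch below assumes a C-style `#if 0` / `#endif` pair, which is one common way to disable code without deleting it. The helper `disable_function`, its line-range arguments, and the sample source are hypothetical; locating a function's first and last line is left outside the sketch.

```python
def disable_function(source, start, end, name):
    """Wrap lines start..end (0-based, inclusive) in #if 0 / #endif."""
    lines = source.splitlines()
    wrapped = (lines[:start]
               + [f"#if 0  /* dead code: {name} */"]
               + lines[start:end + 1]
               + ["#endif"]
               + lines[end + 1:])
    return "\n".join(wrapped)


src = (
    "__device__ int absorb(int x)\n"
    "{\n"
    "    return x + 1;\n"
    "}\n"
    "__global__ void k() {}"
)
out = disable_function(src, 0, 3, "absorb")
```

The original function text stays in the file, so restoring it only requires removing the two added lines.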

Step 9: end.

Compared with the prior art, the invention achieves the following technical effect:

1. The invention is applicable to all kinds of large-scale GPU kernel programs. By removing dead code that is not executed at run time, it reduces the code size of the GPU kernel program, which shortens compilation and optimization time and reduces the size of the generated assembly code. This improves the hit rate of SIMD (Single Instruction Multiple Data) instruction scheduling in the GPU and significantly improves the running efficiency of large-scale GPU kernel programs.

Description of drawings

Fig. 1 shows the structure of the dead code state detection table.

Fig. 2 is the overall flowchart of the invention.

Embodiment

Fig. 1 shows the structure of the dead code state detection table. The table is built as follows:

The number of entries in the state detection table equals the number of functions in the GPU kernel program. The table has six fields: the function number ID, the function name Name, the call flag Callee, the static analysis flag Static, the dynamic execution flag Dynamic, and the delete flag Del. ID is the globally unique identifier of the function, and Name is the function's name. Callee indicates whether the function is called by the program: true means it is called, false means it is not. Static records the verdict of static analysis on whether the function can execute: true means the function may run when the program executes, false means it cannot. Dynamic indicates whether the function was executed when the program ran: true means it ran, false means it did not. Del indicates whether the function's code segment can be deleted: true means the code segment is dead code and should be deleted, false means it is not dead code and should not be deleted.

Fig. 2 is the overall flowchart of the invention. Its implementation steps are as follows:

Step 1: construct the function state detection table.

Step 2: record basic function information.

Step 3: initialize the state detection table.

Step 4: perform static analysis on the GPU program source code.

Step 5: update the state detection table with runtime information.

Step 6: mark dead code.

Step 7: final confirmation of dead code.

Step 8: dead code deletion.

Step 9: end.

Claims

1. A method for accelerating GPU execution speed through dead code removal, characterized by comprising the following steps:

Step 1: build a state detection table for all functions in the large-scale GPU kernel program, the number of entries in the table being the number of functions in the GPU kernel program; the table has six fields: the function number ID, the function name Name, the call flag Callee, the static analysis flag Static, the dynamic execution flag Dynamic, and the delete flag Del; ID is the globally unique identifier of the function, and Name is the function's name; Callee indicates whether the function is called by the program, true meaning it is called and false meaning it is not; Static records the verdict of static analysis on whether the function can execute, true meaning the function may run when the program executes and false meaning it cannot; Dynamic indicates whether the function was executed when the program ran, true meaning it ran and false meaning it did not; Del indicates whether the function's code segment can be deleted, true meaning the code segment is dead code and should be deleted and false meaning it is not dead code and should not be deleted;

Step 2: record basic function information: scan the program code, assign each function code segment in the GPU kernel program a unique function number starting from 0, and write the number into the ID field of the state detection table; record the name of the function numbered q in the Name field of the entry whose ID field is q; for a GPU kernel program with N functions, the table has N entries whose ID fields run from 0 to N-1, with 0 ≤ q ≤ N-1;

Step 3: initialize the state detection table: for every entry in the table, initialize the call flag Callee to true, the static analysis flag Static to true, the dynamic execution flag Dynamic to true, and the delete flag Del to false;

4.1 Obtain the total number N of entries in the state detection table and initialize q = 0;

4.4 Update q = q + 1; if q < N, go to 4.2; otherwise go to Step 5;

5.1 Obtain the total number N of entries in the state detection table and initialize q = 0;

5.2 If the function corresponding to entry q was not executed, set the entry's Dynamic field to false; otherwise set it to true;

5.3 If the Static and Dynamic fields of entry q are false, set the entry's Del field to true; otherwise set it to false;

5.4 Update q = q + 1; if q < N, go to 5.2; otherwise go to Step 6;

Step 6: mark dead code: define the sets S_0, S_1, ..., S_i, ..., S_{M-1} as empty sets; put the first function whose Del field in the state detection table is true into S_0, the second such function into S_1, ..., and the M-th such function into S_{M-1}, where 0 ≤ i < M ≤ N and M is the number of functions marked as dead code;

Step 7: final confirmation of dead code, as follows:

7.1 Initialize the dead-code function set D to the empty set and define the variable j = 0;

7.3 Run the program with S_j commented out and compare the results; if the result is correct, update the dead code set D = D ∪ S_j; otherwise go to 7.4;

7.4 Remove the comment from the function code segment corresponding to the current S_j;

7.5 Update j = j + 1;

7.6 If j < M, go to 7.2; otherwise go to Step 8;

Step 8: delete dead code according to the final dead code set D, as follows:

8.2 Delete the function code segment corresponding to the k-th element of the dead code set D;

8.3 Update k = k + 1; if k < P, go to 8.2; otherwise go to Step 9;

Step 9: the optimization ends.

2. The method for accelerating GPU execution speed through dead code removal according to claim 1, characterized in that the function code segment corresponding to the k-th element of the dead code set D is deleted as follows: each function code segment whose Del field in the state detection table is true is handled at the precompilation stage, namely precompilation (preprocessor) directives are added before and after the function code segment so that the function is commented out.