CN106843993A

CN106843993A - Method and system for reverse parsing of GPU instructions

Info

Publication number: CN106843993A
Application number: CN201611215249.XA
Authority: CN
Inventors: 谭光明; 张秀霞
Original assignee: Chinese Academy Of Sciences State Owned Assets Management Co ltd; Institute of Computing Technology of CAS
Current assignee: Chinese Academy Of Sciences State Owned Assets Management Co ltd; Institute of Computing Technology of CAS
Priority date: 2016-12-26
Filing date: 2016-12-26
Publication date: 2017-06-13
Anticipated expiration: 2036-12-26
Also published as: CN106843993B

Abstract

The present invention provides a method and system for reverse parsing of GPU instructions, relating to GPU microarchitecture, compiler code generation, and program optimization. The method comprises: compiling GPU instructions to generate a compiled file; disassembling the compiled file to generate a disassembly file; and using an assembly parser to represent the disassembly file as instMap variables, whose variable types include opcode, modifier, instruction, operand, and the corresponding operand type. The instMap variables are input to a decoding solver, which determines the variable type of each instMap variable and looks up the corresponding encoding through the opcode or modifier already determined. On the basis of the cracked instruction encodings and with reference to the PTX documentation, the invention makes it possible to construct a GPU assembler; to provide auxiliary compilation functions for GPU compilers, improving the efficiency of GPU programs; and to design and standardize a series of microbenchmark programs for probing GPU microarchitecture characteristics and parameters.

Description

A method and system for reverse parsing of GPU instructions

Technical field

The present invention relates to the technical fields of GPU microarchitecture, compiler code generation, and program optimization, and more particularly to a method and system for reverse parsing of GPU instructions.

Background art

For many years, GPU vendors have offered users only the high-level APIs wrapped by their drivers, exposing as little as possible of the internal principles and details, such as the driver's software architecture, the GPU microarchitecture, and the instruction set. As a result, academic research on GPU architecture has lagged significantly behind industry and stagnated for a long time. In the era when GPUs were used only for graphics acceleration, this conservative strategy was not a prominent problem in practice and was even reasonable to some extent. First, early graphics APIs were strongly tied to the hardware implementation: the API was a thin wrapper around hardware functions and itself exposed sufficient hardware detail, and early versions of the OpenGL interface were even called the assembly language of 3D graphics. Second, very few developers called the graphics APIs directly: games were built on rendering engines, so as long as graphics card vendors helped optimize the mainstream rendering engines, most games would run smoothly. Consequently, graphics card vendors could improve and innovate on the microarchitecture more freely, without maintaining backward compatibility at the lowest level; backward compatibility was needed only at the software-wrapped graphics API layer.

Nvidia, the GPU vendor in a monopoly position, has maintained this inertia of technical closedness: it provides no assembler, does not support programming at the lowest assembly level, and does not disclose the hardware architecture features that can be controlled only at the assembly level. As a supplement, it provides the low-level interface PTX. Although PTX is an intermediate representation very close to assembly, its control over the hardware is weaker than that of assembly; for example, PTX cannot control register allocation and cannot precisely control instruction scheduling. The C-like interface at the upper layer has even weaker control over the hardware, so developers can only rely on compiler optimization to improve performance. However, "Daniel J Bernstein, Hsieh-Chung Chen, Chen-Mou Cheng, Tanja Lange, Ruben Niederhagen, Peter Schwabe, and Bo-Yin Yang. Usable assembly language for gpus: a success story. IACR Cryptology ePrint Archive, 2012:137, 2012." points out that the code generated by NVCC, the compiler provided by Nvidia, is not efficient; for example, its register allocation causes a large number of bank conflicts. In fact, many of the parallel algorithm libraries released by Nvidia are built with an internal assembler and then hand-optimized in assembly to reach the desired efficiency. The problem is that, unlike the 3D graphics field with its small number of rendering-engine developers, the GPGPU user base is broad and diverse. Nvidia has hand-optimized only a small number of algorithm libraries in assembly and provides official support only to a few large customers, while the many remaining users cannot squeeze the maximum performance out of expensive GPGPU hardware, which is a huge waste of computing resources. Worse still, Nvidia has not fully optimized even many widely used core algorithms. For example, for single-precision floating-point matrix multiplication (single-precision matrix multiplication), Nvidia's hand-optimized assembly version for the mainstream Kepler architecture reaches only 74% of the theoretical peak, and when a third-party optimized single-precision matrix multiplication is compared with SGEMM in NVIDIA's cuBLAS, the third party's assembly-optimized code outperforms cuBLAS. These studies show that assembly-level optimization is very valuable for exploiting GPU performance.

Some researchers have made scattered progress on GPU performance tuning and tools, such as microbenchmark programs ("Zhang, Yao, and John D. Owens. "A quantitative performance analysis model for GPU architectures." In 2011 IEEE 17th International Symposium on High Performance Computer Architecture, pp. 382-393. IEEE, 2011."; "Xinxin Mei, Kaiyong Zhao, Chengjian Liu, and Xiaowen Chu. Benchmarking the memory hierarchy of modern gpus. In Network and Parallel Computing, pages 144-156. Springer, 2014."; "Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos. Demystifying gpu microarchitecture through microbenchmarking. In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, pages 235-246. IEEE, 2010."), assemblers, and GPU assembly-level optimization. However, each of these works concentrates on a single aspect, and none proposes a general technique that carries over to new generations of architectures, such as an instruction-cracking method and corresponding automated tools. In addition, GPUs still lack open assembly-level benchmarks; most benchmarks currently in use are based on CUDA, which makes their test results unreliable.

Summary of the invention

In view of the shortcomings of the prior art, the present invention proposes a method and system for reverse parsing of GPU instructions.

The present invention proposes a method for reverse parsing of GPU instructions, comprising:

Step 1: GPU instructions are compiled to generate a compiled file, the compiled file is disassembled to generate a disassembly file, and an assembly parser represents the disassembly file as instMap variables, wherein the variable types of the instMap variables include opcode, modifier, instruction, operand, and the corresponding operand type;

Step 2: the instMap variables are input to a decoding solver, which determines the variable types of the instMap variables and looks up the corresponding encoding through the opcode or modifier already determined.

The decoding solver probes each bit of the 64-bit encoding of the instMap variables in turn and disassembles the resulting 64-bit encoding again. If the instruction name of the newly disassembled instruction differs from that of the original instruction of the 64-bit encoding, the current bit of the 64-bit encoding represents the opcode; based on the bits so identified, the opcodes are enumerated in the encoding space.

The instruction name and operand types in the instMap variables are used as the key to query a visited dictionary. For each instruction in the instMap variables, the remaining bits of the encoding are probed, the operand whose bit was changed is returned, and <instruction, operand type> is marked as 1 in the visited dictionary, indicating that it has been visited.

Modifiers are detected by XORing the instructions in the instMap variables bit by bit and observing whether the modifier changes. After the modifier encoding space of each instruction is found, the modifiers are enumerated within that space and the names of all modifiers are found. Then, for the name of a given modifier, the intersection of all instructions carrying that modifier is computed. Finally, this encoding is XORed with the opcode encoding of the instructions carrying that modifier to obtain the encoding of the modifier.

The encoding corresponding to an operand is obtained from the operand's name.

The present invention also proposes a system for reverse parsing of GPU instructions, characterized by comprising:

A variable generation module, for compiling GPU instructions to generate a compiled file, disassembling the compiled file to generate a disassembly file, and representing the disassembly file as instMap variables by means of an assembly parser, wherein the variable types of the instMap variables include opcode, modifier, operand, and the corresponding operand type;

An encoding lookup module, for inputting the instMap variables to a decoding solver, which determines the variable types of the instMap variables and looks up the remaining encodings through the opcode or modifier already determined.

Modifiers are detected by XORing the instructions corresponding to the encodings in the instMap variables bit by bit and observing whether the modifier changes. After the modifier encoding space of each instruction is found, the modifiers are enumerated within that space and the names of all modifiers are found. Then, for the name of a given modifier, the intersection of all instructions carrying that modifier is computed. Finally, this encoding is XORed with the opcode encoding of the instructions carrying that modifier to obtain the encoding of the modifier.

The encoding corresponding to an operand is obtained from the operand's name.

From the above scheme, the advantages of the present invention are as follows.

The invention can effectively cope with the closed GPU technology ecosystem and the limits it places on compilation and program optimization:

1. Because NVIDIA does not publish its instruction encodings, the method proposed by the present invention parses out the GPU instruction encodings on the basis of the existing NVIDIA toolchain. The basic instruction format is shown in Fig. 1: bits 63-54 represent the opcode, bits 42-23 represent a 20-bit immediate, bits 21-18 represent the predicate register, bits 17-10 represent the source register, and bits 9-2 represent the destination register; bits 1-0 are left unspecified here. The specific fields depend on the instruction syntax;

2. On the basis of the cracked instruction encodings and with reference to the PTX documentation, a GPU assembler can be constructed;

3. Auxiliary compilation functions can be provided for GPU compilers, improving the efficiency of GPU programs;

4. A series of microbenchmark programs can be designed and standardized to probe GPU microarchitecture characteristics and parameters.
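The field layout listed in item 1 can be illustrated with a small Python sketch. The bit ranges come from the text; the field names, the assumption that bit 0 is the least significant bit, and the sample word are illustrative only and are not taken from the patent.

```python
# Field layout from item 1 above; ranges are inclusive (high bit, low bit).
# Assumption: bit 0 is the least significant bit of the 64-bit word.
FIELDS = {
    "opcode": (63, 54),
    "imm20": (42, 23),
    "predicate": (21, 18),
    "src_reg": (17, 10),
    "dst_reg": (9, 2),
}


def extract_field(word: int, hi: int, lo: int) -> int:
    """Return the value stored in bits hi..lo (inclusive) of word."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)


def decode_fields(word: int) -> dict:
    """Split a 64-bit instruction word into the named fields."""
    return {name: extract_field(word, hi, lo) for name, (hi, lo) in FIELDS.items()}


# Hypothetical sample word built from made-up field values.
sample = (0x2A << 54) | (0x9 << 23) | (3 << 18) | (6 << 10) | (5 << 2)
fields = decode_fields(sample)
```

Here `fields["dst_reg"]` is 5 and `fields["src_reg"]` is 6, matching the values packed into the sample word.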

Brief description of the drawings

Fig. 1 is an example diagram of the instruction encoding format;

Fig. 2 is a flowchart of the instruction parsing algorithm;

Fig. 3 is a diagram of the operand solver algorithm (Algorithm 1);

Fig. 4 is a diagram of the opcode solver algorithm (Algorithm 2);

Fig. 5 is a diagram of the modifier solver algorithm (Algorithm 3).

Detailed description of the embodiments

The instruction parsing algorithm flow of the present invention is as follows:

Instruction decoding requires generating the correspondence between 64-bit instruction encodings and assembly instructions. As shown in Fig. 2, the algorithm flow is as follows:

First, a PTX instruction generator automatically generates all instructions in the NVIDIA PTX documentation together with their modifier combinations. These PTX files are then compiled into cubin files with ptxas and disassembled with cuobjdump. Finally, the disassembly information is represented by the assembly parser as instMap variables, which serve as the input of the decoding solver. The structure of instMap includes: opcode, instruction, modifiers, all operands, and the corresponding operand types.
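As a rough illustration of the assembly-parser step, the following Python sketch turns one disassembly line into a record holding the fields the text lists for instMap. The `InstEntry` class, the regular expressions, the assumed line layout (`/*address*/ text; /* 0xencoding */`), and the sample encoding are all assumptions made for this sketch; the patent does not specify the parser or the exact instMap layout.

```python
import re
from dataclasses import dataclass

# Assumed line shape: /*0048*/  LD.U8 R5, [R6+0x20];  /* 0x................ */
LINE_RE = re.compile(
    r"/\*(?P<addr>[0-9a-fA-F]+)\*/\s+(?P<asm>[^;]+);"
    r"\s*/\*\s*0x(?P<enc>[0-9a-fA-F]{16})\s*\*/"
)


@dataclass
class InstEntry:
    """Hypothetical instMap record: one disassembled instruction."""
    opcode: str
    modifiers: list
    operands: list
    operand_types: list
    encoding: int


def classify_operand(text: str) -> str:
    """Coarse operand typing based only on the operand's spelling."""
    if re.fullmatch(r"R\d+", text):
        return "register"
    if re.fullmatch(r"P\d+", text):
        return "predicate"
    if re.fullmatch(r"[cC]\s*\[.+\]\s*\[.+\]", text):
        return "constant"
    if re.fullmatch(r"\[.+\]", text):
        return "memory"
    return "immediate"


def parse_line(line: str):
    """Parse one disassembly line; return None if it holds no instruction."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    mnemonic, _, rest = m.group("asm").strip().partition(" ")
    opcode, *modifiers = mnemonic.split(".")
    operands = [op.strip() for op in rest.split(",") if op.strip()]
    types = [classify_operand(op) for op in operands]
    return InstEntry(opcode, modifiers, operands, types, int(m.group("enc"), 16))


# The encoding below is a made-up placeholder, not a real instruction word.
entry = parse_line("  /*0048*/    LD.U8 R5, [R6+0x20];    /* 0xc4000000101c1814 */")
```

Keeping the operand types as a list alongside the opcode makes it straightforward to use the pair as a dictionary key later, as the operand solver described below requires.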

An operand can be a register (R5), global memory ([R6+0x20]), constant memory (C [0x2] [0x40]), shared memory ([0x50]), an immediate (0x9 and 1.5), or a predicate register (P3). We found that the name of an operand is always related to a number, so the encoding of an operand can be inferred from its name; for example, the binary encoding of register operand R5 is 101, and the encoding of immediate 0x9 is 1001. By contrast, opcodes and modifiers are mnemonics and cannot be converted to binary directly from their names, so they must be enumerated in their encoding spaces. We also found that modifiers are instruction-dependent: a modifier of the same name may be encoded differently in different instructions. For example, the type modifiers of LD and LDG are both named .32, .64, .128, .S16, .U16, .S8, and .U8, but their mask positions differ. Modifiers therefore have to be handled separately for each specific instruction.
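The two examples in the preceding paragraph (R5 and 0x9) can be reproduced with a minimal Python sketch. The function names are hypothetical, and only registers, predicate registers, and hexadecimal immediates are handled; memory operands and floating-point immediates such as 1.5 are deliberately left out because the text does not give their encodings.

```python
import re


def operand_number(name: str) -> int:
    """Return the number embedded in an operand name such as R5, P3 or 0x9."""
    reg = re.fullmatch(r"[RP](\d+)", name)
    if reg:
        return int(reg.group(1))
    if re.fullmatch(r"0[xX][0-9a-fA-F]+", name):
        return int(name, 16)
    raise ValueError(f"operand form not covered by this sketch: {name!r}")


def expected_bits(name: str) -> str:
    """Binary string one would expect to find somewhere in the encoding."""
    return format(operand_number(name), "b")
```

With these definitions, `expected_bits("R5")` gives `"101"` and `expected_bits("0x9")` gives `"1001"`, as stated in the text.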

The opcode solver algorithm of the present invention is described below, as shown in Fig. 3:

Opcodes and modifiers cannot be decoded from their names. Algorithm 1 shows the opcode solving process. Following the PTX pseudo-assembly documentation provided by NVIDIA, PTX code is written, compiled into a cubin with ptxas, and then disassembled with cuobjdump; the resulting assembly code (the instMap variables) is the input of Algorithm 1 (line 1). The opcode solver works as follows. For each instruction line in the disassembly file, every bit of its 64-bit encoding is probed in turn (line 9), mainly by flipping each bit of the instruction (line 11) and then disassembling the result again with the disassembler nvdisasm (line 13). If the instruction name of the newly disassembled instruction differs from that of the original instruction (line 15), the bit represents the opcode and is stored in opBits. Once these bit positions are known, the opcodes can be enumerated in the encoding space.
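The bit-flipping probe of Algorithm 1 can be sketched in Python as follows. Because the real flow re-disassembles each flipped word with nvdisasm, which cannot be reproduced here, the sketch substitutes a toy disassembler; `toy_disassemble`, its opcode table, and the sample word are hypothetical stand-ins, and `find_opcode_bits` is an illustrative name rather than the patent's own code.

```python
# Toy stand-in for the real disassembler: opcode in bits 63-54,
# destination register in bits 9-2, source register in bits 17-10.
TOY_OPCODES = {0x0E4: "MOV", 0x0E5: "IADD"}  # made-up values


def toy_disassemble(word: int) -> str:
    name = TOY_OPCODES.get(word >> 54, "INVALID")
    dst = (word >> 2) & 0xFF
    src = (word >> 10) & 0xFF
    return f"{name} R{dst}, R{src}"


def instruction_name(text: str) -> str:
    """Mnemonic without modifiers, e.g. 'LD.U8 R5, ...' -> 'LD'."""
    return text.split()[0].split(".")[0]


def find_opcode_bits(word: int, disassemble) -> list:
    """Flip each of the 64 bits; keep those whose flip changes the name."""
    original = instruction_name(disassemble(word))
    op_bits = []
    for bit in range(64):
        flipped = word ^ (1 << bit)
        if instruction_name(disassemble(flipped)) != original:
            op_bits.append(bit)
    return op_bits


sample = (0x0E4 << 54) | (6 << 10) | (5 << 2)
op_bits = find_opcode_bits(sample, toy_disassemble)
```

In this toy setting `op_bits` is the list of bits 54 through 63. Passing the disassembler in as a function keeps the probe independent of how the flipped word is actually decoded.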

By writing combinations of different modifiers and verifying them, the modifier encodings are further obtained. However, because the PTX documentation provided by NVIDIA is incomplete, this approach alone cannot guarantee that all opcode and modifier encodings are found, whereas Algorithm 1 allows all instructions to be found.

The operand solver algorithm of the present invention is described below, as shown in Fig. 4:

Algorithm 2 is the solving procedure of the operand decoder. A dictionary (denoted the visited dictionary) is first established. Its key is the name of the instruction containing the operand together with the operand types, and its value marks whether the operand has been decoded. The input is provided by the opcode solver. The operands, their number, and their order depend on the operand types and on the specific instruction, so to mark whether a group of operands has been probed, the name of the specific instruction and its operand types (an array) are used as the key to query the visited dictionary. For each instruction in the disassembly file (line 4), the remaining bits of the encoding are probed (line 8), again by flipping bits in the instruction encoding (line 10). wichChange returns which operand was modified, and the bit positions are then placed into the appropriate array. Finally, <instruction, operand type> is set to 1 in the visited dictionary, indicating that it has been visited (line 18).
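A minimal Python sketch of this operand-solving loop is given below. It again relies on a hypothetical toy disassembler instead of the real tool, probes all 64 bits for simplicity, and uses illustrative names (`which_changed`, `solve_operand_bits`); only the visited dictionary keyed by instruction name and operand types and the flip-and-compare idea are taken from the text.

```python
def toy_disassemble(word: int):
    """Toy stand-in: destination register in bits 9-2, source in bits 17-10."""
    dst = (word >> 2) & 0xFF
    src = (word >> 10) & 0xFF
    return "MOV", [f"R{dst}", f"R{src}"]


def which_changed(old_ops, new_ops):
    """Index of the single operand that differs, or None otherwise."""
    changed = [i for i, (a, b) in enumerate(zip(old_ops, new_ops)) if a != b]
    return changed[0] if len(changed) == 1 else None


def solve_operand_bits(name, operand_types, word, disassemble, visited):
    """Map each operand index to the encoding bits that control it."""
    key = (name, tuple(operand_types))
    if visited.get(key):
        return None  # this <instruction, operand type> pair was already probed
    _, base_ops = disassemble(word)
    positions = {i: [] for i in range(len(base_ops))}
    for bit in range(64):
        _, ops = disassemble(word ^ (1 << bit))
        index = which_changed(base_ops, ops)
        if index is not None:
            positions[index].append(bit)
    visited[key] = 1  # mark as visited
    return positions


visited = {}
sample = (6 << 10) | (5 << 2)
first = solve_operand_bits("MOV", ["register", "register"], sample, toy_disassemble, visited)
second = solve_operand_bits("MOV", ["register", "register"], sample, toy_disassemble, visited)
```

The second call returns `None` because the key is already marked in `visited`, which is the purpose of the dictionary: each combination of instruction and operand types is probed only once.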

The modifier solver algorithm of the present invention is described below, as shown in Fig. 5:

A modifier defines the specific behavior of an instruction. For example, LD has the type modifiers .U8, .S8, .U16, .32, .64, and .128, as well as cache-operation modifiers such as .CS (cache streaming) and .CG (cache at global level). Modifiers are more complex than opcodes: their bit positions span many operation bits and depend on the opcode. For example, for the same type modifier, the positions of the modifier bits in LD and LDG instructions differ. One solution is to XOR instructions bit by bit and then observe whether the modifier changes (lines 6 to 13). After the modifier encoding space of each instruction is found, the modifiers are enumerated within that space (line 15). The next step is to determine the encoding of each specific modifier, such as the encoding of .U8 or of .S8; this corresponds to lines 20 to 29 of the code. First, the names of all modifiers are found (line 20). Then, for the name of a specific modifier, the intersection of all instructions carrying that modifier is computed (lines 23-25). Finally, this encoding is XORed with the opcode encoding of the instruction, leaving only the encoding of the modifier.
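The XOR and intersection steps can be illustrated with a short Python sketch. The opcode value, the modifier bit patterns, and their placement at bits 50-52 are made up for the example, and both function names are hypothetical; the sketch only shows how a bitwise AND over instructions sharing a modifier, followed by an XOR with the opcode encoding, isolates the modifier bits.

```python
from functools import reduce


def modifier_bit_space(variants):
    """Bits that differ among encodings that vary only in their modifier."""
    base = variants[0]
    return reduce(lambda acc, enc: acc | (enc ^ base), variants, 0)


def modifier_encoding(encodings_with_modifier, opcode_encoding):
    """AND all instructions carrying the modifier, then XOR out the opcode."""
    common = reduce(lambda a, b: a & b, encodings_with_modifier)
    return common ^ opcode_encoding


# Made-up toy layout: opcode at bits 63-54, type modifier at bits 52-50.
LD = 0x3 << 54
U8, S8, U16 = 0b001 << 50, 0b010 << 50, 0b100 << 50

# Same instruction and operands, different modifiers.
operands = (6 << 10) | (5 << 2)
space = modifier_bit_space([LD | U8 | operands, LD | S8 | operands, LD | U16 | operands])

# Same modifier, different operands, so no operand bit survives the AND.
u8_bits = modifier_encoding(
    [LD | U8 | (6 << 10) | (5 << 2), LD | U8 | (9 << 10) | (2 << 2)],
    LD,
)
```

The intersection only removes operand bits when the sampled instructions use sufficiently varied operands; with too few samples, bits that happen to be shared would remain in the result.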

The present invention also proposes a system for reverse parsing of GPU instructions, including:

The encoding corresponding to an operand is obtained from the operand's name.

Claims

1. A method for reverse parsing of GPU instructions, characterized by comprising:

Step 1: compiling GPU instructions to generate a compiled file, disassembling the compiled file to generate a disassembly file, and representing the disassembly file as instMap variables by means of an assembly parser, wherein the variable types of the instMap variables include opcode, modifier, instruction, operand, and the corresponding operand type;

Step 2: inputting the instMap variables to a decoding solver, wherein the decoding solver determines the variable types of the instMap variables and looks up the corresponding encoding through the opcode or modifier already determined.

2. The method for reverse parsing of GPU instructions according to claim 1, characterized in that the decoding solver probes each bit of the 64-bit encoding of the instMap variables in turn and disassembles the resulting 64-bit encoding again; if the instruction name of the newly disassembled instruction differs from that of the original instruction of the 64-bit encoding, the current bit of the 64-bit encoding represents the opcode, and, based on the current bit, the opcodes are enumerated in the encoding space.

3. The method for reverse parsing of GPU instructions according to claim 1, characterized in that the instruction name and operand types in the instMap variables are used as the key to query a visited dictionary; for each instruction in the instMap variables, the remaining bits of the encoding are probed, the operand whose bit was changed is returned, and <instruction, operand type> is marked as 1 in the visited dictionary, indicating that it has been visited.

4. The method for reverse parsing of GPU instructions according to claim 1, characterized in that modifiers are detected by XORing the instructions in the instMap variables bit by bit and observing whether the modifier changes; after the modifier encoding space of each instruction is found, the modifiers are enumerated within that space and the names of all modifiers are found; then, for the name of a given modifier, the intersection of all instructions carrying that modifier is computed; finally, this encoding is XORed with the opcode encoding of the instructions carrying that modifier to obtain the encoding of the modifier.

5. The method for reverse parsing of GPU instructions according to claim 1, characterized in that the encoding corresponding to an operand is obtained from the operand's name.

6. A system for reverse parsing of GPU instructions, characterized by comprising:

A variable generation module, for compiling GPU instructions to generate a compiled file, disassembling the compiled file to generate a disassembly file, and representing the disassembly file as instMap variables by means of an assembly parser, wherein the variable types of the instMap variables include opcode, modifier, operand, and the corresponding operand type;

An encoding lookup module, for inputting the instMap variables to a decoding solver, wherein the decoding solver determines the variable types of the instMap variables and looks up the remaining encodings through the opcode or modifier already determined.

7. The system for reverse parsing of GPU instructions according to claim 6, characterized in that the decoding solver probes each bit of the 64-bit encoding of the instMap variables in turn and disassembles the resulting 64-bit encoding again; if the instruction name of the newly disassembled instruction differs from that of the original instruction of the 64-bit encoding, the current bit of the 64-bit encoding represents the opcode, and, based on the current bit, the opcodes are enumerated in the encoding space.

8. The system for reverse parsing of GPU instructions according to claim 6, characterized in that the instruction name and operand types in the instMap variables are used as the key to query a visited dictionary; for each instruction in the instMap variables, the remaining bits of the encoding are probed, the operand whose bit was changed is returned, and <instruction, operand type> is marked as 1 in the visited dictionary, indicating that it has been visited.

9. The system for reverse parsing of GPU instructions according to claim 6, characterized in that modifiers are detected by XORing the instructions corresponding to the encodings in the instMap variables bit by bit and observing whether the modifier changes; after the modifier encoding space of each instruction is found, the modifiers are enumerated within that space and the names of all modifiers are found; then, for the name of a given modifier, the intersection of all instructions carrying that modifier is computed; finally, this encoding is XORed with the opcode encoding of the instructions carrying that modifier to obtain the encoding of the modifier.

10. The system for reverse parsing of GPU instructions according to claim 6, characterized in that the encoding corresponding to an operand is obtained from the operand's name.