CN114253821B - Method and device for analyzing GPU performance and computer storage medium - Google Patents

Method and device for analyzing GPU performance and computer storage medium

Info

Publication number
CN114253821B
CN114253821B
Authority
CN
China
Prior art keywords
instruction
thread
execution
traversed
executing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210192669.XA
Other languages
Chinese (zh)
Other versions
CN114253821A (en)
Inventor
齐航空
张竞丹
李亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Xintong Semiconductor Technology Co ltd
Original Assignee
Xi'an Xintong Semiconductor Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Xintong Semiconductor Technology Co ltd filed Critical Xi'an Xintong Semiconductor Technology Co ltd
Priority to CN202210192669.XA priority Critical patent/CN114253821B/en
Publication of CN114253821A publication Critical patent/CN114253821A/en
Application granted granted Critical
Publication of CN114253821B publication Critical patent/CN114253821B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3457Performance evaluation by simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the invention discloses a method and a device for analyzing GPU performance and a computer storage medium. The method may comprise the following steps: acquiring an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; starting, through a simulation scheduler in a GPU performance model to be analyzed, thread simulators in the GPU performance model according to the number of threads to be started; each thread simulator traversing each instruction in the instruction list and, during the traversal, executing each instruction according to its instruction execution control value, so as to measure the duration of executing the traversed instruction; and, when all the instructions in the instruction list have been traversed, acquiring the total execution duration for all the thread simulators to execute all the instructions in the instruction list.

Description

Method and device for analyzing GPU performance and computer storage medium
Technical Field
The embodiment of the invention relates to the technical field of Graphics Processing Units (GPUs), and in particular to a method and a device for analyzing the performance of a GPU and a computer storage medium.
Background
In GPU performance statistics, Instructions Per Cycle (IPC) is a relatively important GPU performance index; it represents how many instructions in total the GPU can process in each clock cycle. In general, IPC can be calculated from the thread execution time and the main (clock) frequency of the system.
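To make that relation concrete, the following minimal Python sketch computes IPC from a measured execution time and the system clock frequency; the function name and the example numbers are illustrative only and are not taken from the patent:

```python
def instructions_per_cycle(instruction_count: int,
                           execution_time_s: float,
                           clock_frequency_hz: float) -> float:
    """Illustrative IPC calculation: elapsed cycles = execution time x clock frequency,
    and IPC = total instructions / elapsed cycles."""
    cycles = execution_time_s * clock_frequency_hz
    return instruction_count / cycles

# Example (made-up numbers): 2,000,000 instructions in 1 ms at 1 GHz -> IPC = 2.0
print(instructions_per_cycle(2_000_000, 1e-3, 1e9))
```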
GPU performance statistics typically requires modeling the GPU's performance. Two approaches are commonly employed: one is simulation modeling, in which, for example, a simulation model of the GPU is constructed in software and a real execution process is carried out on that model to obtain real performance data of the GPU; the other is analytical modeling, in which, for example, a mapping function (also referred to as an analysis model) is constructed to analyze the inputs of the GPU and compute the corresponding performance result.
Simulation modeling can faithfully reproduce the hardware execution process and yield real simulation data; however, because the simulation model must simulate the execution of a real GPU, its running efficiency is low, and if the GPU architecture is adjusted, the simulation model has to be rebuilt for the adjusted architecture. GPU performance statistics based on simulation modeling therefore suffers from relatively poor scalability and a long development cycle. Analytical modeling, in contrast, does not need to simulate the real execution of instructions: performance results are obtained simply by running the modeling analysis over the input instruction information, so GPU performance statistics based on analytical modeling has very high running efficiency, a simple structural design, and strong scalability. However, if the analysis model does not process the input instructions finely enough, the error rate of the resulting GPU performance statistics is large.
Disclosure of Invention
In view of this, embodiments of the present invention are directed to a method, an apparatus, and a computer storage medium for analyzing GPU performance, which can reduce the error rate of GPU performance statistics based on analytical modeling and provide more accurate performance data about the GPU.
The technical scheme of the embodiment of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a method for analyzing GPU performance, where the method includes:
acquiring an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
starting, through a simulation scheduler in a GPU performance model to be analyzed, thread simulators in the GPU performance model according to the number of threads to be started;
each thread simulator traversing each instruction in the instruction list and, during the traversal, executing each instruction according to its instruction execution control value, so as to measure the duration of executing the traversed instruction;
and, when all the instructions in the instruction list have been traversed, acquiring the total execution duration for all the thread simulators to execute all the instructions in the instruction list.
In a second aspect, an embodiment of the present invention provides an apparatus for analyzing GPU performance, where the apparatus includes an acquisition part, a simulation scheduler, thread simulators, and a statistics part; wherein,
the acquisition part is configured to acquire an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
the simulation scheduler is configured to start the thread simulators in the GPU performance model to be analyzed according to the number of threads to be started;
each thread simulator is configured to traverse each instruction in the instruction list and, during the traversal, execute the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction, so as to measure the duration of executing the currently traversed instruction;
the statistics part is configured to acquire the total execution duration for all the thread simulators to execute all the instructions in the instruction list when all the instructions in the instruction list have been traversed.
In a third aspect, embodiments of the present invention provide a computer storage medium storing a program for analyzing GPU performance, where the program for analyzing GPU performance implements the method steps for analyzing GPU performance of the first aspect when executed by at least one processor.
Embodiments of the invention provide a method and an apparatus for analyzing GPU performance and a computer storage medium. During performance analysis, the execution duration of each type of instruction in the instruction list of the target program is measured for each simulated thread in the GPU performance model, and the total execution duration is accumulated; the simulation therefore covers the processing of all current GPU instruction types, which reduces the error rate of GPU performance statistics based on analytical modeling and provides more accurate performance data about the GPU.
Drawings
FIG. 1 is a diagram illustrating sequential instructions executed in multiple threads in SIMT according to one embodiment of the present invention.
Fig. 2 is a schematic flowchart of a method for analyzing GPU performance according to an embodiment of the present invention.
Fig. 3 is a flowchart illustrating obtaining an instruction list of a target program and an execution result according to an embodiment of the present invention.
Fig. 4 is a flowchart illustrating a specific implementation of measuring, by each thread simulator, a time length for executing a traversed instruction in the process of traversing an instruction list and obtaining a total execution time length for all the thread simulators to execute all the instructions in the instruction list according to the embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an apparatus for analyzing GPU performance according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Currently, the instructions a GPU can process fall broadly into three types: arithmetic logic instructions, memory access instructions, and branch jump instructions. Conventional GPU performance statistics schemes based on analytical modeling cover arithmetic logic instructions and memory access instructions, and can accurately calculate the execution duration of each such instruction. Branch jump instructions, however, are not handled by these conventional schemes, even though they have a great influence on instruction execution performance. Embodiments of the present invention therefore aim to provide a scheme for analyzing GPU performance that can accurately capture how each thread executes branch jump instructions, so that the performance of GPU instruction execution can be analyzed more accurately than with conventional schemes.
It should be noted that the GPU executes instructions in a Single-Instruction Multiple-Thread (SIMT) manner: each instruction is fetched once and multiple threads are then scheduled to execute it simultaneously. Under SIMT, if a branch jump instruction is encountered during execution, different branch paths are entered depending on the data processed by each thread. In general, when a GPU executes a branch jump instruction it continues to fetch instructions sequentially, while each thread controls whether its current instruction needs to be executed through a mask bit: if the thread's current mask bit is true, the current instruction is executed; if the mask bit is false, the current instruction does not enter the execution pipeline. This is how the processing and execution of branch jump instructions is controlled in SIMT. Taking fig. 1 as an example, the GPU is configured with n+1 threads, identified as Thread0, Thread1, Thread2, ..., Threadn. The instruction sequence, shown on the left of fig. 1, is: ADD, SUB, IF A, ADD, SUB, ELSE, MUL, DIV, ENDIF, ADD, and SUB. The branch jump instructions are therefore IF A and ELSE. When a branch jump instruction occurs while this sequence is executed in SIMT, the GPU schedules different threads to execute different instruction segments: in fig. 1, the mask bit of threads Thread1 and Thread2 is false while the instruction segment of the branch jump instruction IF A is executed, so that segment does not enter Thread1 and Thread2 for execution (indicated by the dashed box in fig. 1); likewise, the mask bit of threads Thread0 and Threadn is false while the instruction segment of the branch jump instruction ELSE is executed, so that segment does not enter Thread0 and Threadn for execution (also indicated by a dashed box in fig. 1).
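As an illustration of this mask-bit mechanism, the following minimal Python sketch replays the instruction sequence of fig. 1 for a few threads. The mask table, the thread count, and the helper names are assumptions made purely for the example and are not taken from the patent:

```python
# Illustrative SIMT replay of the fig. 1 instruction sequence (not the patent's implementation).
# A false mask bit means the thread skips the instruction, i.e. it behaves like a NOP.
instructions = ["ADD", "SUB", "IF A", "ADD", "SUB", "ELSE", "MUL", "DIV", "ENDIF", "ADD", "SUB"]

IF_BODY = range(3, 5)    # the ADD/SUB segment guarded by "IF A"
ELSE_BODY = range(6, 8)  # the MUL/DIV segment guarded by "ELSE"

def mask_for(thread_id: int, pc: int) -> bool:
    """Assumed mask table: threads 1 and 2 skip the IF body, the other threads skip the ELSE body."""
    if pc in IF_BODY:
        return thread_id not in (1, 2)
    if pc in ELSE_BODY:
        return thread_id in (1, 2)
    return True

num_threads = 4  # stands in for the n+1 threads of fig. 1
for pc, opcode in enumerate(instructions):   # instructions are fetched sequentially
    executing = [t for t in range(num_threads) if mask_for(t, pc)]
    print(f"{pc:2d} {opcode:5s} executed by threads {executing}")
```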
In conjunction with the above description of sequential instruction fetching when a GPU executes instructions in the SIMT manner and of using a mask bit to control whether a thread executes the current instruction, fig. 2 shows a method for analyzing GPU performance according to an embodiment of the present invention; the method may include:
S201: acquiring an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
S202: starting, through a simulation scheduler in a GPU performance model to be analyzed, thread simulators in the GPU performance model according to the number of threads to be started;
S203: each thread simulator traversing each instruction in the instruction list and, during the traversal, executing the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction, so as to measure the duration of executing the currently traversed instruction;
S204: when all the instructions in the instruction list have been traversed, acquiring the total execution duration for all the thread simulators to execute all the instructions in the instruction list.
With the technical solution shown in fig. 2, the target program is first run in the set environment to obtain the instruction list and the execution result of each thread for each instruction in the list; then, during performance analysis, the execution duration of each type of instruction in the instruction list is measured for each simulated thread in the GPU performance model and the total execution duration is accumulated. The simulation therefore covers the processing of all current GPU instruction types, which reduces the error rate of GPU performance statistics based on analytical modeling and provides more accurate performance data about the GPU. Moreover, the architecture of the GPU to be analyzed can be adjusted conveniently, so that the performance of the same instructions can be analyzed under different GPU architectures.
Based on the technical solution shown in fig. 2, in some possible implementation manners, the obtaining an instruction list obtained by running the target program in a set environment, the number of threads to be started, and an execution result of each thread on each instruction in the instruction list includes:
running the target program through a real environment or a simulation environment, and acquiring an instruction list of the target program, the number of threads needing to be started and an execution result of executing each instruction by each thread in the running process; wherein the execution result includes an operand register value and an instruction execution control value.
For the above implementation, in detail, the target program may, for example, be an application program selected for GPU performance analysis. As shown in fig. 3, the application program may be run in an existing real running environment or in a simulation running environment; the instruction code is detected, and the instruction list of the application program is obtained from the detected instruction code. In addition, while the application program is run in the SIMT manner, the number of threads enabled by the real or simulation running environment for executing the instruction list can be tracked, along with the execution result of each thread for each instruction in the instruction list. The embodiment of the present invention preferably represents each execution result as a list of the form [dest, src0, src1, mask], where dest, src0, and src1 are operand register values: dest is the destination operand register value, i.e., the result of executing the instruction; src0 is the first source operand register value; and src1 is the second source operand register value; mask is the instruction execution control value. As described above, the mask value controls whether a thread needs to execute the current instruction: if the thread's current mask value is true, the current instruction is executed; if the mask value is false, the current instruction does not enter the execution pipeline. Furthermore, depending on the instruction type, not all of dest, src0, and src1 need to be present in an execution result, but dest and the mask value appear in every execution result. For example, referring to Table 1, the number of enabled threads is set to n+1, identified as T0, T1, T2, ..., Tn; for each instruction of the instruction list shown in the first column, each thread has a corresponding execution result, with the specific contents shown in Table 1.
TABLE 1
[Table 1 is reproduced as an image in the original publication; it lists each instruction of the instruction list in the first column and, for each of the threads T0, T1, T2, ..., Tn, the corresponding execution result [dest, src0, src1, mask].]
As can be seen from Table 1, IF R3 and ELSE R3 in the instruction list are branch jump instructions, each followed by a corresponding instruction segment. For IF R3, Table 1 shows that the mask value of threads T0 and Tn is true, indicating that they execute the instruction segment corresponding to IF R3, which consists of the instructions "ADD R4, R1, R2" and "SUB R5, R4, R1"; the mask value of threads T1 and T2 is false, indicating that they do not execute the instruction segment corresponding to IF R3. Accordingly, while threads T0 and Tn execute the instruction segment containing "ADD R4, R1, R2" and "SUB R5, R4, R1", threads T1 and T2 may be viewed as executing NOP instructions.
For ELSE R3, Table 1 shows that the mask value of threads T1 and T2 is true, indicating that they execute the instruction segment corresponding to ELSE R3, which consists of the instructions "MUL R4, R1, R2" and "ADD R5, R4, R1"; the mask value of threads T0 and Tn is false, indicating that they do not execute the instruction segment corresponding to ELSE R3. Accordingly, while threads T1 and T2 execute the instruction segment containing "MUL R4, R1, R2" and "ADD R5, R4, R1", threads T0 and Tn may be viewed as executing NOP instructions.
Referring again to Table 1, for instruction types other than branch jump instructions, the mask value of every thread is true; that is, for other instruction types such as arithmetic logic instructions and memory access instructions, the SIMT method schedules multiple threads to execute them simultaneously.
From the explanation of the execution results shown in Table 1, it can be seen that, since each execution result contains the instruction execution control value mask of a thread for an instruction, and the mask value controls whether the thread executes that instruction, the instruction list can be fed into the performance analysis model of the GPU to be analyzed and the execution of each thread model can be controlled according to the execution results of Table 1. In this way the performance data of executing branch jump instructions can be obtained more accurately, which reduces the error rate of GPU performance statistics based on analytical modeling and provides more accurate performance data about the GPU.
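For illustration, the per-thread execution result [dest, src0, src1, mask] described above could be represented as follows; the class and field names are hypothetical and the example values are not taken from Table 1:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecutionResult:
    """One thread's execution result for one instruction, in the [dest, src0, src1, mask]
    form described above: dest and mask are always present, src0/src1 depend on the
    instruction type, and mask controls whether the thread executes the instruction."""
    dest: int                   # destination operand register value (the instruction's result)
    mask: bool                  # instruction execution control value
    src0: Optional[int] = None  # first source operand register value, if any
    src1: Optional[int] = None  # second source operand register value, if any

# Hypothetical entry: thread T1 does not execute the branch instruction "IF R3".
t1_if_r3 = ExecutionResult(dest=0, mask=False)
```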
For the technical solution shown in fig. 2, in some possible implementations, each thread simulator traverses each instruction in the instruction list, and executes a currently traversed instruction according to an instruction execution control value and an instruction type of each instruction during the traversal process to measure a time length for executing the currently traversed instruction, including:
for each thread simulator, judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
if the instruction execution control value indicates that the currently traversed instruction is not to be executed, taking the fixed duration of executing a NOP instruction as the duration of executing the currently traversed instruction;
if the instruction execution control value indicates that the currently traversed instruction is to be executed, determining the instruction type of the currently traversed instruction;
if the instruction type of the currently traversed instruction is a memory access instruction, measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction;
and if the instruction type of the currently traversed instruction is an arithmetic logic instruction, taking the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.
For the foregoing implementation manner, in some examples, the measuring a duration of executing the currently traversed instruction according to a manner of executing a memory access instruction includes:
and measuring the duration of executing the currently traversed instruction with a set Cache access analysis model, according to the access address corresponding to the operand register value in the execution result of the currently traversed instruction.
For the above implementation and its examples, in a specific implementation and in combination with the foregoing, the instruction execution control value mask may be true or false: if the thread's current mask value is true, the current instruction is executed; if it is false, the current instruction does not enter the execution pipeline, i.e., the currently traversed instruction is not executed.
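The patent does not specify the internals of the "set Cache access analysis model"; the following sketch assumes a simple direct-mapped cache with fixed hit and miss latencies, purely to illustrate how an access address could be turned into an execution duration:

```python
class SimpleCacheModel:
    """Assumed Cache access analysis model: a direct-mapped cache with fixed hit and
    miss latencies. The patent does not describe the model's internals; this class only
    shows how an access address could be mapped to an execution duration in cycles."""

    def __init__(self, num_lines: int = 256, line_size: int = 64,
                 hit_cycles: int = 4, miss_cycles: int = 200):
        self.num_lines = num_lines
        self.line_size = line_size
        self.hit_cycles = hit_cycles
        self.miss_cycles = miss_cycles
        self.tags = [None] * num_lines  # one tag per cache line

    def access_latency(self, address: int) -> int:
        """Return the modelled latency (in cycles) of one memory access instruction."""
        line = (address // self.line_size) % self.num_lines
        tag = address // (self.line_size * self.num_lines)
        if self.tags[line] == tag:
            return self.hit_cycles      # hit: the line is already cached
        self.tags[line] = tag           # miss: fill the line
        return self.miss_cycles
```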
For the above implementation, in some examples, the method further comprises:
for each thread simulator, judging, during the traversal, whether the currently traversed instruction is the end instruction of the instruction list:
if not, judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
and if so, determining that the thread simulator has finished traversing all the instructions in the instruction list.
For the above example, it should be noted that, since each thread simulator needs to execute the instructions in the instruction list sequentially, when traversing each instruction it should first be judged whether the currently traversed instruction is the end instruction of the instruction list, the end instruction marking the end of instruction execution for the list. If it is the end instruction, it is determined that the thread simulator has traversed all the instructions in the instruction list, and the process of acquiring the total execution duration for all the thread simulators to execute all the instructions in the instruction list, as in step S204, is then performed; if it is not, the currently traversed instruction is executed according to its instruction execution control value and instruction type to measure the duration of executing it, as described in the above implementation and examples.
For the solution shown in fig. 2, in some possible implementations, for each of the thread simulators, after the measuring the duration of executing the currently traversed instruction, the method further includes:
for each thread simulator, adding the time length for executing the currently traversed instruction to the total execution time length of the corresponding thread simulator; before traversing the first instruction in the instruction list, the starting value of the total execution time is 0.
For the above implementation, it should be noted that, for each thread simulator, the total execution duration may be initialized to 0 before the instructions in the instruction list are traversed; then, as the thread simulator traverses each instruction in the list, the duration of the traversed instruction is continuously added to the total execution duration; finally, after all the instructions in the instruction list have been traversed, the resulting total is the total execution duration for that thread simulator to execute all the instructions in the instruction list.
Summarizing the foregoing implementations and examples, and referring to fig. 4, the embodiment of the present invention takes one thread simulator as an example to show a specific process for implementing S203 and S204; as shown in fig. 4, the specific process may include:
S401: setting the initial value of the total execution duration of the thread simulator to 0;
S402: the thread simulator traverses the instruction list;
S403: judging whether the currently traversed instruction is the end instruction of the instruction list. If yes, go to S410: determine that the thread simulator has traversed all the instructions in the instruction list, and then execute S411: acquire the total execution duration for the thread simulator to execute all the instructions in the instruction list. If not, go to S404: judge whether the mask value in the execution result of the currently traversed instruction is true. If it is false, the currently traversed instruction is a branch jump instruction, or belongs to the instruction segment corresponding to a branch jump instruction, that this thread should not execute; the thread simulator is therefore controlled not to execute it, and the flow goes to S405: take the fixed duration of executing a NOP instruction as the duration of executing the currently traversed instruction. If it is true, the thread simulator is controlled to execute the currently traversed instruction, regardless of whether it is a branch jump instruction or part of the instruction segment of a branch jump instruction, and the flow goes to S406: determine the instruction type of the currently traversed instruction.
If the instruction type is a memory access instruction, S407 is executed: measure the duration of executing the currently traversed instruction in the manner of executing a memory access instruction; specifically, the execution duration of the current memory access instruction in this thread simulator is calculated through the Cache access analysis model according to the memory access address in the thread's execution result.
If the instruction type is an arithmetic logic instruction, since the execution duration of an arithmetic logic instruction is fixed, S408 is executed: take the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.
After S405, S407, or S408 has yielded the duration of executing the currently traversed instruction, the thread simulator executes S409: add that duration to the total execution duration of the corresponding thread simulator, and return to S402 so that the thread simulator traverses the next instruction in the instruction list, until the end instruction of the instruction list is reached.
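The per-thread flow of fig. 4 can be sketched in Python as follows; the constant latencies, the opcode names, and the assumption that the access address is carried in src0 are illustrative choices, not values given by the patent:

```python
NOP_CYCLES = 1   # assumed fixed duration of a NOP instruction
ALU_CYCLES = 4   # assumed fixed duration of an arithmetic logic instruction

def run_thread_simulator(instruction_list, execution_results, cache_model):
    """Sketch of the per-thread flow of fig. 4 (S401-S411): traverse the instruction list,
    choose a duration for each instruction from its mask value and instruction type, and
    accumulate the total execution duration for this thread simulator."""
    total_cycles = 0                                                   # S401: start value 0
    for opcode, result in zip(instruction_list, execution_results):    # S402: traverse
        if opcode == "END":                                            # S403: end instruction reached
            break                                                      # S410: traversal finished
        if not result.mask:                                            # S404: mask is false
            total_cycles += NOP_CYCLES                                 # S405: count a fixed NOP duration
        elif opcode in ("LOAD", "STORE"):                              # S406/S407: memory access instruction
            # assumed: the access address is carried by the src0 operand register value
            total_cycles += cache_model.access_latency(result.src0)
        else:                                                          # S406/S408: arithmetic logic instruction
            total_cycles += ALU_CYCLES
        # S409: the duration has been added to the running total; continue with the next instruction
    return total_cycles                                                # S411: total execution duration
```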
With the specific flow shown in fig. 4, the performance analysis uses the execution result of each thread for each instruction obtained in S201 and does not need to actually simulate the execution of the instructions; it finally obtains the total execution duration for the thread simulator to execute all the instructions in the instruction list, from which the IPC is calculated using the thread execution time and the main frequency of the system. In this way the performance data of executing branch jump instructions can be obtained more accurately, which reduces the error rate of GPU performance statistics based on analytical modeling and provides more accurate performance data about the GPU.
Based on the same inventive concept as the foregoing technical solutions, and referring to fig. 5, an apparatus 50 for analyzing GPU performance provided by an embodiment of the present invention is shown; the apparatus 50 includes an acquisition part 501, a simulation scheduler 502, thread simulators 503, and a statistics part 504; wherein,
the acquiring part 501 is configured to acquire an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread on each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
the simulation scheduler 502 is configured to start the thread simulator 503 in the GPU performance model to be analyzed according to the number of threads to be started;
each thread simulator 503 is configured to traverse each instruction in the instruction list, and execute the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction in the traversal process, so as to measure the time length for executing the currently traversed instruction;
the statistics part 504 is configured to obtain the total execution duration for all the thread simulators 503 to execute all the instructions in the instruction list when all the instructions in the instruction list have been traversed.
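A minimal sketch of how these four parts could be wired together is given below; it reuses the run_thread_simulator and SimpleCacheModel sketches above, and the use of max() to combine the per-thread durations into an overall duration is an assumption (the patent only states that a total execution duration is obtained):

```python
class GPUPerformanceAnalyzer:
    """Illustrative wiring of apparatus 50: the acquisition part supplies the instruction
    list and the per-thread execution results, the simulation scheduler starts one thread
    simulator per thread, and the statistics part combines the per-thread durations."""

    def __init__(self, cache_model):
        self.cache_model = cache_model

    def analyze(self, instruction_list, per_thread_results):
        # Simulation scheduler: one simulated thread per execution-result sequence.
        per_thread_cycles = [
            run_thread_simulator(instruction_list, results, self.cache_model)
            for results in per_thread_results
        ]
        # Statistics part: threads run in parallel, so the longest thread is taken as the
        # overall duration (an assumption; the patent only says a total duration is obtained).
        return max(per_thread_cycles)

# Hypothetical usage:
# analyzer = GPUPerformanceAnalyzer(SimpleCacheModel())
# total_cycles = analyzer.analyze(instruction_list, per_thread_results)
```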
In some examples, the obtaining part 501 is configured to run the target program through a real environment or a simulation environment, and obtain an instruction list of the target program, the number of threads to be started, and an execution result of each thread executing each instruction during running; wherein the execution result includes an operand register value and an instruction execution control value.
In some examples, each of the thread simulators 503 is configured to:
judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
if the instruction execution control value indicates that the currently traversed instruction is not to be executed, taking the fixed duration of executing a NOP instruction as the duration of executing the currently traversed instruction;
if the instruction execution control value indicates that the currently traversed instruction is to be executed, determining the instruction type of the currently traversed instruction;
if the instruction type of the currently traversed instruction is a memory access instruction, measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction;
and if the instruction type of the currently traversed instruction is an arithmetic logic instruction, taking the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.
In some examples, each of the thread simulators 503 is configured to:
and measuring the duration of executing the currently traversed instruction with a set Cache access analysis model, according to the access address corresponding to the operand register value in the execution result of the currently traversed instruction.
In some examples, each of the thread simulators 503 is configured to:
judging, during the traversal, whether the currently traversed instruction is the end instruction of the instruction list:
if not, judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
and if so, determining that the thread simulator 503 has finished traversing all the instructions in the instruction list.
In some examples, each of the thread simulators 503 is further configured to: after the time length for executing the currently traversed instruction is measured, adding the time length for executing the currently traversed instruction to the total execution time length of the corresponding thread simulator 503; before traversing the first instruction in the instruction list, the starting value of the total execution time is 0.
It is understood that in this embodiment a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a unit, and it may be modular or non-modular.
In addition, the components in this embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware or in the form of a software functional module.
Based on this understanding, the technical solution of this embodiment, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method of this embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Therefore, the present embodiment provides a computer storage medium, which stores a program for analyzing GPU performance, and when the program for analyzing GPU performance is executed by at least one processor, the method steps for analyzing GPU performance in the above technical solutions are implemented.
It should be understood that the exemplary technical solution of the apparatus 50 for analyzing GPU performance described above belongs to the same concept as the technical solution of the aforementioned method for analyzing GPU performance; therefore, for details of the technical solution of the apparatus 50 that are not described above, reference may be made to the description of the method for analyzing GPU performance. They are not repeated here.
It should be noted that: the technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (8)

1. A method for analyzing GPU performance, the method comprising:
acquiring an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
starting, through a simulation scheduler in a GPU performance model to be analyzed, thread simulators in the GPU performance model according to the number of threads to be started;
each thread simulator traversing each instruction in the instruction list and, during the traversal, executing the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction, so as to measure the duration of executing the currently traversed instruction;
when all the instructions in the instruction list have been traversed, acquiring the total execution duration for all the thread simulators to execute all the instructions in the instruction list;
wherein each thread simulator traversing each instruction in the instruction list and, during the traversal, executing the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction to measure the duration of executing the currently traversed instruction comprises:
for each thread simulator, judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
if the instruction execution control value indicates that the currently traversed instruction is not to be executed, taking the fixed duration of executing a NOP instruction as the duration of executing the currently traversed instruction;
if the instruction execution control value indicates that the currently traversed instruction is to be executed, determining the instruction type of the currently traversed instruction;
if the instruction type of the currently traversed instruction is a memory access instruction, measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction;
and if the instruction type of the currently traversed instruction is an arithmetic logic instruction, taking the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.
2. The method of claim 1, wherein acquiring the instruction list obtained by running the target program in the set environment, the number of threads to be started, and the execution result of each thread for each instruction in the instruction list comprises:
running the target program through a real environment or a simulation environment, and acquiring the instruction list of the target program, the number of threads to be started, and the execution result of each thread executing each instruction during the run; wherein the execution result includes an operand register value and an instruction execution control value.
3. The method of claim 1, wherein measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction comprises:
measuring the duration of executing the currently traversed instruction with a set Cache access analysis model, according to the access address corresponding to the operand register value in the execution result of the currently traversed instruction.
4. The method of claim 1, further comprising:
for each thread simulator, judging, during the traversal, whether the currently traversed instruction is the end instruction of the instruction list:
if not, judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
and if so, determining that the thread simulator has finished traversing all the instructions in the instruction list.
5. The method of claim 1, wherein, for each of the thread simulators, after measuring the duration of executing the currently traversed instruction, the method further comprises:
for each thread simulator, adding the duration of executing the currently traversed instruction to the total execution duration of the corresponding thread simulator; wherein, before the first instruction in the instruction list is traversed, the start value of the total execution duration is 0.
6. An apparatus for analyzing GPU performance, the apparatus comprising an acquisition part, a simulation scheduler, thread simulators, and a statistics part; wherein,
the acquisition part is configured to acquire an instruction list obtained by running a target program under a set environment, the number of threads to be started and an execution result of each thread on each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
the simulation scheduler is configured to start a thread simulator in a GPU performance model to be analyzed according to the number of threads to be started;
each thread simulator is configured to traverse each instruction in the instruction list, and execute a currently traversed instruction according to an instruction execution control value and an instruction type of each instruction in the traversing process so as to measure the time length for executing the currently traversed instruction;
the statistics part is configured to acquire the total execution duration for all the thread simulators to execute all the instructions in the instruction list when all the instructions in the instruction list have been traversed;
wherein each of the thread simulators is configured to:
judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
if the instruction execution control value indicates that the currently traversed instruction is not to be executed, taking the fixed duration of executing a NOP instruction as the duration of executing the currently traversed instruction;
if the instruction execution control value indicates that the currently traversed instruction is to be executed, determining the instruction type of the currently traversed instruction;
if the instruction type of the currently traversed instruction is a memory access instruction, measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction;
and if the instruction type of the currently traversed instruction is an arithmetic logic instruction, taking the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.
7. The apparatus according to claim 6, wherein the acquisition part is configured to run the target program through a real environment or a simulation environment, and to acquire the instruction list of the target program, the number of threads to be started, and the execution result of each thread executing each instruction during the run; wherein the execution result includes an operand register value and an instruction execution control value.
8. A computer storage medium storing a program for analyzing GPU performance, the program for analyzing GPU performance implementing the method steps of any of claims 1 to 5 when executed by at least one processor.
CN202210192669.XA 2022-03-01 2022-03-01 Method and device for analyzing GPU performance and computer storage medium Active CN114253821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210192669.XA CN114253821B (en) 2022-03-01 2022-03-01 Method and device for analyzing GPU performance and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210192669.XA CN114253821B (en) 2022-03-01 2022-03-01 Method and device for analyzing GPU performance and computer storage medium

Publications (2)

Publication Number Publication Date
CN114253821A CN114253821A (en) 2022-03-29
CN114253821B true CN114253821B (en) 2022-05-27

Family

ID=80800129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210192669.XA Active CN114253821B (en) 2022-03-01 2022-03-01 Method and device for analyzing GPU performance and computer storage medium

Country Status (1)

Country Link
CN (1) CN114253821B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117850882B (en) * 2024-03-07 2024-05-24 北京壁仞科技开发有限公司 Single instruction multithreading processing device and method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308191A (en) * 2017-07-28 2019-02-05 华为技术有限公司 Branch prediction method and device
CN110688153A (en) * 2019-09-04 2020-01-14 深圳芯英科技有限公司 Instruction branch execution control method, related equipment and instruction structure
WO2020181670A1 (en) * 2019-03-11 2020-09-17 Huawei Technologies Co., Ltd. Control flow optimization in graphics processing unit
CN112579164A (en) * 2020-12-05 2021-03-30 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN112579555A (en) * 2019-09-27 2021-03-30 中国科学院深圳先进技术研究院 Method and apparatus for characterizing a blockchain system at a microarchitectural level
CN113986702A (en) * 2021-10-21 2022-01-28 四川腾盾科技有限公司 Simulator accurate timing method and system based on instruction cycle and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019101543A (en) * 2017-11-29 2019-06-24 サンケン電気株式会社 Processor and pipeline processing method
CN109324987B (en) * 2018-09-27 2021-06-01 海信视像科技股份有限公司 Time sequence control method and device of analog communication interface and electronic equipment
CN112083954A (en) * 2019-06-13 2020-12-15 华夏芯(北京)通用处理器技术有限公司 Mask operation method of explicit independent mask register in GPU

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308191A (en) * 2017-07-28 2019-02-05 华为技术有限公司 Branch prediction method and device
WO2020181670A1 (en) * 2019-03-11 2020-09-17 Huawei Technologies Co., Ltd. Control flow optimization in graphics processing unit
CN110688153A (en) * 2019-09-04 2020-01-14 深圳芯英科技有限公司 Instruction branch execution control method, related equipment and instruction structure
CN112579555A (en) * 2019-09-27 2021-03-30 中国科学院深圳先进技术研究院 Method and apparatus for characterizing a blockchain system at a microarchitectural level
CN112579164A (en) * 2020-12-05 2021-03-30 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN113986702A (en) * 2021-10-21 2022-01-28 四川腾盾科技有限公司 Simulator accurate timing method and system based on instruction cycle and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Low power pipelined 8-bit RISC processor design and implementation on FPGA; Jikku Jeemon; ICCICCT; 20151231; full text *
Research and design of an Internet-of-Things node processor based on the RISC-V architecture; 张雷; China Excellent Master's Theses Full-text Database; 20210715; full text *

Also Published As

Publication number Publication date
CN114253821A (en) 2022-03-29

Similar Documents

Publication Publication Date Title
US8972785B2 (en) Classifying processor testcases
US7779393B1 (en) System and method for efficient verification of memory consistency model compliance
CN104375941B (en) Executable program test use cases binary code coverage rate automates appraisal procedure
US8612944B2 (en) Code evaluation for in-order processing
Taylor et al. A micro-benchmark suite for AMD GPUs
US8359291B2 (en) Architecture-aware field affinity estimation
WO2017114472A1 (en) Method and apparatus for data mining from core traces
CN114253821B (en) Method and device for analyzing GPU performance and computer storage medium
Wang et al. Accurate source-level simulation of embedded software with respect to compiler optimizations
JP5514211B2 (en) Simulating processor execution with branch override
US6212493B1 (en) Profile directed simulation used to target time-critical crossproducts during random vector testing
CN102520984B (en) Computing method for worst time of object software in specified hardware environment
US20110252408A1 (en) Performance optimization based on data accesses during critical sections
US7110934B2 (en) Analysis of the performance of a portion of a data processing system
CN107769987B (en) Message forwarding performance evaluation method and device
Estrin et al. Modeling, measurement and computer power
Razouk The use of Petri Nets for modeling pipelined processors
CN116149917A (en) Method and apparatus for evaluating processor performance, computing device, and readable storage medium
US7885806B2 (en) Simulation method and simulation system of instruction scheduling
CN116521231A (en) Reference model for SPARC V8 instruction set dynamic simulation verification
Ikram et al. Measuring power and energy consumption of programs running on kepler GPUs
JP3688479B2 (en) Fault simulation apparatus, fault simulation method, and computer-readable recording medium recording the simulation program
US7302556B2 (en) Method, apparatus and computer program product for implementing level bias function for branch prediction control for generating test simulation vectors
JP5937530B2 (en) Software error analysis device, error information creation device
Helm et al. Measurement of Main Memory Bandwidth and Memory Access Latency in Intel Processors

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 301, Building D, Yeda Science and Technology Park, No. 300 Changjiang Road, Yantai Area, China (Shandong) Pilot Free Trade Zone, Yantai City, Shandong Province, 265503

Patentee after: Xi'an Xintong Semiconductor Technology Co.,Ltd.

Address before: 710065 Room 301, block B, Taiwei intelligent chain center, No. 8, Tangyan South Road, high tech Zone, Xi'an, Shaanxi Province

Patentee before: Xi'an Xintong Semiconductor Technology Co.,Ltd.

CP03 Change of name, title or address