CN114253821B - Method and device for analyzing GPU performance and computer storage medium - Google Patents

Method and device for analyzing GPU performance and computer storage medium

Info

Publication number
CN114253821B
CN114253821B
Authority
CN
China
Prior art keywords
instruction
thread
execution
traversed
executing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210192669.XA
Other languages
Chinese (zh)
Other versions
CN114253821A (en)
Inventor
齐航空
张竞丹
李亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Xintong Semiconductor Technology Co ltd
Original Assignee
Xi'an Xintong Semiconductor Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Xintong Semiconductor Technology Co ltd filed Critical Xi'an Xintong Semiconductor Technology Co ltd
Priority to CN202210192669.XA priority Critical patent/CN114253821B/en
Publication of CN114253821A publication Critical patent/CN114253821A/en
Application granted granted Critical
Publication of CN114253821B publication Critical patent/CN114253821B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/34Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3457Performance evaluation by simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the invention discloses a method and a device for analyzing GPU performance and a computer storage medium. The method may comprise the following steps: acquiring an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; starting, through a simulation scheduler in a GPU performance model to be analyzed, thread simulators in the GPU performance model according to the number of threads to be started; each thread simulator traversing each instruction in the instruction list and, during the traversal, executing each instruction according to its instruction execution control value, so as to measure the duration of executing the traversed instruction; and, when all the instructions in the instruction list have been traversed, acquiring the total execution duration for all the thread simulators to execute all the instructions in the instruction list.

Description

Method and device for analyzing GPU performance and computer storage medium
Technical Field
The embodiment of the invention relates to the technical field of Graphics Processing Units (GPUs), and in particular to a method and a device for analyzing the performance of a GPU and a computer storage medium.
Background
In GPU performance statistics, Instructions Per Cycle (IPC) is a relatively important GPU performance index; it represents how many instructions in total the GPU can process in each clock cycle. In general, IPC can be calculated from the thread execution time and the main (clock) frequency of the system.
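To make that relation concrete, the following minimal Python sketch computes IPC from a measured execution time and the system clock frequency; the function name and the example numbers are illustrative only and are not taken from the patent:

```python
def instructions_per_cycle(instruction_count: int,
                           execution_time_s: float,
                           clock_frequency_hz: float) -> float:
    """Illustrative IPC calculation: elapsed cycles = execution time x clock frequency,
    and IPC = total instructions / elapsed cycles."""
    cycles = execution_time_s * clock_frequency_hz
    return instruction_count / cycles

# Example (made-up numbers): 2,000,000 instructions in 1 ms at 1 GHz -> IPC = 2.0
print(instructions_per_cycle(2_000_000, 1e-3, 1e9))
```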
GPU performance statistics typically requires modeling the GPU's performance. Two approaches are commonly employed: one is simulation modeling, in which, for example, a simulation model of the GPU is constructed in software and a real execution process is carried out on that model to obtain real performance data of the GPU; the other is analytical modeling, in which, for example, a mapping function (also referred to as an analysis model) is constructed to analyze the inputs of the GPU and compute the corresponding performance result.
Simulation modeling can faithfully reproduce the hardware execution process and yield real simulation data; however, because the simulation model must simulate the execution of a real GPU, its running efficiency is low, and if the GPU architecture is adjusted, the simulation model has to be rebuilt for the adjusted architecture. GPU performance statistics based on simulation modeling therefore suffers from relatively poor scalability and a long development cycle. Analytical modeling, in contrast, does not need to simulate the real execution of instructions: performance results are obtained simply by running the modeling analysis over the input instruction information, so GPU performance statistics based on analytical modeling has very high running efficiency, a simple structural design, and strong scalability. However, if the analysis model does not process the input instructions finely enough, the error rate of the resulting GPU performance statistics is large.
Disclosure of Invention
In view of this, embodiments of the present invention are directed to a method, an apparatus, and a computer storage medium for analyzing GPU performance, which can reduce the error rate of GPU performance statistics based on analytical modeling and provide more accurate performance data about the GPU.
The technical scheme of the embodiment of the invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a method for analyzing GPU performance, where the method includes:
acquiring an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
starting, through a simulation scheduler in a GPU performance model to be analyzed, thread simulators in the GPU performance model according to the number of threads to be started;
each thread simulator traversing each instruction in the instruction list and, during the traversal, executing each instruction according to its instruction execution control value, so as to measure the duration of executing the traversed instruction;
and, when all the instructions in the instruction list have been traversed, acquiring the total execution duration for all the thread simulators to execute all the instructions in the instruction list.
In a second aspect, an embodiment of the present invention provides an apparatus for analyzing GPU performance, where the apparatus includes an acquisition part, a simulation scheduler, thread simulators, and a statistics part; wherein,
the acquisition part is configured to acquire an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
the simulation scheduler is configured to start the thread simulators in the GPU performance model to be analyzed according to the number of threads to be started;
each thread simulator is configured to traverse each instruction in the instruction list and, during the traversal, execute the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction, so as to measure the duration of executing the currently traversed instruction;
the statistics part is configured to acquire the total execution duration for all the thread simulators to execute all the instructions in the instruction list when all the instructions in the instruction list have been traversed.
In a third aspect, embodiments of the present invention provide a computer storage medium storing a program for analyzing GPU performance, where the program for analyzing GPU performance implements the method steps for analyzing GPU performance of the first aspect when executed by at least one processor.
Embodiments of the invention provide a method and an apparatus for analyzing GPU performance and a computer storage medium. During performance analysis, the execution duration of each type of instruction in the instruction list of the target program is measured for each simulated thread in the GPU performance model, and the total execution duration is accumulated; the simulation therefore covers the processing of all current GPU instruction types, which reduces the error rate of GPU performance statistics based on analytical modeling and provides more accurate performance data about the GPU.
Drawings
FIG. 1 is a diagram illustrating sequential instructions executed in multiple threads in SIMT according to one embodiment of the present invention.
Fig. 2 is a schematic flowchart of a method for analyzing GPU performance according to an embodiment of the present invention.
Fig. 3 is a flowchart illustrating obtaining an instruction list of a target program and an execution result according to an embodiment of the present invention.
Fig. 4 is a flowchart illustrating a specific implementation of measuring, by each thread simulator, a time length for executing a traversed instruction in the process of traversing an instruction list and obtaining a total execution time length for all the thread simulators to execute all the instructions in the instruction list according to the embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an apparatus for analyzing GPU performance according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Currently, the instructions a GPU can process fall broadly into three types: arithmetic logic instructions, memory access instructions, and branch jump instructions. Conventional GPU performance statistics schemes based on analytical modeling cover arithmetic logic instructions and memory access instructions, and can accurately calculate the execution duration of each such instruction. Branch jump instructions, however, are not handled by these conventional schemes, even though they have a great influence on instruction execution performance. Embodiments of the present invention therefore aim to provide a scheme for analyzing GPU performance that can accurately capture how each thread executes branch jump instructions, so that the performance of GPU instruction execution can be analyzed more accurately than with conventional schemes.
It should be noted that the GPU executes instructions in a Single-Instruction Multiple-Thread (SIMT) manner: each instruction is fetched once and multiple threads are then scheduled to execute it simultaneously. Under SIMT, if a branch jump instruction is encountered during execution, different branch paths are entered depending on the data processed by each thread. In general, when a GPU executes a branch jump instruction it continues to fetch instructions sequentially, while each thread controls whether its current instruction needs to be executed through a mask bit: if the thread's current mask bit is true, the current instruction is executed; if the mask bit is false, the current instruction does not enter the execution pipeline. This is how the processing and execution of branch jump instructions is controlled in SIMT. Taking fig. 1 as an example, the GPU is configured with n+1 threads, identified as Thread0, Thread1, Thread2, ..., Threadn. The instruction sequence, shown on the left of fig. 1, is: ADD, SUB, IF A, ADD, SUB, ELSE, MUL, DIV, ENDIF, ADD, and SUB. The branch jump instructions are therefore IF A and ELSE. When a branch jump instruction occurs while this sequence is executed in SIMT, the GPU schedules different threads to execute different instruction segments: in fig. 1, the mask bit of threads Thread1 and Thread2 is false while the instruction segment of the branch jump instruction IF A is executed, so that segment does not enter Thread1 and Thread2 for execution (indicated by the dashed box in fig. 1); likewise, the mask bit of threads Thread0 and Threadn is false while the instruction segment of the branch jump instruction ELSE is executed, so that segment does not enter Thread0 and Threadn for execution (also indicated by a dashed box in fig. 1).
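As an illustration of this mask-bit mechanism, the following minimal Python sketch replays the instruction sequence of fig. 1 for a few threads. The mask table, the thread count, and the helper names are assumptions made purely for the example and are not taken from the patent:

```python
# Illustrative SIMT replay of the fig. 1 instruction sequence (not the patent's implementation).
# A false mask bit means the thread skips the instruction, i.e. it behaves like a NOP.
instructions = ["ADD", "SUB", "IF A", "ADD", "SUB", "ELSE", "MUL", "DIV", "ENDIF", "ADD", "SUB"]

IF_BODY = range(3, 5)    # the ADD/SUB segment guarded by "IF A"
ELSE_BODY = range(6, 8)  # the MUL/DIV segment guarded by "ELSE"

def mask_for(thread_id: int, pc: int) -> bool:
    """Assumed mask table: threads 1 and 2 skip the IF body, the other threads skip the ELSE body."""
    if pc in IF_BODY:
        return thread_id not in (1, 2)
    if pc in ELSE_BODY:
        return thread_id in (1, 2)
    return True

num_threads = 4  # stands in for the n+1 threads of fig. 1
for pc, opcode in enumerate(instructions):   # instructions are fetched sequentially
    executing = [t for t in range(num_threads) if mask_for(t, pc)]
    print(f"{pc:2d} {opcode:5s} executed by threads {executing}")
```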
In conjunction with the above description of sequential instruction fetching when a GPU executes instructions in the SIMT manner and of using a mask bit to control whether a thread executes the current instruction, fig. 2 shows a method for analyzing GPU performance according to an embodiment of the present invention; the method may include:
S201: acquiring an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
S202: starting, through a simulation scheduler in a GPU performance model to be analyzed, thread simulators in the GPU performance model according to the number of threads to be started;
S203: each thread simulator traversing each instruction in the instruction list and, during the traversal, executing the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction, so as to measure the duration of executing the currently traversed instruction;
S204: when all the instructions in the instruction list have been traversed, acquiring the total execution duration for all the thread simulators to execute all the instructions in the instruction list.
With the technical solution shown in fig. 2, the target program is first run in the set environment to obtain the instruction list and the execution result of each thread for each instruction in the list; then, during performance analysis, the execution duration of each type of instruction in the instruction list is measured for each simulated thread in the GPU performance model and the total execution duration is accumulated. The simulation therefore covers the processing of all current GPU instruction types, which reduces the error rate of GPU performance statistics based on analytical modeling and provides more accurate performance data about the GPU. Moreover, the architecture of the GPU to be analyzed can be adjusted conveniently, so that the performance of the same instructions can be analyzed under different GPU architectures.
Based on the technical solution shown in fig. 2, in some possible implementation manners, the obtaining an instruction list obtained by running the target program in a set environment, the number of threads to be started, and an execution result of each thread on each instruction in the instruction list includes:
running the target program through a real environment or a simulation environment, and acquiring an instruction list of the target program, the number of threads needing to be started and an execution result of executing each instruction by each thread in the running process; wherein the execution result includes an operand register value and an instruction execution control value.
For the above implementation, in detail, the target program may, for example, be an application program selected for GPU performance analysis. As shown in fig. 3, the application program may be run in an existing real running environment or in a simulation running environment; the instruction code is detected, and the instruction list of the application program is obtained from the detected instruction code. In addition, while the application program is run in the SIMT manner, the number of threads enabled by the real or simulation running environment for executing the instruction list can be tracked, along with the execution result of each thread for each instruction in the instruction list. The embodiment of the present invention preferably represents each execution result as a list of the form [dest, src0, src1, mask], where dest, src0, and src1 are operand register values: dest is the destination operand register value, i.e., the result of executing the instruction; src0 is the first source operand register value; and src1 is the second source operand register value; mask is the instruction execution control value. As described above, the mask value controls whether a thread needs to execute the current instruction: if the thread's current mask value is true, the current instruction is executed; if the mask value is false, the current instruction does not enter the execution pipeline. Furthermore, depending on the instruction type, not all of dest, src0, and src1 need to be present in an execution result, but dest and the mask value appear in every execution result. For example, referring to Table 1, the number of enabled threads is set to n+1, identified as T0, T1, T2, ..., Tn; for each instruction of the instruction list shown in the first column, each thread has a corresponding execution result, with the specific contents shown in Table 1.
TABLE 1
[Table 1 is reproduced as an image in the original publication; it lists each instruction of the instruction list in the first column and, for each of the threads T0, T1, T2, ..., Tn, the corresponding execution result [dest, src0, src1, mask].]
As can be seen from Table 1, IF R3 and ELSE R3 in the instruction list are branch jump instructions, each followed by a corresponding instruction segment. For IF R3, Table 1 shows that the mask value of threads T0 and Tn is true, indicating that they execute the instruction segment corresponding to IF R3, which consists of the instructions "ADD R4, R1, R2" and "SUB R5, R4, R1"; the mask value of threads T1 and T2 is false, indicating that they do not execute the instruction segment corresponding to IF R3. Accordingly, while threads T0 and Tn execute the instruction segment containing "ADD R4, R1, R2" and "SUB R5, R4, R1", threads T1 and T2 may be viewed as executing NOP instructions.
For ELSE R3, Table 1 shows that the mask value of threads T1 and T2 is true, indicating that they execute the instruction segment corresponding to ELSE R3, which consists of the instructions "MUL R4, R1, R2" and "ADD R5, R4, R1"; the mask value of threads T0 and Tn is false, indicating that they do not execute the instruction segment corresponding to ELSE R3. Accordingly, while threads T1 and T2 execute the instruction segment containing "MUL R4, R1, R2" and "ADD R5, R4, R1", threads T0 and Tn may be viewed as executing NOP instructions.
Referring again to Table 1, for instruction types other than branch jump instructions, the mask value of every thread is true; that is, for other instruction types such as arithmetic logic instructions and memory access instructions, the SIMT method schedules multiple threads to execute them simultaneously.
From the explanation of the execution results shown in Table 1, it can be seen that, since each execution result contains the instruction execution control value mask of a thread for an instruction, and the mask value controls whether the thread executes that instruction, the instruction list can be fed into the performance analysis model of the GPU to be analyzed and the execution of each thread model can be controlled according to the execution results of Table 1. In this way the performance data of executing branch jump instructions can be obtained more accurately, which reduces the error rate of GPU performance statistics based on analytical modeling and provides more accurate performance data about the GPU.
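For illustration, the per-thread execution result [dest, src0, src1, mask] described above could be represented as follows; the class and field names are hypothetical and the example values are not taken from Table 1:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExecutionResult:
    """One thread's execution result for one instruction, in the [dest, src0, src1, mask]
    form described above: dest and mask are always present, src0/src1 depend on the
    instruction type, and mask controls whether the thread executes the instruction."""
    dest: int                   # destination operand register value (the instruction's result)
    mask: bool                  # instruction execution control value
    src0: Optional[int] = None  # first source operand register value, if any
    src1: Optional[int] = None  # second source operand register value, if any

# Hypothetical entry: thread T1 does not execute the branch instruction "IF R3".
t1_if_r3 = ExecutionResult(dest=0, mask=False)
```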
For the technical solution shown in fig. 2, in some possible implementations, each thread simulator traverses each instruction in the instruction list, and executes a currently traversed instruction according to an instruction execution control value and an instruction type of each instruction during the traversal process to measure a time length for executing the currently traversed instruction, including:
for each thread simulator, judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
if the instruction execution control value indicates that the currently traversed instruction is not to be executed, taking the fixed duration of executing a NOP instruction as the duration of executing the currently traversed instruction;
if the instruction execution control value indicates that the currently traversed instruction is to be executed, determining the instruction type of the currently traversed instruction;
if the instruction type of the currently traversed instruction is a memory access instruction, measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction;
and if the instruction type of the currently traversed instruction is an arithmetic logic instruction, taking the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.
For the foregoing implementation manner, in some examples, the measuring a duration of executing the currently traversed instruction according to a manner of executing a memory access instruction includes:
and measuring the duration of executing the currently traversed instruction with a set Cache access analysis model, according to the access address corresponding to the operand register value in the execution result of the currently traversed instruction.
For the above implementation and its examples, in a specific implementation and in combination with the foregoing, the instruction execution control value mask may be true or false: if the thread's current mask value is true, the current instruction is executed; if it is false, the current instruction does not enter the execution pipeline, i.e., the currently traversed instruction is not executed.
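The patent does not specify the internals of the "set Cache access analysis model"; the following sketch assumes a simple direct-mapped cache with fixed hit and miss latencies, purely to illustrate how an access address could be turned into an execution duration:

```python
class SimpleCacheModel:
    """Assumed Cache access analysis model: a direct-mapped cache with fixed hit and
    miss latencies. The patent does not describe the model's internals; this class only
    shows how an access address could be mapped to an execution duration in cycles."""

    def __init__(self, num_lines: int = 256, line_size: int = 64,
                 hit_cycles: int = 4, miss_cycles: int = 200):
        self.num_lines = num_lines
        self.line_size = line_size
        self.hit_cycles = hit_cycles
        self.miss_cycles = miss_cycles
        self.tags = [None] * num_lines  # one tag per cache line

    def access_latency(self, address: int) -> int:
        """Return the modelled latency (in cycles) of one memory access instruction."""
        line = (address // self.line_size) % self.num_lines
        tag = address // (self.line_size * self.num_lines)
        if self.tags[line] == tag:
            return self.hit_cycles      # hit: the line is already cached
        self.tags[line] = tag           # miss: fill the line
        return self.miss_cycles
```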
For the above implementation, in some examples, the method further comprises:
for each thread simulator, judging, during the traversal, whether the currently traversed instruction is the end instruction of the instruction list:
if not, judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
and if so, determining that the thread simulator has finished traversing all the instructions in the instruction list.
For the above example, it should be noted that, since each thread simulator needs to execute the instructions in the instruction list sequentially, when traversing each instruction it should first be judged whether the currently traversed instruction is the end instruction of the instruction list, the end instruction marking the end of instruction execution for the list. If it is the end instruction, it is determined that the thread simulator has traversed all the instructions in the instruction list, and the process of acquiring the total execution duration for all the thread simulators to execute all the instructions in the instruction list, as in step S204, is then performed; if it is not, the currently traversed instruction is executed according to its instruction execution control value and instruction type to measure the duration of executing it, as described in the above implementation and examples.
For the solution shown in fig. 2, in some possible implementations, for each of the thread simulators, after the measuring the duration of executing the currently traversed instruction, the method further includes:
for each thread simulator, adding the time length for executing the currently traversed instruction to the total execution time length of the corresponding thread simulator; before traversing the first instruction in the instruction list, the starting value of the total execution time is 0.
For the above implementation, it should be noted that, for each thread simulator, the total execution duration may be initialized to 0 before the instructions in the instruction list are traversed; then, as the thread simulator traverses each instruction in the list, the duration of the traversed instruction is continuously added to the total execution duration; finally, after all the instructions in the instruction list have been traversed, the resulting total is the total execution duration for that thread simulator to execute all the instructions in the instruction list.
Summarizing the foregoing implementations and examples, and referring to fig. 4, the embodiment of the present invention takes one thread simulator as an example to show a specific process for implementing S203 and S204; as shown in fig. 4, the specific process may include:
S401: setting the initial value of the total execution duration of the thread simulator to 0;
S402: the thread simulator traverses the instruction list;
S403: judging whether the currently traversed instruction is the end instruction of the instruction list. If yes, go to S410: determine that the thread simulator has traversed all the instructions in the instruction list, and then execute S411: acquire the total execution duration for the thread simulator to execute all the instructions in the instruction list. If not, go to S404: judge whether the mask value in the execution result of the currently traversed instruction is true. If it is false, the currently traversed instruction is a branch jump instruction, or belongs to the instruction segment corresponding to a branch jump instruction, that this thread should not execute; the thread simulator is therefore controlled not to execute it, and the flow goes to S405: take the fixed duration of executing a NOP instruction as the duration of executing the currently traversed instruction. If it is true, the thread simulator is controlled to execute the currently traversed instruction, regardless of whether it is a branch jump instruction or part of the instruction segment of a branch jump instruction, and the flow goes to S406: determine the instruction type of the currently traversed instruction.
If the instruction type is a memory access instruction, S407 is executed: measure the duration of executing the currently traversed instruction in the manner of executing a memory access instruction; specifically, the execution duration of the current memory access instruction in this thread simulator is calculated through the Cache access analysis model according to the memory access address in the thread's execution result.
If the instruction type is an arithmetic logic instruction, since the execution duration of an arithmetic logic instruction is fixed, S408 is executed: take the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.
After S405, S407, or S408 has yielded the duration of executing the currently traversed instruction, the thread simulator executes S409: add that duration to the total execution duration of the corresponding thread simulator, and return to S402 so that the thread simulator traverses the next instruction in the instruction list, until the end instruction of the instruction list is reached.
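The per-thread flow of fig. 4 can be sketched in Python as follows; the constant latencies, the opcode names, and the assumption that the access address is carried in src0 are illustrative choices, not values given by the patent:

```python
NOP_CYCLES = 1   # assumed fixed duration of a NOP instruction
ALU_CYCLES = 4   # assumed fixed duration of an arithmetic logic instruction

def run_thread_simulator(instruction_list, execution_results, cache_model):
    """Sketch of the per-thread flow of fig. 4 (S401-S411): traverse the instruction list,
    choose a duration for each instruction from its mask value and instruction type, and
    accumulate the total execution duration for this thread simulator."""
    total_cycles = 0                                                   # S401: start value 0
    for opcode, result in zip(instruction_list, execution_results):    # S402: traverse
        if opcode == "END":                                            # S403: end instruction reached
            break                                                      # S410: traversal finished
        if not result.mask:                                            # S404: mask is false
            total_cycles += NOP_CYCLES                                 # S405: count a fixed NOP duration
        elif opcode in ("LOAD", "STORE"):                              # S406/S407: memory access instruction
            # assumed: the access address is carried by the src0 operand register value
            total_cycles += cache_model.access_latency(result.src0)
        else:                                                          # S406/S408: arithmetic logic instruction
            total_cycles += ALU_CYCLES
        # S409: the duration has been added to the running total; continue with the next instruction
    return total_cycles                                                # S411: total execution duration
```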
With the specific flow shown in fig. 4, the performance analysis uses the execution result of each thread for each instruction obtained in S201 and does not need to actually simulate the execution of the instructions; it finally obtains the total execution duration for the thread simulator to execute all the instructions in the instruction list, from which the IPC is calculated using the thread execution time and the main frequency of the system. In this way the performance data of executing branch jump instructions can be obtained more accurately, which reduces the error rate of GPU performance statistics based on analytical modeling and provides more accurate performance data about the GPU.
Based on the same inventive concept as the foregoing technical solutions, and referring to fig. 5, an apparatus 50 for analyzing GPU performance provided by an embodiment of the present invention is shown; the apparatus 50 includes an acquisition part 501, a simulation scheduler 502, thread simulators 503, and a statistics part 504; wherein,
the acquiring part 501 is configured to acquire an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread on each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
the simulation scheduler 502 is configured to start the thread simulator 503 in the GPU performance model to be analyzed according to the number of threads to be started;
each thread simulator 503 is configured to traverse each instruction in the instruction list, and execute the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction in the traversal process, so as to measure the time length for executing the currently traversed instruction;
the statistics part 504 is configured to obtain the total execution duration for all the thread simulators 503 to execute all the instructions in the instruction list when all the instructions in the instruction list have been traversed.
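A minimal sketch of how these four parts could be wired together is given below; it reuses the run_thread_simulator and SimpleCacheModel sketches above, and the use of max() to combine the per-thread durations into an overall duration is an assumption (the patent only states that a total execution duration is obtained):

```python
class GPUPerformanceAnalyzer:
    """Illustrative wiring of apparatus 50: the acquisition part supplies the instruction
    list and the per-thread execution results, the simulation scheduler starts one thread
    simulator per thread, and the statistics part combines the per-thread durations."""

    def __init__(self, cache_model):
        self.cache_model = cache_model

    def analyze(self, instruction_list, per_thread_results):
        # Simulation scheduler: one simulated thread per execution-result sequence.
        per_thread_cycles = [
            run_thread_simulator(instruction_list, results, self.cache_model)
            for results in per_thread_results
        ]
        # Statistics part: threads run in parallel, so the longest thread is taken as the
        # overall duration (an assumption; the patent only says a total duration is obtained).
        return max(per_thread_cycles)

# Hypothetical usage:
# analyzer = GPUPerformanceAnalyzer(SimpleCacheModel())
# total_cycles = analyzer.analyze(instruction_list, per_thread_results)
```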
In some examples, the obtaining part 501 is configured to run the target program through a real environment or a simulation environment, and obtain an instruction list of the target program, the number of threads to be started, and an execution result of each thread executing each instruction during running; wherein the execution result includes an operand register value and an instruction execution control value.
In some examples, each of the thread simulators 503 is configured to:
judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
if the instruction execution control value indicates that the currently traversed instruction is not to be executed, taking the fixed duration of executing a NOP instruction as the duration of executing the currently traversed instruction;
if the instruction execution control value indicates that the currently traversed instruction is to be executed, determining the instruction type of the currently traversed instruction;
if the instruction type of the currently traversed instruction is a memory access instruction, measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction;
and if the instruction type of the currently traversed instruction is an arithmetic logic instruction, taking the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.
In some examples, each of the thread simulators 503 is configured to:
and measuring the duration of executing the currently traversed instruction with a set Cache access analysis model, according to the access address corresponding to the operand register value in the execution result of the currently traversed instruction.
In some examples, each of the thread simulators 503 is configured to:
judging, during the traversal, whether the currently traversed instruction is the end instruction of the instruction list:
if not, judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
and if so, determining that the thread simulator 503 has finished traversing all the instructions in the instruction list.
In some examples, each of the thread simulators 503 is further configured to: after the time length for executing the currently traversed instruction is measured, adding the time length for executing the currently traversed instruction to the total execution time length of the corresponding thread simulator 503; before traversing the first instruction in the instruction list, the starting value of the total execution time is 0.
It is understood that in this embodiment a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a unit, and it may be modular or non-modular.
In addition, the components in this embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware or in the form of a software functional module.
Based on this understanding, the technical solution of this embodiment, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method of this embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Therefore, the present embodiment provides a computer storage medium, which stores a program for analyzing GPU performance, and when the program for analyzing GPU performance is executed by at least one processor, the method steps for analyzing GPU performance in the above technical solutions are implemented.
It should be understood that the exemplary technical solution of the apparatus 50 for analyzing GPU performance described above belongs to the same concept as the technical solution of the aforementioned method for analyzing GPU performance; therefore, for details of the technical solution of the apparatus 50 that are not described above, reference may be made to the description of the method for analyzing GPU performance. They are not repeated here.
It should be noted that: the technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (8)

1. A method for analyzing GPU performance, the method comprising:
acquiring an instruction list obtained by running a target program in a set environment, the number of threads to be started, and an execution result of each thread for each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
starting, through a simulation scheduler in a GPU performance model to be analyzed, thread simulators in the GPU performance model according to the number of threads to be started;
each thread simulator traversing each instruction in the instruction list and, during the traversal, executing the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction, so as to measure the duration of executing the currently traversed instruction;
when all the instructions in the instruction list have been traversed, acquiring the total execution duration for all the thread simulators to execute all the instructions in the instruction list;
wherein each thread simulator traversing each instruction in the instruction list and, during the traversal, executing the currently traversed instruction according to the instruction execution control value and the instruction type of each instruction to measure the duration of executing the currently traversed instruction comprises:
for each thread simulator, judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
if the instruction execution control value indicates that the currently traversed instruction is not to be executed, taking the fixed duration of executing a NOP instruction as the duration of executing the currently traversed instruction;
if the instruction execution control value indicates that the currently traversed instruction is to be executed, determining the instruction type of the currently traversed instruction;
if the instruction type of the currently traversed instruction is a memory access instruction, measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction;
and if the instruction type of the currently traversed instruction is an arithmetic logic instruction, taking the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.
2. The method of claim 1, wherein acquiring the instruction list obtained by running the target program in the set environment, the number of threads to be started, and the execution result of each thread for each instruction in the instruction list comprises:
running the target program through a real environment or a simulation environment, and acquiring the instruction list of the target program, the number of threads to be started, and the execution result of each thread executing each instruction during the run; wherein the execution result includes an operand register value and an instruction execution control value.
3. The method of claim 1, wherein measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction comprises:
measuring the duration of executing the currently traversed instruction with a set Cache access analysis model, according to the access address corresponding to the operand register value in the execution result of the currently traversed instruction.
4. The method of claim 1, further comprising:
for each thread simulator, judging, during the traversal, whether the currently traversed instruction is the end instruction of the instruction list:
if not, judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
and if so, determining that the thread simulator has finished traversing all the instructions in the instruction list.
5. The method of claim 1, wherein, for each of the thread simulators, after measuring the duration of executing the currently traversed instruction, the method further comprises:
for each thread simulator, adding the duration of executing the currently traversed instruction to the total execution duration of the corresponding thread simulator; wherein, before the first instruction in the instruction list is traversed, the start value of the total execution duration is 0.
6. An apparatus for analyzing GPU performance, the apparatus comprising an acquisition part, a simulation scheduler, thread simulators, and a statistics part; wherein,
the acquisition part is configured to acquire an instruction list obtained by running a target program under a set environment, the number of threads to be started and an execution result of each thread on each instruction in the instruction list; wherein the execution result comprises an instruction execution control value for each instruction;
the simulation scheduler is configured to start a thread simulator in a GPU performance model to be analyzed according to the number of threads to be started;
each thread simulator is configured to traverse each instruction in the instruction list, and execute a currently traversed instruction according to an instruction execution control value and an instruction type of each instruction in the traversing process so as to measure the time length for executing the currently traversed instruction;
the statistics part is configured to acquire the total execution duration for all the thread simulators to execute all the instructions in the instruction list when all the instructions in the instruction list have been traversed;
wherein each of the thread simulators is configured to:
judging whether the instruction execution control value in the execution result of the currently traversed instruction indicates that the currently traversed instruction is to be executed;
if the instruction execution control value indicates that the currently traversed instruction is not to be executed, taking the fixed duration of executing a NOP instruction as the duration of executing the currently traversed instruction;
if the instruction execution control value indicates that the currently traversed instruction is to be executed, determining the instruction type of the currently traversed instruction;
if the instruction type of the currently traversed instruction is a memory access instruction, measuring the duration of executing the currently traversed instruction in the manner of executing a memory access instruction;
and if the instruction type of the currently traversed instruction is an arithmetic logic instruction, taking the fixed duration of executing an arithmetic logic instruction as the duration of executing the currently traversed instruction.
7. The apparatus according to claim 6, wherein the acquisition part is configured to run the target program through a real environment or a simulation environment, and to acquire the instruction list of the target program, the number of threads to be started, and the execution result of each thread executing each instruction during the run; wherein the execution result includes an operand register value and an instruction execution control value.
8. A computer storage medium storing a program for analyzing GPU performance, the program for analyzing GPU performance implementing the method steps of any of claims 1 to 5 when executed by at least one processor.
CN202210192669.XA 2022-03-01 2022-03-01 Method and device for analyzing GPU performance and computer storage medium Active CN114253821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210192669.XA CN114253821B (en) 2022-03-01 2022-03-01 Method and device for analyzing GPU performance and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210192669.XA CN114253821B (en) 2022-03-01 2022-03-01 Method and device for analyzing GPU performance and computer storage medium

Publications (2)

Publication Number Publication Date
CN114253821A CN114253821A (en) 2022-03-29
CN114253821B true CN114253821B (en) 2022-05-27

Family

ID=80800129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210192669.XA Active CN114253821B (en) 2022-03-01 2022-03-01 Method and device for analyzing GPU performance and computer storage medium

Country Status (1)

Country Link
CN (1) CN114253821B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117850882B (en) * 2024-03-07 2024-05-24 北京壁仞科技开发有限公司 Single instruction multithreading processing device and method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308191A (en) * 2017-07-28 2019-02-05 华为技术有限公司 Branch prediction method and device
CN110688153A (en) * 2019-09-04 2020-01-14 深圳芯英科技有限公司 Instruction branch execution control method, related equipment and instruction structure
WO2020181670A1 (en) * 2019-03-11 2020-09-17 Huawei Technologies Co., Ltd. Control flow optimization in graphics processing unit
CN112579164A (en) * 2020-12-05 2021-03-30 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN112579555A (en) * 2019-09-27 2021-03-30 中国科学院深圳先进技术研究院 Method and apparatus for characterizing a blockchain system at a microarchitectural level
CN113986702A (en) * 2021-10-21 2022-01-28 四川腾盾科技有限公司 Simulator accurate timing method and system based on instruction cycle and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019101543A (en) * 2017-11-29 2019-06-24 サンケン電気株式会社 Processor and pipeline processing method
CN109324987B (en) * 2018-09-27 2021-06-01 海信视像科技股份有限公司 Time sequence control method and device of analog communication interface and electronic equipment
CN112083954A (en) * 2019-06-13 2020-12-15 华夏芯(北京)通用处理器技术有限公司 Mask operation method of explicit independent mask register in GPU

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308191A (en) * 2017-07-28 2019-02-05 华为技术有限公司 Branch prediction method and device
WO2020181670A1 (en) * 2019-03-11 2020-09-17 Huawei Technologies Co., Ltd. Control flow optimization in graphics processing unit
CN110688153A (en) * 2019-09-04 2020-01-14 深圳芯英科技有限公司 Instruction branch execution control method, related equipment and instruction structure
CN112579555A (en) * 2019-09-27 2021-03-30 中国科学院深圳先进技术研究院 Method and apparatus for characterizing a blockchain system at a microarchitectural level
CN112579164A (en) * 2020-12-05 2021-03-30 西安翔腾微电子科技有限公司 SIMT conditional branch processing device and method
CN113986702A (en) * 2021-10-21 2022-01-28 四川腾盾科技有限公司 Simulator accurate timing method and system based on instruction cycle and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Low power pipelined 8-bit RISC processor design and implementation on FPGA; Jikku Jeemon; ICCICCT; 20151231; full text *
Research and design of an Internet-of-Things node processor based on the RISC-V architecture; 张雷; China Excellent Master's Theses Full-text Database; 20210715; full text *

Also Published As

Publication number Publication date
CN114253821A (en) 2022-03-29

Similar Documents

Publication Publication Date Title
US8972785B2 (en) Classifying processor testcases
US7779393B1 (en) System and method for efficient verification of memory consistency model compliance
CN104375941B (en) Executable program test use cases binary code coverage rate automates appraisal procedure
US8612944B2 (en) Code evaluation for in-order processing
Taylor et al. A micro-benchmark suite for AMD GPUs
US8359291B2 (en) Architecture-aware field affinity estimation
WO2017114472A1 (en) Method and apparatus for data mining from core traces
CN114253821B (en) Method and device for analyzing GPU performance and computer storage medium
Wang et al. Accurate source-level simulation of embedded software with respect to compiler optimizations
JP5514211B2 (en) Simulating processor execution with branch override
US6212493B1 (en) Profile directed simulation used to target time-critical crossproducts during random vector testing
CN102520984B (en) Computing method for worst time of object software in specified hardware environment
US20110252408A1 (en) Performance optimization based on data accesses during critical sections
US7110934B2 (en) Analysis of the performance of a portion of a data processing system
CN107769987B (en) Message forwarding performance evaluation method and device
Estrin et al. Modeling, measurement and computer power
Razouk The use of Petri Nets for modeling pipelined processors
CN116149917A (en) Method and apparatus for evaluating processor performance, computing device, and readable storage medium
US7885806B2 (en) Simulation method and simulation system of instruction scheduling
CN116521231A (en) Reference model for SPARC V8 instruction set dynamic simulation verification
Ikram et al. Measuring power and energy consumption of programs running on kepler GPUs
JP3688479B2 (en) Fault simulation apparatus, fault simulation method, and computer-readable recording medium recording the simulation program
US7302556B2 (en) Method, apparatus and computer program product for implementing level bias function for branch prediction control for generating test simulation vectors
JP5937530B2 (en) Software error analysis device, error information creation device
Helm et al. Measurement of Main Memory Bandwidth and Memory Access Latency in Intel Processors

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 301, Building D, Yeda Science and Technology Park, No. 300 Changjiang Road, Yantai Area, China (Shandong) Pilot Free Trade Zone, Yantai City, Shandong Province, 265503

Patentee after: Xi'an Xintong Semiconductor Technology Co.,Ltd.

Address before: 710065 Room 301, block B, Taiwei intelligent chain center, No. 8, Tangyan South Road, high tech Zone, Xi'an, Shaanxi Province

Patentee before: Xi'an Xintong Semiconductor Technology Co.,Ltd.

CP03 Change of name, title or address