CN109063340B

CN109063340B - Simulation-based GPU performance testing method and device

Info

Publication number: CN109063340B
Application number: CN201810879971.6A
Authority: CN
Inventors: 韩明月; 林建明; 赵璇
Original assignee: Glenfly Tech Co Ltd
Current assignee: Glenfly Tech Co Ltd
Priority date: 2018-08-03
Filing date: 2018-08-03
Publication date: 2023-08-25
Anticipated expiration: 2038-08-03
Also published as: CN109063340A

Abstract

The disclosure relates to a simulation-based GPU performance testing method and device. The method comprises the following steps: in the process of simulating the GPU on GPU simulation equipment, taking sampling frames as the input of the simulated GPU; collecting test data for each sampling frame while the simulated GPU processes the sampling frames; and calculating the performance of the simulated GPU from the test data of each sampling frame. Because the GPU is simulated on the GPU simulation equipment with sampling frames as its input, the performance of the GPU can be predicted more accurately in the chip design stage, the running time of the GPU simulation equipment can be reduced, and the simulation efficiency is improved.

Description

Simulation-based GPU performance testing method and device

Technical Field

The disclosure relates to the technical field of integrated circuit design, in particular to a simulation-based GPU performance test method and device.

Background

The development of system-on-chip technology has driven the rapid development of smart devices such as mobile phones, tablets, and smart set-top boxes. These devices are becoming more and more powerful, and users increasingly need to run 3D games and applications on them. This requires the graphics processing unit (GPU) of the system-on-chip to be capable enough to support these 3D applications. GPU performance is evaluated mainly with a number of widely used GPU evaluation tools (GPU benchmarks) and with evaluation tools built into games.

The widely used GPU evaluation tools mainly include GFXBench, 3DMark, BaseMark, etc. Their test scenes are usually complex scenes or game scenes, and the evaluation result is usually a frame rate (fps), a frame count, or a score calculated from the frame rate and/or the frame count.

The performance of the prediction method is mainly judged from two aspects: the magnitude of the prediction error and the difficulty of prediction.

1. Predicting GPU performance directly from the chip design specification is simple and convenient, but the prediction error is large and only a rough estimate can be made;

2. The prediction method using low-level performance also has limitations: first, the fitted empirical formula has large prediction errors across different GPU architectures, and the errors can be reduced by fitting the empirical formula with complete training data; second, when the ratio of the GPU frequency to the video memory frequency changes, the curve of the low-level performance does not fully track the curve of the score predicted for the GPU evaluation tool, so the prediction error increases when the frequency ratio changes; third, in the design stage, the results of low-level performance tests such as fill rate and triangle throughput must themselves be estimated, and the accuracy of those estimates also affects the predicted 3D performance of the chip;

3. A simulated GPU chip runs very slowly: whether on the CMODEL, a hardware simulation device, or an FPGA, it runs several orders of magnitude slower than the finally produced GPU chip. Therefore, in the chip design stage, it is not feasible to run a common GPU evaluation tool directly on the hardware simulation device.

Disclosure of Invention

In view of this, the present disclosure provides a GPU performance testing method and apparatus based on simulation, which can more accurately predict the performance of a GPU in a chip design stage, and simultaneously reduce the running time of GPU simulation equipment and improve the simulation efficiency.

According to an aspect of the present disclosure, there is provided a simulation-based GPU performance testing method, the method including:

in the process of simulating the GPU on the GPU simulation equipment, taking a sampling frame as the input of the simulated GPU;

inserting acquisition commands into the head and tail of each command buffer zone corresponding to the sampling frame;

in the process of processing the sampling frames based on the simulated GPU, for each sampling frame, collecting test data of the commands in the command buffers corresponding to that sampling frame while they run on the GPU simulation equipment, by means of the acquisition commands at the head and tail of each of those command buffers;

The performance of the emulated GPU is calculated from the test data for each frame of sample frames.

In one possible implementation, the test data includes a run time of each frame sample frame,

calculating the performance of the emulated GPU from the test data of each frame of sample frames, comprising:

the frame rate is calculated from the run time of each frame sample frame.

In one possible implementation, calculating the frame rate from the run time of each frame sample frame includes:

according to the running time of each frame sampling frame, utilizing a linear interpolation algorithm to conduct frame insertion between the first frame sampling frame and the last frame sampling frame, and determining the total frame number of the inserted frames;

calculating a frame rate according to the total frame number and the first sampling time interval;

wherein the first sampling time interval is a sampling time interval between a last frame of samples and a first frame of samples.

In one possible implementation, according to the running time of each frame sampling frame, a linear interpolation algorithm is used to perform frame interpolation between the first frame sampling frame and the last frame sampling frame, and the total frame number of the frame interpolation is determined, including:

for each inserted frame, calculating the run time of the inserted frame by using a linear interpolation algorithm according to the first time interval, the second time interval, the second sampling time interval, and the run times of the two sampling frames closest to the inserted frame;

wherein the first time interval is the time interval between the start time of the inserted frame and the sampling time of the previous sampling frame closest to the inserted frame; the second time interval is the time interval between the start time of the inserted frame and the sampling time of the next sampling frame closest to the inserted frame; and the second sampling time interval is the sampling time interval between the two sampling frames closest to the inserted frame;

if the sum of the start time of the inserted frame and the running time of the inserted frame is smaller than the sampling time of the sampling frame of the last frame, taking the sum of the start time of the inserted frame and the running time of the inserted frame as the running end time of the inserted frame;

if the sum of the start time of the inserted frame and the running time of the inserted frame is greater than or equal to the sampling time of the last frame sampling frame, ending the inserted frame and calculating the total frame number of the inserted frame;

wherein the start time of the first inserted frame is the sampling time of the first sampling frame, and the start time of each subsequent inserted frame is the run end time of the inserted frame immediately preceding it.

In one possible implementation, calculating the run time of the interpolated frame using a linear interpolation algorithm based on the first time interval, the second time interval, a sampling time interval between two sampling frames closest to the interpolated frame, and the run time of two sampling frames closest to the interpolated frame, includes:

Calculating the ratio of the second time interval to the second sampling time interval as a first weight;

calculating the ratio of the first time interval to the second sampling time interval as a second weight;

calculating a first product of the first weight and the first run time and a second product of the second weight and the second run time;

taking the sum of the first product and the second product as the run time of the inserted frame;

wherein, the first runtime is: the run time of the sample frame of the previous frame closest to the interpolated frame,

the second run time is: the run time of the sample frame of the next frame closest to the interpolated frame.
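The frame-insertion procedure described above can be illustrated with a minimal Python sketch. It is only one possible reading of the text, not the implementation of the disclosure: the function names are hypothetical, times are assumed to be in seconds, and the sketch assumes that an inserted frame that would end at or after the sampling time of the last sampling frame is not counted in the total.

```python
from bisect import bisect_right


def interpolated_run_time(start, times, run_times):
    """Linearly interpolate the run time of an inserted frame starting at `start`.

    times[i] is the sampling time of sampling frame i and run_times[i] is its
    measured run time (hypothetical inputs, both in seconds).
    """
    # Index of the previous sampling frame closest to the inserted frame.
    i = min(bisect_right(times, start) - 1, len(times) - 2)
    t_prev, t_next = times[i], times[i + 1]
    span = t_next - t_prev               # second sampling time interval
    w_prev = (t_next - start) / span     # first weight: second time interval / span
    w_next = (start - t_prev) / span     # second weight: first time interval / span
    return w_prev * run_times[i] + w_next * run_times[i + 1]


def count_inserted_frames(times, run_times):
    """Insert frames back to back between the first and last sampling frames."""
    if len(times) < 2 or len(times) != len(run_times):
        raise ValueError("need at least two sampling frames with matching run times")
    if any(b <= a for a, b in zip(times, times[1:])):
        raise ValueError("sampling times must be strictly increasing")
    if any(r <= 0 for r in run_times):
        raise ValueError("run times must be positive")

    start = times[0]          # the first inserted frame starts at the first sampling time
    total = 0
    while True:
        run = interpolated_run_time(start, times, run_times)
        if start + run >= times[-1]:
            break             # assumption: the frame crossing the end is not counted
        total += 1
        start += run          # the next inserted frame starts where this one ends
    return total


def frame_rate(times, run_times):
    """Total inserted frames divided by the first sampling time interval."""
    return count_inserted_frames(times, run_times) / (times[-1] - times[0])
```

Because every interpolated run time is a weighted average of positive run times, each inserted frame advances the start time by a positive amount and the loop terminates.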

In one possible implementation, calculating the frame rate from the run time of each sampling frame includes:

determining the average value of the run time of one sampling frame according to the run time of each sampling frame;

the frame rate is calculated from the running time average of a frame of sample frames.
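As a small illustration of the average-based variant, the sketch below assumes that the frame rate is taken as the reciprocal of the mean run time per sampling frame; the text does not spell out this step, and the function name and the unit (seconds) are likewise assumptions.

```python
from statistics import fmean


def frame_rate_from_mean(run_times_s):
    """Estimate the frame rate as 1 / (mean run time of one sampling frame).

    run_times_s holds the run time of each sampling frame in seconds
    (hypothetical input format).
    """
    if not run_times_s:
        raise ValueError("at least one sampling frame is required")
    mean_run_time = fmean(run_times_s)
    if mean_run_time <= 0:
        raise ValueError("mean run time must be positive")
    return 1.0 / mean_run_time
```

This variant is cheaper than frame insertion but weights every sampling frame equally, regardless of how the sampling frames are spaced in time.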

In one possible implementation, the test data includes a value of a hardware counter.

In one possible implementation, calculating the performance of the emulated GPU from the test data of each frame sample frame includes:

aiming at each command buffer zone, obtaining the running time of commands in the command buffer zone on the GPU simulation equipment according to the difference value of the hardware counter acquired by the acquisition commands at the tail part of the command buffer zone and the hardware counter acquired by the acquisition commands at the head part of the command buffer zone;

Taking the sum of the running time of commands in the command buffer corresponding to each frame of sampling frame on the GPU simulation equipment as the running time of the sampling frame;

the frame rate is calculated from the run time of each frame sample frame.
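The counter-based calculation can be sketched as follows. The sketch assumes that the hardware counter counts clock cycles and that a clock frequency `clock_hz` is known for converting cycles to seconds; neither is stated in the text, and the record layout `(frame_id, head_count, tail_count)` is hypothetical.

```python
def buffer_run_time(head_count, tail_count, clock_hz):
    """Run time of one command buffer from its head and tail counter values."""
    return (tail_count - head_count) / clock_hz


def frame_run_times(records, clock_hz):
    """Sum the run times of all command buffers belonging to each sampling frame.

    records: iterable of (frame_id, head_count, tail_count), one per command buffer.
    Returns a dict mapping frame_id to the run time of that sampling frame.
    """
    totals = {}
    for frame_id, head_count, tail_count in records:
        run = buffer_run_time(head_count, tail_count, clock_hz)
        totals[frame_id] = totals.get(frame_id, 0.0) + run
    return totals
```

For instance, with three command buffers `[("F1", 100, 300), ("F1", 300, 600), ("F2", 600, 1000)]`, the sampling frame F1 is charged with the cycles of its two buffers and F2 with those of its single buffer.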

In one possible implementation, the method further includes:

and sampling the original data to obtain the sampling frame.

In one possible implementation, sampling the original data to obtain the sampling frame includes:

and sampling the original data by adopting a fixed sampling time interval to obtain the sampling frame.

In one possible implementation, the test data of each frame sample frame includes a memory bandwidth corresponding to each frame sample frame in operation,

calculating the performance of the GPU according to the test data of each frame sampling frame comprises the following steps:

and calculating the memory bandwidth utilization rate of the GPU according to the proportion of the memory bandwidth corresponding to the operation of each frame sampling frame to the maximum value of the memory bandwidth.
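A minimal sketch of the bandwidth calculation, assuming the per-frame bandwidth and the maximum bandwidth are given in the same unit; the function name and return shape are hypothetical. It returns both the per-frame ratios and their average over the sampling frames.

```python
def bandwidth_utilization(used_per_frame, max_bandwidth):
    """Return (per-frame utilization ratios, average utilization).

    used_per_frame: memory bandwidth used while each sampling frame runs.
    max_bandwidth: maximum memory bandwidth, in the same unit.
    """
    if max_bandwidth <= 0:
        raise ValueError("max_bandwidth must be positive")
    if not used_per_frame:
        raise ValueError("at least one sampling frame is required")
    ratios = [used / max_bandwidth for used in used_per_frame]
    return ratios, sum(ratios) / len(ratios)
```

An average ratio close to 1 would suggest that the memory bandwidth may be a bottleneck for rendering.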

According to another aspect of the present disclosure, there is provided a simulation-based GPU performance testing apparatus, the apparatus comprising:

the input module is used for taking the sampling frame as the input of the simulated GPU in the process of simulating the GPU on the GPU simulation equipment;

The inserting module is used for inserting acquisition commands into the head part and the tail part of each command buffer area corresponding to the sampling frame;

the acquisition module is used for, in the process of processing the sampling frames based on the simulated GPU, collecting, for each sampling frame, test data of the commands in the command buffers corresponding to that sampling frame while they run on the GPU simulation equipment, by means of the acquisition commands at the head and tail of each of those command buffers;

and the calculation module is used for calculating the performance of the simulated GPU according to the test data of each frame sampling frame.

the computing module includes:

a first calculation sub-module for calculating a frame rate based on the run time of each frame sample frame.

In one possible implementation, the first computing submodule includes:

the frame inserting sub-module is used for carrying out frame inserting between the first frame sampling frame and the last frame sampling frame by utilizing a linear interpolation algorithm according to the running time of each frame sampling frame, and determining the total frame number of the inserted frames;

a first frame rate calculation sub-module for calculating a frame rate according to the total frame number and a first sampling time interval;

In a possible implementation manner, the frame inserting sub-module is further used for

the frame inserting sub-module is further configured to, if the sum of the start time of the inserted frame and the running time of the inserted frame is less than the sampling time of the last frame sampling frame, use the sum of the start time of the inserted frame and the running time of the inserted frame as the running end time of the inserted frame;

the frame inserting sub-module is further used for ending the frame insertion and calculating the total frame number of the inserted frames if the sum of the start time of the inserted frame and the run time of the inserted frame is greater than or equal to the sampling time of the last sampling frame; the start time of the first inserted frame is the sampling time of the first sampling frame, and the start time of each subsequent inserted frame is the run end time of the inserted frame immediately preceding it.

In one possible implementation, the frame inserting sub-module is further configured to:

In one possible implementation, the first computing sub-module further includes:

the average value determining sub-module is used for determining the average value of the running time of one frame of sampling frame according to the running time of each frame of sampling frame;

and the second frame rate computing sub-module is used for computing the frame rate according to the average value of the running time of one frame of sampling frame.

In one possible implementation, the computing module includes:

The first operation time calculation sub-module is used for obtaining the operation time of the command in each command buffer area on the GPU simulation equipment according to the difference value of the hardware counter acquired by the acquisition command at the tail part of the command buffer area and the hardware counter acquired by the acquisition command at the head part of the command buffer area;

a second operation time calculation sub-module, configured to use the sum of operation times of commands in the command buffer area corresponding to each frame of sampling frame on the GPU emulation device as the operation time of the sampling frame;

In one possible implementation, the apparatus further includes:

and the sampling module is used for sampling the original data to obtain the sampling frame.

In one possible implementation, the sampling module includes:

and the sampling sub-module is used for sampling the original data by adopting a fixed sampling time interval to obtain the sampling frame.

the computing module includes:

and the second computing sub-module is used for computing the memory bandwidth utilization rate of the GPU according to the proportion of the memory bandwidth used during the operation of each sampling frame to the maximum value of the memory bandwidth.

According to another aspect of the present disclosure, there is provided a simulation-based GPU performance testing apparatus, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-described method.

The beneficial effects are that:

in the chip design stage, simulating the GPU on GPU simulation equipment is more accurate than estimating GPU performance directly from the chip design specification or from low-level performance, and using sampling frames reduces the running time of the GPU simulation equipment.

Therefore, according to the simulation-based GPU performance testing method and device, the GPU is simulated on the GPU simulation equipment by taking the sampling frame as the input of the simulated GPU, so that the performance of the GPU can be predicted more accurately in the chip design stage, the running time of the GPU simulation equipment can be reduced, and the simulation efficiency is improved.

By inserting the acquisition command into the head and tail of each command buffer area corresponding to the sampling frame, the test data corresponding to the sampling frame can be acquired through the acquisition command in the process of processing the sampling frame by the simulated GPU, so that the acquired data more accords with the actual running condition, and the performance of the GPU can be predicted more accurately.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.

FIG. 1 illustrates a flowchart of a simulation-based GPU performance testing method according to an embodiment of the present disclosure.

FIG. 2 illustrates a flowchart of a simulation-based GPU performance testing method according to an embodiment of the present disclosure.

Fig. 3 shows a schematic diagram after insertion of acquisition commands at the head and tail of a command buffer, according to an embodiment of the present disclosure.

Fig. 4 shows a flowchart of a method of step S13 according to an embodiment of the present disclosure.

Fig. 5 shows a flowchart of a method of step S133 according to an embodiment of the present disclosure.

Fig. 6 shows a schematic diagram of frame insertion according to an embodiment of the present disclosure.

FIG. 7 illustrates a block diagram of a simulation-based GPU performance testing apparatus, according to an embodiment of the present disclosure.

FIG. 8 illustrates a block diagram of a simulation-based GPU performance testing apparatus, according to an embodiment of the present disclosure.

FIG. 9 illustrates a block diagram of a simulation-based GPU performance testing apparatus, according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.

There are mainly two evaluation methods:

in the first, the GPU evaluation tool renders a dynamic test scene for a fixed time (e.g., 1 s), counts how many frames were rendered in that time, and then calculates the frame rate from the frame count;

in the second, the GPU evaluation tool renders a given number of frames (e.g., 10 frames), measures the time taken to render them to completion, and then calculates the frame rate.

The frame rate calculation modes of the two evaluation methods are as shown in the formula (1):

fps = total_frames / total_seconds    (1)

where total_frames is the total number of frames rendered at run-time and total_seconds is the total run-time.
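Formula (1) applies to both evaluation methods; only the quantity held fixed differs. The short sketch below makes this concrete. The function name and the numbers in the comments are purely illustrative and are not measurements from the disclosure.

```python
def fps(total_frames, total_seconds):
    """Formula (1): frame rate as total rendered frames over total run time."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return total_frames / total_seconds


# Fixed-time method: the time is fixed and the rendered frames are counted,
# e.g. fps(60, 1.0) for 60 frames rendered in 1 s.
# Fixed-frame method: the frame count is fixed and the elapsed time is measured,
# e.g. fps(10, 0.25) for 10 frames that took 0.25 s.
```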

In addition, the memory bandwidth utilization rate of the GPU can be calculated by evaluating the proportion of the memory bandwidth used by the GPU in rendering, so as to know whether the design of the memory bandwidth is sufficient, that is, whether the memory bandwidth forms a bottleneck of the rendering capability of the GPU.

Various methods exist for predicting GPU performance in the GPU design phase, for example, (1) 3D rendering performance of the GPU is predicted directly using chip design specifications, such as number of stream processors in the chip, frequency, bandwidth, etc.; (2) Predicting chip 3D rendering performance using low-level performance (low-level performance); (3) In the chip design stage, CMODEL (C hardware model) is used to describe the hardware, and the designed chip is simulated on a hardware simulation device or FPGA, and the performance counter data is obtained using the hardware simulation device, so as to predict the performance of the actual chip.

However, as described above, the above method has a large prediction error or low efficiency, and thus the present disclosure provides a GPU performance test method capable of realizing more accurate prediction of the performance of the GPU in the chip design stage, while reducing the running time of the GPU simulation device and improving the simulation efficiency.

FIG. 1 illustrates a flowchart of a simulation-based GPU performance testing method according to an embodiment of the present disclosure. As shown in fig. 1, the method may include:

step S11, taking a sampling frame as the input of the simulated GPU in the process of simulating the GPU on the GPU simulation equipment.

The GPU emulation device may be an FPGA (Field Programmable Gate Array), a device running CMODEL (C hardware model), or a hardware emulation device.

For example, according to the established design specification of the GPU chip, the GPU may be described in a hardware description language (HDL) to form the HDL code of the GPU, and the HDL code of the GPU is then input into a C hardware model, burned into an FPGA, or the like, to simulate the GPU.

The sampling frames in step S11 may be obtained in advance by sampling the original data on other devices. The other devices may be devices running 3D evaluation tools, and the 3D evaluation tools may be widely used GPU evaluation tools such as GFXBench, 3DMark, and BaseMark. The original data may be data representing 3D scenes, e.g., an API run-sequence 3D script of a 3D evaluation tool, a 3D multimedia resource such as 3D video, etc.

In one example, an API run-sequence 3D script of a 3D evaluation tool may be sampled to obtain m sampling frames, which may be denoted as F1, F2, F3, ..., Fm.

In another example, the 3D multimedia resource may also be sampled to obtain m frames of sampled frames, which is not limited by the present disclosure. In this example, the original sample frame may be obtained by first sampling the 3D multimedia resource, and a 3D script of the sample frame may be generated from the original sample frame.

The 3D script of m frames of sample frames may then be input into the GPU emulation device as input to the emulated GPU.

The sampling frequency may be fixed, e.g., the original data is sampled at a fixed sampling time interval. The sampling frequency may also vary; specifically, it may be set according to how the 3D scene changes. For example, parts of the original data in which the 3D scene changes continuously may be sampled at a lower frequency, and parts with more 3D scene switching may be sampled at a higher frequency, so that the prediction result is more accurate.

The higher the sampling frequency, the denser the sampling, the more sampling frames there are, and the more accurate the performance prediction, but the longer the simulated GPU takes to process the sampling frames. The sampling frequency is therefore set by weighing prediction accuracy against simulation efficiency, taking into account factors such as the data volume of the original data, the amount of scene switching in the 3D scene, and the simulated GPU itself.
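Fixed-interval sampling can be sketched as below. The sketch assumes that the original data exposes a start time for each original frame, which the text does not state; the function name and input format are hypothetical.

```python
def sample_fixed_interval(frame_times, interval):
    """Return the indices of the frames picked at a fixed sampling time interval.

    frame_times: start time of each original frame, in ascending order.
    interval: sampling time interval, in the same unit.
    A frame is picked when at least `interval` has elapsed since the
    previously picked frame; the first frame is always picked.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    picked = []
    next_time = None
    for index, time in enumerate(frame_times):
        if next_time is None or time >= next_time:
            picked.append(index)
            next_time = time + interval
    return picked
```

A variable sampling frequency could be obtained in the same way by letting `interval` depend on the current part of the 3D scene, at the cost of more sampling frames for the simulated GPU to process.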

Step S12, collecting test data of each frame of sampling frame in the process of processing the sampling frame based on the simulated GPU.

After the 3D script of the m sampling frames is input into the GPU emulation device, the emulated GPU may process the m sampling frames, e.g., render the sampling frames frame by frame.

The test data may be parameters that are related to the performance of the GPU, such as, for example, the runtime of the emulated GPU processing a frame of sample frames (i.e., the runtime of a frame of sample frames on the emulated GPU), the memory bandwidth used by the emulated GPU in processing a frame of sample frames (i.e., the memory bandwidth used in each frame of sample frames running, respectively), etc.

In one example, during the processing of a frame of sample frames by the emulated GPU, test data may be obtained by reading data from a counter, or data from memory, etc.

Step S13, calculating the performance of the simulated GPU according to the test data of each frame sampling frame.

The performance of the emulated GPU may include frame rate, memory bandwidth usage, etc. For example, the test data may include the run time of each sampling frame, and the GPU emulation device may calculate the frame rate of the emulated GPU based on the run time of each sampling frame. The test data may further include the memory bandwidth used while each sampling frame runs, and the GPU emulation device may calculate the memory bandwidth usage rate of the GPU according to the proportion of the memory bandwidth used while each sampling frame runs to the maximum memory bandwidth. The memory bandwidth usage rate may refer to, for example, an instantaneous bandwidth usage rate or an average bandwidth usage rate, the latter being the average, over the sampling frames, of the proportion of the memory bandwidth used while each sampling frame runs to the maximum memory bandwidth.

In the chip design stage, simulating the GPU on GPU simulation equipment is more accurate than estimating GPU performance directly from the chip design specification or from low-level performance, and using sampling frames reduces the running time of the GPU simulation equipment. Therefore, according to the simulation-based GPU performance test method of the present disclosure, the GPU is simulated on the GPU simulation equipment by taking the sampling frames as the input of the simulated GPU, so that the performance of the GPU can be predicted more accurately in the chip design stage, the running time of the GPU simulation equipment can be reduced, and the simulation efficiency is improved.

FIG. 2 illustrates a flowchart of a simulation-based GPU performance testing method according to an embodiment of the present disclosure. As shown in fig. 2, the method may further include:

in step S14, the acquisition command is inserted into the header and the trailer of each command buffer corresponding to the sampling frame.

As an example, as shown in fig. 2, step S14 may be performed after step S11 and before step S12. More generally, step S14 only needs to be performed before step S12; for example, the acquisition commands may be inserted before the sampling frames are input to the simulated GPU, or while the sampling frames are being input to the simulated GPU, which is not limited in this disclosure.

The size of the command buffer (command buffer) can be different in different systems, and the user can set the size of the command buffer according to the requirement. The 3D script for each frame of sample frames may be stored in one or more command buffers, and thus, one sample frame may correspond to one or more command buffers.

The acquisition command may be a section of GPU commands used to acquire test data of each sampling frame while the emulated GPU processes the sampling frames. For example, the acquisition command may be used to save the current value of the hardware counter to pre-allocated memory, such as a dump counter command; acquisition commands may also be used to read the memory bandwidth.

The hardware counter may be provided on the GPU emulation device or may be provided in the emulated GPU, which is not limited in this disclosure.

Taking the data of the hardware counter as an example: when a 3D evaluation tool runs, it creates a 3D context. In the GPU driver of the emulated GPU, the present disclosure may control, for one or more 3D contexts, whether the hardware counter needs to be acquired. Once acquisition is enabled, the GPU driver inserts a section of GPU commands at the head and tail of each command buffer of the 3D context for which acquisition is enabled, and the GPU driver may drive these GPU commands so that the current value of the hardware counter is saved into pre-allocated video memory.

Commands are inserted only when performance needs to be analyzed, so in an implementation the register controlling the dump counter is set through a setprop function or by reading a file. If the dump counter register is set, the GPU driver may determine the head and tail of each command buffer and insert the dump counter command there.

In one possible implementation, the acquisition command may be inserted at the head and tail of each command buffer corresponding to the sample frame during the input of the sample frame to the GPU hardware emulation device.

For example, in the process of inputting the 3D script of the m sampling frames into the GPU emulation device, taking the first sampling frame as an example, the GPU driver determines whether the register of the dump counter is set. If it is set, the GPU driver inserts the acquisition command (dump counter command) at the head of the command buffer, takes a portion of the 3D script of the first sampling frame that fits the size of the command buffer and stores it in the command buffer, inserts the acquisition command (dump counter command) at the tail of the command buffer, and then moves on to the next command buffer. If the first sampling frame has been completely stored but the last command buffer corresponding to it is not yet full, the 3D script of the second sampling frame is not stored in that last command buffer; instead, a new command buffer is used for the second sampling frame, so that test data can be collected separately for each sampling frame. Fig. 3 shows a schematic diagram after insertion of acquisition commands at the head and tail of a command buffer, according to an embodiment of the present disclosure.
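The buffer-filling behaviour described above can be sketched as follows. This is an illustrative model only: commands are represented as plain strings, `DUMP_COUNTER` is a hypothetical placeholder for the acquisition command, and the buffer capacity is assumed to count only the commands of the 3D script, not the two acquisition commands.

```python
DUMP_COUNTER = "DUMP_COUNTER"  # hypothetical stand-in for the dump counter command


def pack_frames(frames, buffer_capacity, dump_enabled=True):
    """Split each sampling frame's commands into command buffers.

    frames: dict mapping a sampling-frame identifier to its list of commands.
    buffer_capacity: maximum number of script commands per command buffer.
    Returns a list of (frame_id, commands) pairs, one per command buffer.
    A command buffer never holds commands of two different sampling frames.
    """
    if buffer_capacity < 1:
        raise ValueError("buffer_capacity must be at least 1")
    buffers = []
    for frame_id, commands in frames.items():
        for start in range(0, len(commands), buffer_capacity):
            body = commands[start:start + buffer_capacity]
            if dump_enabled:
                # Acquisition commands at the head and tail of the buffer.
                body = [DUMP_COUNTER, *body, DUMP_COUNTER]
            buffers.append((frame_id, body))
    return buffers
```

With `frames = {"F1": ["a", "b", "c"], "F2": ["d"]}` and a capacity of 2, the second buffer of F1 holds only `"c"`, and `"d"` starts a new buffer rather than filling the remaining slot.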

In this embodiment, "collecting test data of each sampling frame" in step S12 may include:

for each sampling frame, collecting, through the acquisition command at the head and the acquisition command at the tail of each command buffer corresponding to that sampling frame, test data of the commands in that command buffer while they run on the GPU emulation device.

In one possible implementation, each command buffer may carry an identifier of the corresponding sampling frame. The identifier may be any information that uniquely identifies a sampling frame, such as a serial number assigned to the sampling frame in advance. After the test data is collected, the test data of each sampling frame can thus be determined from the sampling frame identifier carried by each command buffer. For example, if command buffer 1, command buffer 2, and command buffer 3 all carry the identifier of the sampling frame F1, the test data of the sampling frame F1 can be determined from the test data collected by the acquisition commands at the head and tail of command buffer 1, command buffer 2, and command buffer 3.

In one possible implementation, the test data may include a value of a hardware counter. Then, as shown in fig. 4, step S13, calculating the performance of the emulated GPU according to the test data of each frame sampling frame may include:

Step S131, for each command buffer, obtaining the running time of the commands in the command buffer on the GPU emulation device from the difference between the hardware counter value collected by the acquisition command at the tail of the command buffer and the hardware counter value collected by the acquisition command at the head of the command buffer.

After the GPU driver drives the acquisition command to store the current value of the hardware counter in the pre-allocated video memory, the GPU driver can read the hardware counter values from the pre-allocated video memory and store them in a data file. The data file may include the identifier of each command buffer together with the hardware counter value collected by the corresponding head acquisition command and the hardware counter value collected by the corresponding tail acquisition command. For example, the data file may be stored as a table containing multiple records, each of which includes the identifier of a command buffer, the hardware counter value of its head acquisition command, and the hardware counter value of its tail acquisition command.

In one possible implementation, the emulated GPU may further include a data processing module (e.g., a processor), and the GPU emulation device may use the data processing module to calculate, from the values in the data file, the running time of the commands in each command buffer on the GPU emulation device. Taking command buffer 1 as an example, the GPU emulation device may use the data processing module to calculate the difference between the hardware counter value collected by the acquisition command at the tail of command buffer 1 and the hardware counter value collected by the acquisition command at its head, and take this difference as the running time of the commands in command buffer 1 on the GPU emulation device.

Step S132, taking the sum of the running times, on the GPU emulation device, of the commands in the command buffers corresponding to each sampling frame as the running time of that sampling frame.

The running time of a sampling frame represents the time the sampling frame takes to run on the GPU emulation device, based on the emulated GPU.

As described above, for the sampling frame F1, the running times of the commands in the command buffer 1, the command buffer 2, and the command buffer 3 on the GPU emulation device may be calculated, respectively, and then the sum of the running times of the commands in the command buffer 1, the command buffer 2, and the command buffer 3 on the GPU emulation device is taken as the running time of the sampling frame F1.

The running times RT(F1), RT(F2), RT(F3), …, RT(Fm) of the sampling frames F1, F2, F3, …, Fm can be calculated by the above procedure.

It should be noted that the above process describes how the running time is calculated. For the memory bandwidth used by each sampling frame during operation, the test data may further include the value of a counter related to the memory bandwidth used during operation, and the memory bandwidth used by each sampling frame during operation may be calculated in a manner similar to steps S131 and S132.
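Steps S131 and S132 can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout (command buffer identifier, sampling frame identifier, head counter value, tail counter value) and the function name `frame_run_times` are hypothetical, the counter values are made up, and the running time is left in counter units because the disclosure does not specify a unit.

```python
from collections import defaultdict

# Hypothetical records read back from the data file:
# (command buffer id, sampling frame id, head counter value, tail counter value)
records = [
    ("cb1", "F1", 1000, 1400),
    ("cb2", "F1", 1400, 1900),
    ("cb3", "F1", 1900, 2050),
    ("cb4", "F2", 2050, 2650),
]


def frame_run_times(records):
    """Return the running time of each sampling frame in counter units."""
    totals = defaultdict(int)
    for _buffer_id, frame_id, head_value, tail_value in records:
        buffer_run_time = tail_value - head_value  # step S131: tail minus head
        totals[frame_id] += buffer_run_time        # step S132: sum per frame
    return dict(totals)


run_times = frame_run_times(records)
print(run_times)
```

Grouping by the frame identifier carried with each command buffer is what lets the three buffers of F1 contribute to a single running time RT(F1).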

Step S133, calculating the frame rate according to the running time of each frame sampling frame.

The frame rate may represent the number of frames displayed per second.

In one possible implementation, the mean running time of a sampling frame may be determined from the running times of all sampling frames, and the frame rate may then be calculated from this mean, for example by taking the reciprocal of the mean running time as the frame rate.

Taking the m sampling frames F1, F2, F3, …, Fm as an example, the frame rate can be calculated using the following equation (2):

fps = 1 / ((RT(F1) + RT(F2) + … + RT(Fm)) / m) (2)

With this GPU performance test method, the running time of each sampling frame can be obtained accurately, and the frame rate of the GPU can therefore be predicted accurately.
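Equation (2) amounts to taking the reciprocal of the mean running time. A minimal Python sketch follows; the function name `frame_rate_from_mean` is hypothetical, and the running times are assumed to have already been converted to seconds so that the result is in frames per second.

```python
def frame_rate_from_mean(run_times_s):
    """Frame rate per equation (2): reciprocal of the mean running time.

    `run_times_s` holds RT(F1) ... RT(Fm), assumed to be in seconds.
    """
    if not run_times_s:
        raise ValueError("at least one sampling frame is required")
    mean_run_time = sum(run_times_s) / len(run_times_s)
    return 1.0 / mean_run_time


# Three sampling frames that take 20 ms, 25 ms and 15 ms.
fps = frame_rate_from_mean([0.020, 0.025, 0.015])
print(round(fps, 3))
```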

Fig. 5 shows a flowchart of a method of step S133 according to an embodiment of the present disclosure. As shown in fig. 5, step S133 may include:

Step S1331, according to the running time of each sampling frame, performing frame insertion between the first sampling frame and the last sampling frame using a linear interpolation algorithm, and determining the total number of inserted frames.

Step S1332, calculating the frame rate according to the total number of frames and the first sampling time interval, where the first sampling time interval is the sampling time interval between the last sampling frame and the first sampling frame.

Fig. 6 shows a schematic diagram of frame insertion according to an embodiment of the present disclosure. As shown in Fig. 6, the abscissa of each point at which the sampling frames F1, F2, F3, …, Fm are located represents the sampling time of that sampling frame, and the time interval dTfixtime between two adjacent sampling frames represents the sampling time interval.

In one possible implementation, frame insertion may start from F1, so that the start time of the first inserted frame is the sampling time of the first sampling frame F1. The start time of every later inserted frame is the running end time of the inserted frame immediately before it; in other words, the running end time of one inserted frame is taken as the start time of the next inserted frame.

For each inserted frame, the running time of the inserted frame is calculated using a linear interpolation algorithm based on the first time interval, the second time interval, the second sampling time interval, and the running times of the two sampling frames closest to the inserted frame.

Here, the first time interval is the time interval between the start time of the inserted frame and the sampling time of the previous sampling frame closest to the inserted frame; the second time interval is the time interval between the start time of the inserted frame and the sampling time of the next sampling frame closest to the inserted frame; and the second sampling time interval is the sampling time interval between the two sampling frames closest to the inserted frame. When the sampling frequency is fixed, the second sampling time interval is the dTfixtime described above.

As shown in Fig. 6, take the frame inserted at point A as an example. The start time of this inserted frame is denoted T(A), and the running time of the frame inserted at point A is to be calculated.

In this case, the first time interval is dT1 = T(A) - T(Fn), where T(Fn) is the sampling time of the sampling frame Fn, and Fn denotes the previous sampling frame closest to the inserted frame.

The second time interval is dT2 = T(Fn+1) - T(A), where T(Fn+1) is the sampling time of the sampling frame Fn+1, and Fn+1 denotes the next sampling frame closest to the inserted frame.

The second sampling time interval is dTfixtime = T(Fn+1) - T(Fn), that is, the sampling time interval between the two sampling frames closest to the inserted frame; when the sampling frequency is fixed, dTfixtime is constant.

In one possible implementation, the running time of the inserted frame may be calculated by the following equation (3):

dT = RT(Fn) * weight1 + RT(Fn+1) * weight2 (3)

where weight1 = dT2/dTfixtime and weight2 = dT1/dTfixtime.
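Equation (3) and its two weights can be written as a small Python function. This is a sketch only: the function and parameter names are hypothetical, and all times are assumed to be expressed in the same unit (seconds in the example).

```python
def interpolated_run_time(t_start, t_prev, t_next, rt_prev, rt_next):
    """Running time dT of a frame inserted at time `t_start`, per equation (3).

    t_prev, t_next   : sampling times T(Fn) and T(Fn+1) of the two closest frames
    rt_prev, rt_next : their running times RT(Fn) and RT(Fn+1)
    """
    dt_fix = t_next - t_prev   # dTfixtime, the second sampling time interval
    dt1 = t_start - t_prev     # first time interval
    dt2 = t_next - t_start     # second time interval
    weight1 = dt2 / dt_fix     # the closer T(A) is to Fn, the larger weight1
    weight2 = dt1 / dt_fix
    return rt_prev * weight1 + rt_next * weight2


# A frame inserted a quarter of the way from Fn (20 ms) to Fn+1 (40 ms).
dT = interpolated_run_time(0.25, 0.0, 1.0, 0.020, 0.040)
print(round(dT, 6))
```

Since weight1 + weight2 = 1, the result always lies between RT(Fn) and RT(Fn+1) and equals RT(Fn) when the inserted frame starts exactly at T(Fn).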

After calculating the run time of the inserted frame, the sum of the start time of the inserted frame and the run time of the inserted frame may be calculated.

If the sum of the start time of the inserted frame and the running time of the inserted frame is less than the sampling time of the last sampling frame, that sum is taken as the running end time of the inserted frame, and frame insertion continues. As shown in Fig. 6, T(B) = T(A) + dT may be taken as the running end time of the frame inserted at point A; T(B) is also the start time of the next inserted frame, and the above process is repeated to continue frame insertion.

If the sum of the start time of the inserted frame and the running time of the inserted frame is greater than or equal to the sampling time of the last sampling frame, frame insertion ends and the total number of inserted frames is calculated.

In one example, if the sum of the start time of the inserted frame and the running time of the inserted frame is greater than the sampling time of the last sampling frame, the ratio of the difference between the sampling time of the last sampling frame and the start time of the last inserted frame to the running time of the last inserted frame may be taken as the fractional part of the total number of inserted frames. For example, if T(B) = T(A) + dT > T(Fm), the fractional part of the total number of inserted frames may be (T(Fm) - T(A))/dT. Those skilled in the art will appreciate that this is merely one way of determining the total number of inserted frames and does not limit the present disclosure in any way; for example, the fractional part may be left undetermined and T(A) taken as the time at which frame insertion ends.

In one example, after the total number of inserted frames is calculated, the ratio of the total number of frames to the first sampling time interval may be taken as the frame rate of the emulated GPU, as shown in the following equation (4):

fps = total_frame1 / (T(Fm) - T(F1)) (4)

where total_frame1 is the total number of inserted frames and (T(Fm) - T(F1)) is the sampling time interval between the last sampling frame and the first sampling frame.
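The whole frame insertion procedure, from the first inserted frame at T(F1) to equation (4), can be sketched in Python as below. The sketch makes several assumptions that are not stated in the disclosure: the function name `interpolated_frame_rate` is hypothetical, sampling times are strictly increasing and in seconds, running times are positive and in seconds, and the fractional last frame is always counted (the variant that stops at T(A) is not shown).

```python
from bisect import bisect_right


def interpolated_frame_rate(sample_times, run_times):
    """Insert frames from T(F1) to T(Fm) and return fps per equation (4)."""
    if len(sample_times) < 2 or len(sample_times) != len(run_times):
        raise ValueError("need matching lists with at least two sampling frames")
    if any(rt <= 0 for rt in run_times):
        raise ValueError("running times must be positive")

    t_first, t_last = sample_times[0], sample_times[-1]
    t = t_first          # start time of the current inserted frame
    total_frames = 0.0
    while t < t_last:
        # Index n of the previous sampling frame Fn closest to time t.
        n = min(bisect_right(sample_times, t) - 1, len(sample_times) - 2)
        t_prev, t_next = sample_times[n], sample_times[n + 1]
        dt_fix = t_next - t_prev
        weight1 = (t_next - t) / dt_fix
        weight2 = (t - t_prev) / dt_fix
        dt = run_times[n] * weight1 + run_times[n + 1] * weight2  # equation (3)
        if t + dt < t_last:
            total_frames += 1       # a whole inserted frame fits
            t += dt                 # its end time starts the next frame
        else:
            total_frames += (t_last - t) / dt   # fractional last frame
            break
    return total_frames / (t_last - t_first)    # equation (4)


# Two sampling frames one second apart, each taking 0.25 s to run.
fps = interpolated_frame_rate([0.0, 1.0], [0.25, 0.25])
print(fps)
```

With constant running times the result matches the reciprocal of the running time, which is a convenient sanity check for the loop.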

During data processing, the 3D performance of the emulated GPU is estimated using linear interpolation and recursion, which improves the accuracy of the prediction and reduces the prediction error. The simulation-based GPU performance testing method of the present disclosure thus achieves the purpose of predicting GPU performance accurately and conveniently at the chip design stage.

FIG. 7 illustrates a block diagram of a simulation-based GPU performance testing apparatus, according to an embodiment of the present disclosure. As shown in fig. 7, the apparatus may include:

an input module 71, configured to take a sampling frame as an input of the simulated GPU in a process of simulating the GPU on the GPU simulation device;

an inserting module 72, configured to insert an acquisition command at a head and a tail of each command buffer corresponding to a sampling frame;

an acquisition module 73, configured to collect, for each sampling frame and while the sampling frame is being processed by the emulated GPU, test data of the commands in each command buffer corresponding to that sampling frame as they run on the GPU emulation device, through the acquisition command at the head and the acquisition command at the tail of that command buffer;

a calculation module 74, configured to calculate the performance of the emulated GPU according to the test data of each sampling frame.

At the chip design stage, emulating the GPU on a GPU emulation device is more accurate than estimating GPU performance directly from the chip design specification or from a lower-end chip, and using sampling frames reduces the running time of the GPU emulation device. Therefore, with the simulation-based GPU performance testing apparatus of the present disclosure, sampling frames are used as the input of the emulated GPU while the GPU is emulated on the GPU emulation device, so that GPU performance can be predicted more accurately at the chip design stage, the running time of the GPU emulation device can be reduced, and emulation efficiency is improved. Because acquisition commands are inserted at the head and tail of each command buffer corresponding to a sampling frame, the test data of the sampling frame can be collected through these commands while the emulated GPU processes the sampling frame; the collected data therefore better reflects actual running conditions, and GPU performance can be predicted more accurately.

FIG. 8 illustrates a block diagram of a simulation-based GPU performance testing apparatus, according to an embodiment of the present disclosure. As shown in fig. 8, in one possible implementation, the test data includes a run time of each frame sample frame, and the calculation module 74 may include:

A first calculation sub-module 741 for calculating a frame rate from the run time of each frame sample frame.

In one possible implementation, the first computing sub-module 741 may include:

the frame inserting submodule 7411 is used for carrying out frame inserting between the first frame sampling frame and the last frame sampling frame by utilizing a linear interpolation algorithm according to the running time of each frame sampling frame and determining the total frame number of the inserted frames;

a first frame rate calculation sub-module 7412 for calculating a frame rate based on the total frame number and the first sampling time interval;

In one possible implementation, the frame inserting submodule 7411 is further configured to calculate, for each inserted frame, the running time of the inserted frame using a linear interpolation algorithm according to the first time interval, the second time interval, the second sampling time interval, and the running times of the two sampling frames closest to the inserted frame;

The frame inserting submodule 7411 is further configured to take the sum of the start time of the inserted frame and the running time of the inserted frame as the running end time of the inserted frame if the sum of the start time of the inserted frame and the running time of the inserted frame is less than the sampling time of the last frame sampling frame;

the frame inserting submodule 7411 is further configured to end frame inserting and calculate a total frame number of the inserted frame if a sum of a start time of the inserted frame and an operation time of the inserted frame is greater than or equal to a sampling time of a sampling frame of a last frame;

In one possible implementation, the frame insertion submodule 7411 is further configured to:

In one possible implementation, the first computing sub-module 741 may further include:

a mean value determination submodule 7413, configured to determine the mean running time of a sampling frame according to the running time of each sampling frame;

a second frame rate calculation sub-module 7414, configured to calculate the frame rate from the mean running time of a sampling frame.

The calculation module 74 may include:

a first runtime calculation sub-module 742, configured to obtain, for each command buffer, a runtime of a command in the command buffer on the GPU emulation device according to a difference between a value of a hardware counter acquired by a collection command at a tail portion of the command buffer and a value of a hardware counter acquired by a collection command at a head portion of the command buffer;

a second runtime calculation sub-module 743, configured to take the sum of the runtime of the commands in the command buffer corresponding to each frame sample frame on the GPU emulation device as the runtime of the sample frame.

In one possible implementation, the apparatus may further include:

a sampling module 75, configured to sample the original data to obtain the sampling frame.

In one possible implementation, the sampling module 75 may include:

and the sampling submodule 751 is used for sampling the original data at fixed sampling time intervals to obtain the sampling frame.

In one possible implementation, the test data of each sampling frame includes the memory bandwidth used by that sampling frame during operation,

the calculation module 74 may include:

a second computing sub-module 744, configured to calculate the memory bandwidth usage of the GPU, such as the instantaneous bandwidth usage or the average bandwidth usage, according to the ratio of the memory bandwidth used by each sampling frame during operation to the designed maximum memory bandwidth.
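The bandwidth usage calculation can be illustrated with a short Python sketch. It rests on assumptions the disclosure does not spell out: the function name `bandwidth_usage` is hypothetical, the per-frame ratio is treated as the "instantaneous" usage, the mean of those ratios is treated as the "average" usage, and all bandwidth values share one unit (GB/s in the example).

```python
def bandwidth_usage(frame_bandwidths, max_bandwidth):
    """Return (per-frame usage ratios, average usage ratio).

    frame_bandwidths : memory bandwidth used by each sampling frame
    max_bandwidth    : designed maximum memory bandwidth, same unit
    """
    if max_bandwidth <= 0:
        raise ValueError("designed maximum memory bandwidth must be positive")
    if not frame_bandwidths:
        raise ValueError("at least one sampling frame is required")
    per_frame = [bw / max_bandwidth for bw in frame_bandwidths]
    average = sum(per_frame) / len(per_frame)
    return per_frame, average


# Three sampling frames measured against a 16 GB/s design limit.
per_frame, average = bandwidth_usage([8.0, 12.0, 4.0], max_bandwidth=16.0)
print(per_frame, average)
```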

FIG. 9 is a block diagram of a simulation-based GPU performance testing apparatus 900 according to an exemplary embodiment. Referring to Fig. 9, the apparatus 900 may include a processor 901 and a machine-readable storage medium 902 storing machine-executable instructions. The processor 901 and the machine-readable storage medium 902 may communicate via a system bus 903. The processor 901 performs the simulation-based GPU performance testing method described above by reading, from the machine-readable storage medium 902, the machine-executable instructions corresponding to the simulation-based GPU performance testing logic.

The machine-readable storage medium 902 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions or data. For example, the machine-readable storage medium may be RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk or DVD), a similar storage medium, or a combination thereof.

The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvement of the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A simulation-based GPU performance testing method, the method comprising:

2. The method of claim 1, wherein the test data comprises a run time of each frame sample frame,

the frame rate is calculated from the run time of each frame sample frame.

3. The method of claim 2, wherein calculating the frame rate from the run time of each frame sample frame comprises:

4. A method according to claim 3, wherein interpolating frames between a first frame sample frame and a last frame sample frame using a linear interpolation algorithm based on the run time of each frame sample frame, and determining the total number of frames of the interpolated frames, comprises:

5. The method of claim 4, wherein calculating the run time of the interpolated frame using the linear interpolation algorithm based on the first time interval, the second time interval, the sampling time interval between two sampling frames closest to the interpolated frame, and the run time of two sampling frames closest to the interpolated frame, comprises:

6. The method of claim 2, wherein calculating the frame rate from the run time of each frame sample frame comprises:

7. The method of any one of claims 1 to 6, wherein the test data comprises a hardware counter value.

8. The method of claim 7, wherein calculating the performance of the emulated GPU from the test data for each frame of sample frames comprises:

the frame rate is calculated from the run time of each frame sample frame.

9. The method according to claim 1, wherein the method further comprises:

and sampling the original data to obtain the sampling frame.

10. The method of claim 9, wherein sampling the raw data to obtain the sampled frame comprises:

11. The method of claim 1, wherein the test data for each frame of sample frames includes a memory bandwidth corresponding to each frame of sample frames in operation,

12. A simulation-based GPU performance testing apparatus, the apparatus comprising:

13. The apparatus of claim 12, wherein the test data comprises a run time of each frame sample frame,

the computing module includes:

14. The apparatus of claim 13, wherein the first computing submodule comprises:

15. The apparatus of claim 14, wherein the interpolation sub-module is further configured to calculate, for each interpolated frame, an operation time of the interpolated frame using a linear interpolation algorithm based on the first time interval, the second sampling time interval, and the operation time of two sampling frames closest to the interpolated frame;

The frame inserting sub-module is further used for ending the frame inserting and calculating the total frame number of the inserted frame if the sum of the starting time of the inserted frame and the running time of the inserted frame is greater than or equal to the sampling time of the last frame sampling frame;

16. The apparatus of claim 15, wherein the frame insertion sub-module is further configured to:

17. The apparatus of claim 13, wherein the first computing sub-module further comprises:

18. The apparatus of any one of claims 14 to 17, wherein the test data comprises a hardware counter value.

19. The apparatus of claim 18, wherein the computing module comprises:

20. The apparatus of claim 12, wherein the apparatus further comprises:

21. The apparatus of claim 20, wherein the sampling module comprises:

22. The apparatus of claim 12, wherein the test data for each frame of sample frames includes a memory bandwidth corresponding to each frame of sample frames in operation,

the computing module includes:

23. A simulation-based GPU performance testing apparatus, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to implement the method of any of claims 1-11 when executing instructions.

24. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 11.