CN116701143A - Performance analysis method, device, system, computing equipment and storage medium - Google Patents


Info

Publication number
CN116701143A
Authority
CN
China
Prior art keywords
performance
kernel function
graphics processor
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310671146.8A
Other languages
Chinese (zh)
Inventor
Publication of the inventor's name not requested (anonymity requested)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd filed Critical Moore Threads Technology Co Ltd
Priority to CN202310671146.8A
Publication of CN116701143A
Legal status: Pending

Classifications

    • G06F 11/3409: Recording or statistical evaluation of computer activity, e.g. of down time or of input/output operation, for performance assessment
    • G06F 11/3447: Performance evaluation by modeling
    • G06F 11/3452: Performance evaluation by statistical analysis
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present disclosure relates to a performance analysis method, apparatus, system, computing device, and storage medium. The method comprises: obtaining kernel function performance data generated when a graphics processor runs a target kernel function; determining, according to the kernel function performance data, the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function; and adding, according to the computational intensity and the actual maximum compute power, a point representing the performance of the graphics processor running the target kernel function to a roofline model. When the position of the point falls within a performance bottleneck region of the roofline model, the target kernel function has a performance bottleneck and is set as a kernel function to be optimized. By determining whether the target kernel function is bottlenecked through roofline-model analysis, the performance analysis method ensures the accuracy of the performance analysis result while making it convenient to determine a performance optimization approach from that result.

Description

Performance analysis method, device, system, computing equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a performance analysis method, apparatus, system, computing device, and storage medium.
Background
In recent years, research on artificial intelligence (AI) and its deployment have been expanding rapidly, and AI applications are entering people's daily lives. AI applications require extensive matrix operations, so graphics processing units (GPUs), which excel at matrix operations, are typically used to accelerate the AI models these applications use. The performance of a GPU running an AI model is closely related to the performance of the AI software stack. The deep neural network (DNN) library in the AI software stack is a library of deep neural network primitives that helps the GPU accelerate, providing highly optimized implementations of standard routines (such as forward and backward convolution, pooling layers, normalization, and activation layers). The kernel function (kernel) of an operator in the DNN library is a data-parallel program executed on the GPU, and its performance largely determines the performance of the DNN library, and hence of the AI software stack and the AI model. To make these kernel functions run with high performance, it is necessary to analyze whether they have performance bottlenecks and then selectively tune those that do.
Performance analysis methods in the prior art focus more on the model level and the operator level and cannot directly determine performance at the kernel function level. How to accurately determine the performance of a kernel function run by a graphics processor has therefore become a research focus in the art.
Disclosure of Invention
In view of this, the present disclosure provides a performance analysis method, apparatus, system, computing device, and storage medium. The performance analysis method determines whether the target kernel function has a performance bottleneck by means of roofline-model analysis, ensuring the accuracy of the performance analysis result while making it convenient to determine a performance optimization approach from that result.
According to a first aspect of the present disclosure, there is provided a performance analysis method for analyzing whether a target kernel function run by a graphics processor needs to be optimized, the method comprising: obtaining kernel function performance data generated when the graphics processor runs the target kernel function; determining, according to the kernel function performance data, the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function, wherein the computational intensity represents the number of floating point operations completed per unit of memory exchange when the target kernel function runs, and the actual compute power represents the throughput of the target kernel function; and adding, according to the computational intensity and the actual maximum compute power, a point representing the performance of the graphics processor running the target kernel function to a roofline model, wherein when the position of the point falls within a performance bottleneck region of the roofline model, the target kernel function has a performance bottleneck and is set as a kernel function to be optimized.
In one possible implementation, the horizontal axis of the roofline model is the computational intensity and the vertical axis is the compute power, and before adding the point representing the performance of the graphics processor running the target kernel function to the roofline model, the method further comprises: determining a first fold line in the roofline model according to the theoretical maximum compute power and the theoretical maximum bandwidth of the graphics processor; and determining the area between the first fold line and the horizontal axis as the performance bottleneck region.
In one possible implementation, the performance bottleneck region includes a bandwidth bottleneck region and a computation bottleneck region, and the method further comprises: determining a second fold line in the roofline model according to the computational intensity and the actual maximum compute power when the graphics processor runs a reference kernel function, the second fold line lying below the first fold line; when the position of the point falls in the bandwidth bottleneck region below the second fold line, optimizing the memory access function called by the function to be optimized; when the position of the point falls in the computation bottleneck region below the second fold line, optimizing the computation function called by the function to be optimized; and when the position of the point falls in the bandwidth bottleneck region above the second fold line, optimizing the computation method used when the function to be optimized runs, the transmission time of its input data, and the transmission time of its output data.
In one possible implementation, the kernel function performance data includes one or more of an index, a name, a running time, a memory access amount, a computation amount, and a bandwidth of the target kernel function.
In one possible implementation, obtaining the kernel function performance data when the graphics processor runs the target kernel function includes: acquiring the input matrix size of the target kernel function and the block size used for parallel computation in the graphics processor; determining the memory access amount and the computation amount according to the input matrix size and the block size; determining the bandwidth according to the memory access amount and the running time; and storing one or more of the index, the name, the running time, the computation amount, the memory access amount, and the bandwidth as the kernel function performance data.
In one possible implementation, the computation amount represents the number of floating point operations completed when the graphics processor runs the target kernel function, and the memory access amount represents the amount of memory exchanged for a single input sample when the graphics processor runs the target kernel function; determining the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function according to the kernel function performance data includes: determining the computational intensity when the graphics processor runs the target kernel function according to the ratio of the computation amount to the memory access amount; and determining the actual maximum compute power when the graphics processor runs the target kernel function according to the ratio of the computation amount to the running time.
According to a second aspect of the present disclosure, there is provided a performance analysis method applied to a computing device on which a physical graphics processor and a virtual container are disposed, the method comprising: receiving configuration information from a cloud platform, the configuration information being uploaded by user equipment and indicating a model to be analyzed and the model's operating parameters; cloning a code repository file from the cloud platform, the code repository file indicating the manner in which the computing device analyzes the performance of the model running on the graphics processor and a preset data structure; downloading the resources required to run the model from the cloud platform in the manner indicated by the code repository file, running the model on the graphics processor based on those resources, and using the container to acquire raw performance data of the model run and analyze it to obtain performance display data conforming to the preset data structure; and uploading the performance display data to the cloud platform, where it is downloaded by the user equipment and displayed to the user. A kernel function called by the model to be analyzed while running is taken as a target kernel function; the raw performance data includes kernel function performance data generated when the graphics processor runs the target kernel function; the performance display data includes the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function, where the computational intensity represents the number of floating point operations completed per unit of memory exchange when the target kernel function runs, and the actual compute power represents the throughput of the target kernel function. The performance display data is displayed as a point in a roofline model, the point representing the performance of the graphics processor running the target kernel function; when the position of the point falls within a performance bottleneck region of the roofline model, the target kernel function has a performance bottleneck and is set as a kernel function to be optimized.
In one possible implementation, using the container to acquire the raw performance data of the model run and analyzing it to obtain the performance display data includes: obtaining the performance display data according to the performance analysis method described in the first aspect of the present disclosure or any one of its possible implementations.
According to a third aspect of the present disclosure, there is provided a performance analysis system for analyzing whether a target kernel function run by a graphics processor needs to be optimized, the system comprising: a first acquisition module configured to obtain kernel function performance data generated when the graphics processor runs the target kernel function; a first determination module configured to determine, according to the kernel function performance data, the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function, where the computational intensity represents the number of floating point operations completed per unit of memory exchange when the target kernel function runs, and the actual compute power represents the throughput of the target kernel function; and a second determination module configured to add, according to the computational intensity and the actual maximum compute power, a point representing the performance of the graphics processor running the target kernel function to a roofline model, where, when the position of the point falls within a performance bottleneck region of the roofline model, the target kernel function has a performance bottleneck and is set as the kernel function to be optimized.
The first acquisition module and the first determination module may be arranged on the same device, and the second determination module may be arranged on the same device as, or on a different device from, the first acquisition module. When they are arranged on the same device, that device may be the computing device described above. When they are arranged on different devices, the first acquisition module and the first determination module may be arranged on the computing device, and the second determination module on the user equipment.
In one possible implementation, the horizontal axis of the roofline model is the computational intensity and the vertical axis is the compute power, and the system further comprises: a third determination module configured to determine a first fold line in the roofline model according to the theoretical maximum compute power and the theoretical maximum bandwidth of the graphics processor, and to determine the area between the first fold line and the horizontal axis as the performance bottleneck region.
The third determination module may be arranged in the same location as the second determination module.
In one possible implementation, the performance bottleneck region includes a bandwidth bottleneck region and a computation bottleneck region, and the system further comprises: a fourth determination module configured to determine a second fold line in the roofline model according to the computational intensity and the actual maximum compute power when the graphics processor runs a reference kernel function, the second fold line lying below the first fold line; and an optimization module configured to: optimize the memory access function called by the function to be optimized when the position of the point falls in the bandwidth bottleneck region below the second fold line; optimize the computation function called by the function to be optimized when the position of the point falls in the computation bottleneck region below the second fold line; and optimize the computation method used when the function to be optimized runs, the transmission time of its input data, and the transmission time of its output data when the position of the point falls in the bandwidth bottleneck region above the second fold line.
The fourth determination module and the optimization module may be arranged in the same location as the second determination module.
In one possible implementation, the kernel function performance data includes one or more of an index, a name, a running time, a memory access amount, a computation amount, and a bandwidth of the target kernel function.
In one possible implementation, the first acquisition module is specifically configured to: acquire the input matrix size of the target kernel function and the block size used for parallel computation in the graphics processor; determine the memory access amount and the computation amount according to the input matrix size and the block size; determine the bandwidth according to the memory access amount and the running time; and store one or more of the index, the name, the running time, the computation amount, the memory access amount, and the bandwidth as the kernel function performance data.
In one possible implementation, the computation amount represents the number of floating point operations completed when the graphics processor runs the target kernel function, and the memory access amount represents the amount of memory exchanged for a single input sample when the graphics processor runs the target kernel function. The first determination module is specifically configured to: determine the computational intensity when the graphics processor runs the target kernel function according to the ratio of the computation amount to the memory access amount; and determine the actual maximum compute power when the graphics processor runs the target kernel function according to the ratio of the computation amount to the running time.
According to a fourth aspect of the present disclosure, there is provided a performance analysis apparatus applied to a computing device on which a physical graphics processor and a virtual container are disposed, the apparatus comprising: a first receiving unit configured to receive configuration information from a cloud platform, the configuration information being uploaded by user equipment and indicating a model to be analyzed and the model's operating parameters; a first copying unit configured to clone a code repository file from the cloud platform, the code repository file indicating the manner in which the computing device analyzes the performance of the model running on the graphics processor and a preset data structure; a first downloading unit configured to download the resources required to run the model from the cloud platform in the manner indicated by the code repository file, run the model on the graphics processor based on those resources, and use the container to acquire raw performance data of the model run and analyze it to obtain performance display data conforming to the preset data structure; and a first uploading unit configured to upload the performance display data to the cloud platform, where it is downloaded by the user equipment and displayed to the user. A kernel function called by the model to be analyzed while running is taken as a target kernel function; the raw performance data includes kernel function performance data generated when the graphics processor runs the target kernel function; the performance display data includes the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function, where the computational intensity represents the number of floating point operations completed per unit of memory exchange when the target kernel function runs, and the actual compute power represents the throughput of the target kernel function. The performance display data is displayed as a point in a roofline model, the point representing the performance of the graphics processor running the target kernel function; when the position of the point falls within a performance bottleneck region of the roofline model, the target kernel function has a performance bottleneck and is set as a kernel function to be optimized.
According to a fifth aspect of the present disclosure, there is provided a computing device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions when executed by a processor implement the above-described method.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which, when run in a processor of an electronic device, causes the processor to perform the above method.
According to the performance analysis method of the embodiments of the present disclosure, kernel function performance data generated when the graphics processor runs the target kernel function is obtained, and from it the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function are determined, where the computational intensity represents the number of floating point operations completed per unit of memory exchange when the target kernel function runs, and the actual compute power represents the throughput of the target kernel function. The performance of the target kernel function can thus be quantified in the same dimensions as the roofline model, and a point representing the performance of the graphics processor running the target kernel function can be added to the roofline model according to the computational intensity and the actual maximum compute power. When the position of the point falls within a performance bottleneck region of the roofline model, the target kernel function has a performance bottleneck; the performance of the target kernel function under the limits of a specific graphics processor is analyzed at the kernel function level, ensuring the accuracy of the analysis result. Setting the target kernel function as a kernel function to be optimized provides assistance in determining whether it needs to be optimized. The performance analysis method of the embodiments of the present disclosure therefore ensures the accuracy of the performance analysis result while making it convenient to determine a performance optimization approach from that result.
According to the performance analysis method of the embodiments of the present disclosure, configuration information uploaded by the user equipment is received from the cloud platform; because the configuration information indicates the model to be analyzed and the model's operating parameters, the user's performance analysis requirements can be determined. A code repository file is cloned from the cloud platform; because it indicates the manner in which the computing device analyzes the performance of the model running on the graphics processor and a preset data structure, the computing device gains the ability to analyze that performance. The resources required to run the model are downloaded from the cloud platform in the manner indicated by the code repository file, the model is run on the graphics processor based on those resources, and the container is used to acquire the raw performance data of the model run and analyze it into performance display data; the performance display data is thus an analysis result that meets the user's performance analysis requirements, and because it conforms to the preset data structure, the data processing is standardized and structured. The performance display data is uploaded to the cloud platform, from which the user equipment downloads it and displays it to the user, so the user can view it at any time. A kernel function called by the model while running is taken as the target kernel function; the raw performance data includes kernel function performance data generated when the graphics processor runs the target kernel function, and the performance display data includes the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function, where the computational intensity represents the number of floating point operations completed per unit of memory exchange and the actual compute power represents the throughput of the target kernel function; the performance display data obtained by the computing device therefore indicates performance at the kernel function level, with higher accuracy. The performance display data is displayed as a point in the roofline model representing the performance of the graphics processor running the target kernel function; when the position of the point falls within a performance bottleneck region of the roofline model, the target kernel function has a performance bottleneck and is set as a kernel function to be optimized, so whether the target kernel function has a performance bottleneck can be shown intuitively, helping to determine whether it needs to be optimized. The performance analysis method of the embodiments of the present disclosure therefore ensures the accuracy of the performance analysis result while making it convenient to determine a performance optimization approach from that result. For the user, one-click, automated performance analysis is achieved simply by submitting the configuration information, improving the user experience.
Moreover, because the performance analysis result is structured and standardized, determining a performance optimization approach based on it is all the more convenient.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 illustrates an exemplary application scenario of a performance analysis method according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a flow of a performance analysis method according to an embodiment of the disclosure.
Fig. 3 illustrates an example of a roofline model according to an embodiment of the disclosure.
Fig. 4 shows a schematic diagram of a flow of a performance analysis method according to an embodiment of the disclosure.
Fig. 5 illustrates an example of a presentation effect of performance presentation data according to an embodiment of the present disclosure.
Fig. 6 illustrates an example of a presentation effect of performance presentation data according to an embodiment of the present disclosure.
Fig. 7 illustrates an example of a presentation effect of performance presentation data according to an embodiment of the present disclosure.
Fig. 8 illustrates an example of a presentation effect of performance presentation data according to an embodiment of the present disclosure.
Fig. 9 shows an example of a presentation effect of performance presentation data according to an embodiment of the present disclosure.
Fig. 10 illustrates an example of a presentation effect of performance presentation data according to an embodiment of the present disclosure.
Fig. 11 shows a schematic diagram of the structure of a performance analysis system according to an embodiment of the present disclosure.
Fig. 12 is a schematic view showing the structure of a performance analysis apparatus according to an embodiment of the present disclosure.
Fig. 13 shows a schematic structural diagram of an apparatus 1900 according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
Fig. 1 illustrates an exemplary application scenario of a performance analysis method according to an embodiment of the present disclosure.
As shown in Fig. 1, the application scenario may include a computing device, a storage device, and user equipment in a performance analysis system, which can communicate with one another through a cloud platform. A certain type of graphics processor may be provided on the computing device, and the graphics processor may have the capability to run a model; the model may call kernel functions in a DNN library, i.e., running the model corresponds to the graphics processor running kernel functions. The input data used when a kernel function runs may be stored on the storage device: the input data is read from the storage device when the kernel function runs, and the output data is stored to the storage device after the kernel function has run.
Taking any kernel function in the DNN library as the target kernel function, the performance analysis system executes the performance analysis method of the embodiments of the present disclosure: the computing device acquires and analyzes kernel function performance data generated when the graphics processor runs the target kernel function, obtaining the computational intensity and the actual maximum compute power (i.e., the practically attainable maximum throughput) when the graphics processor runs the target kernel function, as described below. The data analyzed by the computing device may be stored to the storage device.
The user equipment may obtain the data analyzed by the computing device from the storage device, and from the obtained data may determine the position of a point in the roofline model representing the performance of the target kernel function. The roofline model includes a performance bottleneck region, and whether the target kernel function needs to be optimized can be judged by whether the position of the point lies in the performance bottleneck region. An example of a roofline model is described below in connection with Fig. 3.
Fig. 2 shows a schematic diagram of a flow of a performance analysis method according to an embodiment of the disclosure.
In one possible implementation, the method is used to analyze whether a target kernel function run by the graphics processor needs to be optimized, and includes steps S21 to S23:
step S21, kernel function performance data of the graphics processor when running the target kernel function is obtained. Illustratively, the kernel performance data may include one or more of an index, a name, a running time, an access amount, a calculation amount, and a bandwidth of the kernel function directly acquired during the execution of the kernel function by the graphics processor, where the index, the name, and the running time may be directly acquired, and a determination manner of the access amount, the calculation amount, and the bandwidth may be described in further detail below in step S21. Further, the kernel performance data may further include more parameters related to the kernel, such as an input size of the kernel, etc., and the present disclosure is not limited to the specific kind included in the kernel performance data.
Step S22: determine, according to the kernel function performance data, the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function, where the computational intensity represents the number of floating point operations completed per unit of memory exchange when the target kernel function runs, and the actual compute power represents the throughput of the target kernel function.
The roofline model of the embodiments of the present disclosure may be a two-dimensional image, with the computational intensity and the compute power as its horizontal and vertical axes, respectively. Step S22 further analyzes the kernel function performance data and quantifies the performance of the target kernel function into the dimensions of the roofline model's horizontal and vertical axes. For the specific analysis, see the further description of step S22 below.
Step S23: add, according to the computational intensity and the actual maximum compute power, a point representing the performance of the graphics processor running the target kernel function to the roofline model; when the position of the point falls within a performance bottleneck region of the roofline model, the target kernel function has a performance bottleneck and is set as the kernel function to be optimized.
Fig. 3 illustrates an example of a roofline model according to an embodiment of the disclosure. As shown in Fig. 3, the horizontal axis of the roofline model is the computational intensity and the vertical axis is the compute power. The abscissa of the point representing the performance of the graphics processor running the target kernel function (e.g., point P_0) is determined from the computational intensity when the graphics processor runs the target kernel function, and its ordinate is determined from the actual maximum compute power when the graphics processor runs the target kernel function, which fixes the position of point P_0.
The roofline model may include a performance bottleneck region, such as the area between fold line L1 and the horizontal axis in Fig. 3. When the position of point P_0 falls within the performance bottleneck region, the target kernel function has a performance bottleneck and is set as the kernel function to be optimized.
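The bottleneck test that step S23 performs can be stated compactly: the attainable compute power at a given computational intensity is the minimum of the compute roof and the bandwidth roof, and a point below that value lies in the performance bottleneck region. A minimal C++ sketch follows; the function names are illustrative, and C and D stand for the theoretical maximum compute power and theoretical maximum bandwidth of the graphics processor introduced in the detailed description further below.

```cpp
#include <algorithm>

// Attainable compute power under the first fold line (the "roof") of the
// roofline model: bandwidth-limited left of the turning point I_max = C / D,
// compute-limited right of it.
// C: theoretical maximum compute power (GFLOP/s);
// D: theoretical maximum bandwidth (GB/s);
// intensity: computational intensity (FLOP/Byte).
inline double roof(double intensity, double C, double D) {
    return std::min(C, D * intensity);
}

// Point P_0 = (intensity, actual maximum compute power) falls in the
// performance bottleneck region when it lies below the roof, i.e. the
// target kernel function does not reach the attainable performance.
inline bool in_bottleneck_region(double intensity, double actual_peak,
                                 double C, double D) {
    return actual_peak < roof(intensity, C, D);
}
```

For example, with C = 15200 GFLOP/s and D = 448 GB/s (the example values used later in this description), a kernel at intensity 10 FLOP/Byte reaching 3000 GFLOP/s satisfies in_bottleneck_region(10, 3000, 15200, 448), since the roof there is 4480 GFLOP/s.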
Steps S21 and S22 may be performed by the computing device mentioned in Fig. 1, and step S23 by the user equipment mentioned in Fig. 1. Those skilled in the art will understand that step S23 may also be performed by the computing device, with the two-dimensional image of the roofline model transmitted to the user equipment for display; the present disclosure does not limit the specific implementation of step S23.
According to the performance analysis method of the embodiments of the present disclosure, kernel function performance data generated when the graphics processor runs the target kernel function is obtained, and from it the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function are determined, where the computational intensity represents the number of floating point operations completed per unit of memory exchange when the target kernel function runs, and the actual compute power represents the throughput of the target kernel function. The performance of the target kernel function can thus be quantified in the same dimensions as the roofline model, and a point representing the performance of the graphics processor running the target kernel function can be added to the roofline model according to the computational intensity and the actual maximum compute power. When the position of the point falls within a performance bottleneck region of the roofline model, the target kernel function has a performance bottleneck; the performance of the target kernel function under the limits of a specific graphics processor is analyzed at the kernel function level, ensuring the accuracy of the analysis result. Setting the target kernel function as a kernel function to be optimized provides assistance in determining whether it needs to be optimized. The performance analysis method of the embodiments of the present disclosure therefore ensures the accuracy of the performance analysis result while making it convenient to determine a performance optimization approach from that result.
In one possible implementation, step S21 includes:
acquiring the input matrix size of the target kernel function and the block size used for parallel computation in the graphics processor;
determining the memory access amount and the computation amount according to the input matrix size and the block size;
determining the throughput according to the computation amount and the running time;
determining the bandwidth according to the memory access amount and the running time;
storing one or more of the index, name, running time, computation amount, memory access amount, throughput, and bandwidth as the kernel function performance data.
For example, to implement step S21, a corresponding class may first be designed in the DNN library, containing member variables such as the kernel function name, the running time (in ns), the memory access amount, and the computation amount. A function a and a function b may be designed such that function a determines the throughput and the bandwidth from the running time, the memory access amount, and the computation amount, and function b stores the kernel function performance data of the current target kernel function together with the kernel function performance data already obtained for other kernel functions. The process of executing functions a and b may be wrapped in a macro definition; by invoking the defined macro after each kernel function is executed, the graphics processor executes the macro after running the target kernel function, thereby implementing step S21.
When executing function a, the input matrix size of the target kernel function and the block size used for parallel computation in the graphics processor may first be acquired, and the memory access amount and the computation amount determined from them. The throughput can be determined from the ratio of the computation amount to the running time, and the bandwidth from the ratio of the memory access amount to the running time.
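A minimal C++ sketch of this instrumentation is given below. The names KernelPerfRecord, funcA, funcB, and DNN_PROFILE_KERNEL are hypothetical placeholders for the class, function a, function b, and the macro described above; they are not the actual identifiers of any DNN library.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record class mirroring the member variables described above.
struct KernelPerfRecord {
    int           index = 0;
    std::string   name;
    std::uint64_t runtime_ns = 0; // running time, in ns
    double        bytes = 0.0;    // memory access amount, in Bytes
    double        flops = 0.0;    // computation amount, in FLOPs
    double        throughput = 0.0; // GFLOP/s, derived by funcA
    double        bandwidth = 0.0;  // GB/s, derived by funcA

    // "function a": derive throughput and bandwidth from the running time,
    // the computation amount, and the memory access amount.
    void funcA() {
        const double seconds = static_cast<double>(runtime_ns) * 1e-9;
        if (seconds > 0.0) {
            throughput = flops / seconds / 1e9; // GFLOP/s
            bandwidth  = bytes / seconds / 1e9; // GB/s
        }
    }
};

// "function b": store the current record together with the records already
// collected for other kernel functions (a real implementation might write
// them to a relational database such as TiDB).
inline void funcB(std::vector<KernelPerfRecord>& records, KernelPerfRecord rec) {
    records.push_back(std::move(rec));
}

// Macro wrapping functions a and b, invoked after each kernel execution.
#define DNN_PROFILE_KERNEL(records, rec) \
    do { (rec).funcA(); funcB((records), std::move(rec)); } while (0)
```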
When executing function b, one or more of the index, name, running time, computation amount, memory access amount, throughput, and bandwidth may be stored as the kernel function performance data. In this process, the numeric parameters (e.g., running time, computation amount, memory access amount, throughput, bandwidth) may be normalized, for example uniformly rounded to 5 decimal places, with the time units converted to milliseconds (ms). The relational database TiDB may be used to store the kernel function performance data; see Table 1 for a specific data structure.
TABLE 1
Field                   Type and constraints
Table identifier        Unsigned integer, not null
Index                   Unsigned integer, not null
Name                    Variable-length string (up to 128 characters), not null
Running time            Unsigned integer
Bandwidth               Floating point
Throughput              Floating point
Computation amount      Floating point
Memory access amount    Floating point
...                     ...
The index, name, running time, computation amount, memory access amount, throughput, and bandwidth are all performance data at the kernel function level, so determining whether the target kernel function has a performance bottleneck from these data is more accurate.
In one possible implementation, the computation amount represents the number of floating point operations completed when the graphics processor runs the target kernel function, and the memory access amount represents the amount of memory exchanged for a single input sample when the graphics processor runs the target kernel function.
Step S22 includes:
determining the computational intensity when the graphics processor runs the target kernel function according to the ratio of the computation amount to the memory access amount;
determining the actual maximum compute power when the graphics processor runs the target kernel function according to the throughput.
For example, the computation amount of the target kernel function may represent the number of floating point operations completed by the graphics processor when running it, in FLOPs. The memory access amount of the target kernel function represents the amount of memory exchanged for a single input sample when the graphics processor runs it, in Bytes. The ratio of the computation amount to the memory access amount therefore represents the computational intensity of the target kernel function, i.e., the number of floating point operations completed per unit of memory exchange when it runs, in FLOP/Byte.
The throughput (in GFLOP/s) is determined from the ratio of the computation amount to the running time, so the maximum throughput can be used directly to represent the actual maximum compute power.
In this way, the computational intensity and the actual maximum compute power when the graphics processor runs the target kernel function can be determined. The ratio of the actual maximum compute power to the theoretical maximum compute power of the graphics processor is the compute efficiency of the graphics processor; when the compute efficiency is low, there is room for optimization.
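In other words, both roofline coordinates reduce to simple ratios over the stored kernel function performance data. A sketch of the arithmetic follows, with illustrative names and under the assumption that throughput values from several runs have been collected:

```cpp
#include <algorithm>
#include <vector>

// Roofline coordinates of a kernel, as determined in step S22.
struct RooflinePoint {
    double intensity;   // FLOP/Byte: computation amount / memory access amount
    double actual_peak; // GFLOP/s:   maximum observed throughput
};

// flops: computation amount (FLOPs); bytes: memory access amount (Bytes);
// throughputs: throughput values (GFLOP/s) observed over repeated runs.
inline RooflinePoint to_roofline_point(double flops, double bytes,
                                       const std::vector<double>& throughputs) {
    RooflinePoint p{};
    p.intensity = flops / bytes;
    p.actual_peak = *std::max_element(throughputs.begin(), throughputs.end());
    return p;
}

// Compute efficiency: the ratio of the actual maximum compute power to the
// theoretical maximum compute power of the graphics processor.
inline double compute_efficiency(const RooflinePoint& p, double theoretical_peak) {
    return p.actual_peak / theoretical_peak;
}
```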
An exemplary method of determining the performance bottleneck region in the roofline model is described below.
In one possible implementation, the horizontal axis of the roofline model is the computational intensity and the vertical axis is the compute power.
Before step S23, the method further comprises:
determining a first fold line in the roofline model according to the theoretical maximum compute power and the theoretical maximum bandwidth of the graphics processor;
determining the area between the first fold line and the horizontal axis as the performance bottleneck region.
For example, the horizontal axis of the roofline model of the present disclosure is the computational intensity and the vertical axis is the compute power; the model mainly describes how much actual compute power a given kernel function can reach under the limits of a specific computing platform (graphics processor). More specifically, the roofline model answers the question: what is the actual maximum compute power E_max that a kernel function with computation amount A and memory access amount B can reach when run on a graphics processor with theoretical maximum compute power C and theoretical maximum bandwidth D?
The theoretical maximum compute power C is the number of floating point operations the graphics processor can complete per second at full capacity, in GFLOP/s; for example, C may be 15200 GFLOP/s. The theoretical maximum bandwidth D is the amount of memory exchange the graphics processor can complete per second at full capacity, in GB/s; for example, D may be 448 GB/s. The computational intensity upper limit of the graphics processor is therefore I_max = C / D, i.e., the number of floating point operations the graphics processor can complete per unit of memory exchange at full capacity, in FLOP/Byte. When the computational intensity of the graphics processor reaches its upper limit I_max, the actual maximum compute power necessarily reaches its upper limit E_max; that is, the actual maximum compute power E_max of the graphics processor equals the actual compute power attained at the computational intensity upper limit I_max. This yields the point P_max(I_max, E_max), whose abscissa is the computational intensity I_max and whose ordinate is the actual maximum compute power E_max.
Selecting a computational intensity I_1 smaller than I_max and taking the actual maximum compute power E_1 that the graphics processor attains at computational intensity I_1 yields the point P_1(I_1, E_1), whose abscissa is I_1 and whose ordinate is E_1.
Selecting a computational intensity I_2 greater than I_max and taking the actual maximum compute power E_2 that the graphics processor attains at computational intensity I_2 yields the point P_2(I_2, E_2), whose abscissa is I_2 and whose ordinate is E_2.
From the three points P_max(I_max, E_max), P_1(I_1, E_1), and P_2(I_2, E_2), the first fold line in the roofline model can be determined, as shown by fold line L1 in Fig. 3, with P_max(I_max, E_max) as the turning point. The area between the first fold line and the horizontal axis can be determined as the performance bottleneck region. Points located in this performance bottleneck region can all, in theory, be optimized. In theory no point falls in the area between the first fold line and the vertical axis, because that would mean the actual performance of the target kernel function exceeds the actual maximum compute power and computational intensity upper limit of the graphics processor.
In this way, the performance bottleneck region in the roofline model can be determined, and when the point representing the performance of the graphics processor running the target kernel function is added to the roofline model, the gap between the actual running performance of a single kernel function and its theoretical performance can be seen intuitively. Because the performance bottleneck region is determined from the theoretical maximum compute power and theoretical maximum bandwidth of the graphics processor, it can be determined in advance, saving performance analysis time and improving performance analysis efficiency.
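Since the first fold line depends only on the theoretical maximum compute power C and the theoretical maximum bandwidth D, its defining points can be precomputed for drawing. The following sketch reflects the construction above; the choices of I_1 and I_2 relative to I_max are arbitrary illustrations:

```cpp
#include <array>
#include <utility>

// The three points P_1, P_max, P_2 that determine the first fold line L1.
// C: theoretical maximum compute power (GFLOP/s);
// D: theoretical maximum bandwidth (GB/s).
inline std::array<std::pair<double, double>, 3>
first_fold_line_points(double C, double D) {
    const double i_max = C / D;    // turning-point intensity I_max (FLOP/Byte)
    const double i1 = 0.5 * i_max; // any I_1 below the turning point
    const double i2 = 2.0 * i_max; // any I_2 above the turning point
    return {{{i1, D * i1},  // P_1: on the sloped, bandwidth-bound segment
             {i_max, C},    // P_max: the turning point
             {i2, C}}};     // P_2: on the flat, compute-bound segment
}
```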
When actually running kernel functions, it is often difficult for the graphics processor to reach the theoretical maximum compute power and theoretical maximum bandwidth. Therefore, to achieve the best optimization effect, target kernel functions with larger optimization space can be selected preferentially for optimization. Furthermore, a target kernel function may hit a performance bottleneck because it is limited by the bandwidth of the graphics processor, or because it is limited by the compute power of the graphics processor, and different optimization approaches may be chosen for the two cases. Several exemplary optimization approaches are described below.
In one possible implementation, the performance bottleneck region includes a bandwidth bottleneck region and a computation bottleneck region, and the method further includes:
determining a second fold line in the roofline model according to the computational intensity and the actual maximum compute power when the graphics processor runs a reference kernel function, the second fold line lying below the first fold line;
when the position of the point falls in the bandwidth bottleneck region below the second fold line, optimizing the memory access function called by the function to be optimized;
when the position of the point falls in the computation bottleneck region below the second fold line, optimizing the computation function called by the function to be optimized;
when the position of the point falls in the bandwidth bottleneck region above the second fold line, optimizing the computation method used when the function to be optimized runs, the transmission time of its input data, and the transmission time of its output data.
For example, a reference kernel function may be preset (since kernel function performance is affected more strongly by bandwidth, a computation-intensive kernel function may be selected, for example), and the actual compute power and actual bandwidth of the graphics processor when running the reference kernel function may be recorded. The ratio of the actual maximum compute power to the actual maximum bandwidth is the actual computational intensity upper limit I_max' of the graphics processor. The compute power reached at the actual computational intensity upper limit I_max' is the actual maximum compute power of the graphics processor.
Multiplying the actual maximum compute power by a preset weight parameter (e.g., 0.8) gives the compute power E_max', and thus the point P_max'(I_max', E_max') with abscissa I_max' and ordinate E_max'.
Multiplying the actual maximum compute power attained by the graphics processor when running the reference kernel function at computational intensity I_1 by the preset weight parameter (e.g., 0.8) gives the compute power E_11, and thus the point P_11(I_1, E_11) with abscissa I_1 and ordinate E_11.
Multiplying the actual maximum compute power attained by the graphics processor when running the reference kernel function at computational intensity I_2 by the preset weight parameter (e.g., 0.8) gives the compute power E_22, and thus the point P_22(I_2, E_22) with abscissa I_2 and ordinate E_22.
From the three points P_max'(I_max', E_max'), P_11(I_1, E_11), and P_22(I_2, E_22), the second fold line in the roofline model can be determined, as shown by fold line L2 in Fig. 3, with P_max'(I_max', E_max') as the turning point. The second fold line lies below the first fold line. Points located in the region between the second fold line and the first fold line can be regarded as points with little optimization space, while points located in the region between the second fold line and the horizontal axis can be regarded as points with large optimization space. Accordingly, the kernel functions to be optimized corresponding to points below the second fold line may be optimized preferentially.
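A sketch of this construction follows, assuming the preset weight parameter 0.8 and illustrative names; under this construction the sloped segment of L2 has slope equal to the weighted actual maximum bandwidth:

```cpp
#include <algorithm>

// Second fold line L2, built from the actual maximum compute power and the
// actual maximum bandwidth measured while running the reference kernel
// function, scaled by a preset weight parameter (e.g. 0.8).
struct SecondFoldLine {
    double turn_intensity; // I_max' = actual peak compute / actual peak bandwidth
    double flat_level;     // E_max' = weight * actual peak compute
};

inline SecondFoldLine second_fold_line(double actual_peak_compute_gflops,
                                       double actual_peak_bandwidth_gbs,
                                       double weight = 0.8) {
    SecondFoldLine l2{};
    l2.turn_intensity = actual_peak_compute_gflops / actual_peak_bandwidth_gbs;
    l2.flat_level = weight * actual_peak_compute_gflops;
    return l2;
}

// Height of L2 at a given computational intensity: the sloped segment has
// slope flat_level / turn_intensity (the weighted actual bandwidth), and
// the flat segment is capped at E_max'.
inline double l2_at(const SecondFoldLine& l2, double intensity) {
    const double slope = l2.flat_level / l2.turn_intensity;
    return std::min(l2.flat_level, slope * intensity);
}
```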
The data associated with the first fold line and the second fold line (theoretical maximum compute power, theoretical maximum bandwidth, actual maximum compute power, actual maximum bandwidth) may be stored in a relational database on the storage device, using the data structure shown in Table 2. After the user equipment obtains the data in Table 2, it can determine the first fold line (and the second fold line) described above and thus draw the roofline model.
TABLE 2

Data item | Data structure
Graphics processor name | Fixed-length string (up to 20 characters), must not be empty
Theoretical calculated force maximum | Floating point, must not be empty
Theoretical bandwidth maximum | Floating point, must not be empty
Actual calculated force maximum | Floating point, must not be empty
Actual bandwidth maximum | Floating point, must not be empty
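As a hedged illustration, the Table 2 structure could be persisted to a relational database as follows; the database file, table and column names are assumptions made for this sketch.

import sqlite3

conn = sqlite3.connect("roofline.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS roofline_metadata (
        gpu_name          VARCHAR(20) NOT NULL,  -- graphics processor name, fixed length
        theoretical_flops REAL        NOT NULL,  -- theoretical calculated force maximum
        theoretical_bw    REAL        NOT NULL,  -- theoretical bandwidth maximum
        actual_flops      REAL        NOT NULL,  -- actual calculated force maximum
        actual_bw         REAL        NOT NULL   -- actual bandwidth maximum
    )
    """
)
conn.execute(
    "INSERT INTO roofline_metadata VALUES (?, ?, ?, ?, ?)",
    ("gpu-1", 10e12, 1e12, 8e12, 0.9e12),
)
conn.commit()
conn.close()

The user equipment can then read these five fields back and reconstruct both fold lines before drawing the roof line model.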
The performance bottleneck region may include a bandwidth bottleneck region and a computation bottleneck region, as shown in fig. 3. The region between the line connecting P_max(I_max, E_max) and P_1(I_1, E_1) and the horizontal axis may be the bandwidth bottleneck region; the performance of a kernel function corresponding to a point in this region is limited by the bandwidth of the graphics processor. That is, when the calculation intensity of the kernel function is smaller than the calculation intensity upper limit of the graphics processor, the actual calculated force of the kernel function is limited by the bandwidth available to it. The region between the line connecting P_max(I_max, E_max) and P_2(I_2, E_2) and the horizontal axis may be the computation bottleneck region; the performance of a kernel function corresponding to a point in this region is limited by the actual calculated force of the graphics processor. That is, when the calculation intensity of the kernel function exceeds the calculation intensity upper limit of the graphics processor, the actual calculated force of the kernel function can at most reach the actual calculated force maximum of the computing platform. When the position of point P_0 falls in the bandwidth bottleneck region and lies below the first fold line and above the second fold line, the objective kernel function corresponding to point P_0 may be considered to have good memory access bandwidth utilization. When the position of point P_0 falls in the computation bottleneck region and lies below the first fold line and above the second fold line, the objective kernel function corresponding to point P_0 may be considered to have good data reuse and data locality.
Running a kernel function involves reading and storing data as well as computation, and therefore requires calls to memory access functions and computation functions. Based on this, when the position of point P_0 falls in the bandwidth bottleneck region and lies below the second fold line, the memory access function called by the function to be optimized may be optimized, since the memory access function is more strongly limited by the bandwidth of the graphics processor; when the position of point P_0 falls in the computation bottleneck region and lies below the second fold line, the computation function called by the function to be optimized may be optimized, since the computation function is more strongly limited by the actual calculated force of the graphics processor.
When the position of point P_0 falls in the bandwidth bottleneck region and lies above the second fold line, the corresponding objective kernel function (the function to be optimized) has little optimization space, and its actual calculated force maximum is low. If it is still desired to raise the actual calculated force maximum of the objective kernel function, an optimization mode capable of improving the calculation intensity may be selected: the computation method used when the function to be optimized runs, the transmission time of its input data, and the transmission time of its output data are optimized. For example, measures may be taken to increase the locality of graphics processor memory accesses, increase the cache hit rate, or improve the data structures and data types used.
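The decision logic above can be condensed into a small sketch; the function below is illustrative only (the region test, parameter names and returned suggestions are assumptions), and second_line stands for any callable evaluating the second fold line, such as the second_fold_line sketch given earlier.

def suggest_optimization(i, e, i_upper, second_line):
    # i, e: calculation intensity and calculated force of the point P_0
    # i_upper: calculation intensity upper limit of the graphics processor
    region = "bandwidth bottleneck" if i < i_upper else "computation bottleneck"
    if e >= second_line(i):
        if region == "bandwidth bottleneck":
            # little headroom left: raise the calculation intensity itself
            return ("optimize the computation method and the transmission time "
                    "of input and output data; improve locality and cache hit rate")
        return "good data reuse and data locality; little optimization space"
    # below the second fold line: larger optimization space
    if region == "bandwidth bottleneck":
        return "optimize the memory access functions called by the kernel function"
    return "optimize the computation functions called by the kernel function"

For example, with functools.partial(second_fold_line, actual_flops_max=8e12, actual_bw_max=0.9e12) passed as second_line, a point whose intensity lies below I_max' and whose calculated force falls under the fold line would be steered toward memory access optimization.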
In this way, the flexibility and diversity of the means for optimizing the function to be optimized are improved, the user can be helped to determine the optimization direction, and the performance analysis capability is improved.
In one possible implementation, the embodiment of the disclosure further proposes a performance analysis method, and fig. 4 shows a schematic diagram of a flow of the performance analysis method according to the embodiment of the disclosure.
As shown in fig. 4, the method is applied to a computing device on which a physical graphics processor and a virtual container are provided, and includes steps S41-S44:
step S41, receiving configuration information from a cloud platform, wherein the configuration information is uploaded by user equipment and indicates a model to be analyzed and operation parameters of the model;
step S42, copying a code warehouse file from the cloud platform, wherein the code warehouse file indicates a mode of analyzing the performance of the model running on the graphic processor by the computing equipment and a preset data structure;
step S43, downloading resources required for running the model from the cloud platform in the manner indicated by the code repository file, running the model using the graphics processor based on the resources, acquiring performance raw data of the model run using the container, and analyzing the performance raw data to obtain performance display data, wherein the performance display data conforms to the preset data structure;
step S44, uploading the performance display data to the cloud platform, wherein the performance display data is downloaded from the cloud platform by the user equipment and displayed to the user;
wherein a kernel function called by the model to be analyzed when the model is run is taken as an objective kernel function; the performance raw data includes kernel function performance data when the graphics processor runs the objective kernel function; the performance display data includes the calculation intensity and the actual calculated force maximum when the graphics processor runs the objective kernel function, the calculation intensity representing the number of floating point operations completed per unit of memory exchange when the objective kernel function runs, and the actual calculated force representing the throughput of the objective kernel function; the performance display data is displayed as a point in a roof line model, the point representing the performance of the graphics processor running the objective kernel function; when the position of the point falls in a performance bottleneck region in the roof line model, the objective kernel function is indicated to have a performance bottleneck and is set as a kernel function to be optimized.
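Purely as an orientation aid, steps S41-S44 can be sketched as a single control flow; every helper object and method name below (cloud, gpu, container and their calls) is a hypothetical placeholder, not an interface defined by this disclosure.

def performance_analysis(cloud, gpu, container):
    config = cloud.receive_configuration()                  # S41: model to be analyzed + operating parameters
    repo = cloud.clone_code_repository()                    # S42: analysis manner + preset data structure
    resources = cloud.download_resources(config, repo)      # S43: model / library / dataset files
    run = gpu.run_model(config, resources)                  # S43: run the model on the graphics processor
    raw = container.collect_raw_data(run)                   # S43: kernel function performance data
    display = container.analyze(raw, repo.data_structure)   # S43: calculation intensity + actual calculated force maximum
    cloud.upload(display)                                   # S44: user equipment downloads and plots the roof line model
    return display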
An example of the configuration information is described below.
The model to be analyzed may be any model in the field of artificial intelligence, for example, a face recognition model of visual direction, etc.
The configuration information may include the model name and the acquisition path of the model file, so that the model to be analyzed can be uniquely determined from the configuration information. In addition, different models may have specific operating parameters that need to be set before running, such as the input dimensions of the model (the dimensions of each sample input into the model), the model weights, and so on. These may be entered by the user and indicate under which specific parameters the user wishes to analyze the performance of the model.
The configuration information may also include a batch size list, where the batch size represents the number of samples fed into the model at one time. Models are generally run with various batch sizes in a production environment, and the dimensions of the matrix operations corresponding to different batch sizes differ, so the batch size is a key parameter affecting model performance. The batch size list may include a number of user-specified values (which may be in a multiple relationship, e.g., 2 times, 4 times, etc.), indicating the batch sizes at which the user wishes to analyze the performance of the model.
It is also permissible for the user-entered batch size list to contain values exceeding what the graphics processor supports, because the computing device can recognize and handle this problem based on the prior art, so that the batch size actually used when running the model is one supported by the graphics processor hardware. The specific implementation of how the excessive values in the batch size list are identified and processed is not described here.
When the model is run, a framework in the software stack of the graphics processor is called. The current mainstream deep learning frameworks include PyTorch, TensorFlow, PaddlePaddle and the like; the behavior of these frameworks differs greatly and has different effects on the running of the model, so the framework may be specified by the user. Accordingly, the configuration information may also include a framework name identifying the selected framework. To adapt to the demands of different application scenarios, a framework may have multiple branches, and the deep neural network library may likewise have multiple branches; the configuration information may therefore also include an identifier of the framework branch and an identifier of the deep neural network library branch, to indicate under which framework branch and which deep neural network library branch the user wants to run the model. In addition to existing branches, new branches submitted by users are also supported.
It will be appreciated that the configuration information may include less or more information than in the above examples, as long as the model to be analyzed and the operating parameters of the model can be determined from it; the disclosure does not limit the specific content of the configuration information.
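One possible shape of the configuration information is sketched below as a Python dict; all field names and values are illustrative assumptions, not a format mandated by this disclosure.

config = {
    "model_name": "face_recognition_v1",
    "model_path": "s3://models/face_recognition_v1",  # acquisition path (hypothetical)
    "input_shape": [3, 224, 224],                     # dimensions of each input sample
    "weights": "pretrained.ckpt",                     # model weights
    "batch_sizes": [1, 2, 4, 8, 16],                  # batch size list (multiple relationship)
    "framework": "PyTorch",                           # selected framework name
    "framework_branch": "main",                       # framework branch identifier
    "dnn_library_branch": "dev",                      # deep neural network library branch identifier
}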
The computing device and the user device may be, respectively, the computing device and the user device described in fig. 1 above. The performance analysis method of the present disclosure, when executed by a computing device, may be seen as having five phases: a primary cloning phase, a checking phase, a downloading phase, a secondary cloning phase and an analysis phase. Step S42 corresponds to the primary cloning phase. In step S42, the computing device copies a code repository file from the cloud platform. The code repository file may be a file written by a user in advance and stored in any storage device (the storage device described in fig. 1 above) through the cloud platform; it may instruct the computing device in the manner of analyzing the performance of the model running on the graphics processor and in a preset data structure. Examples of the manner and the data structure may be found in the further description of step S43 below. The code repository may include multiple pieces of code, each piece of code corresponding to one or more steps of the computing device analyzing the performance of the model running on the graphics processor (see step S43). The preset data structure may be a data structure convenient to display in chart form, and may be embodied in the code corresponding to one or more of the steps.
In one possible implementation, in step S43, obtaining the performance raw data of the model run using the container and analyzing the performance raw data to obtain the performance display data includes:
obtaining the performance display data according to steps S21 and S22 of the performance analysis method described above.
Step S43 corresponds to a checking phase, a downloading phase, a secondary cloning phase, and an analyzing phase. In step S43, according to the manner indicated by the code repository file, the computing device may download resources required for running the model from the cloud platform and run the model based on the resources. The kernel function called by the model to be analyzed when being run can be used as the target kernel function.
The computing device may also be provided with a virtual container (e.g., a docker container), which the computing device may use to obtain the performance raw data of the model run and analyze the performance raw data to obtain the performance display data. The performance raw data includes kernel function performance data when the graphics processor runs the objective kernel function; the performance display data includes the calculation intensity and the actual calculated force maximum when the graphics processor runs the objective kernel function, the calculation intensity representing the number of floating point operations completed per unit of memory exchange when the objective kernel function runs, and the actual calculated force representing the throughput of the objective kernel function. That is, in step S43 the computing device may complete the work of steps S21 and S22 described above. Because the container provides isolation, the data acquisition and data analysis operations do not interfere with the running of the model.
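A hedged sketch of launching the profiling work inside an isolated container is shown below; the image name, mount path and profiling script are hypothetical, and only the standard docker run flags are assumed to exist.

import subprocess

def profile_in_container(workdir: str) -> None:
    # Run data acquisition and analysis inside a container so that it is
    # isolated from the model run itself.
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{workdir}:/workspace",          # share model output + results directory
            "profiler-image:latest",                 # hypothetical container image
            "python", "/workspace/collect_and_analyze.py",
        ],
        check=True,
    )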
The performance presentation data may be data conforming to a preset data structure indicated by the code repository, in which case the performance presentation data may be conveniently presented in a graphical form.
In step S44, the computing device may upload the performance display data to the cloud platform and, further, may store it through the cloud platform to any storage device, for example in a relational database on the storage device. The user equipment may access the relational database of the storage device through the cloud platform to download the performance display data, and use a visualization tool to present it in chart form according to the user's viewing needs (i.e., step S23 described above), so that the user can view it and determine the manner of optimizing the model performance. When the performance display data includes the calculation intensity and the actual calculated force maximum when the graphics processor runs the objective kernel function, it can be displayed as a point in the roof line model representing the performance of the graphics processor running the objective kernel function; when the position of the point falls in a performance bottleneck region in the roof line model, the objective kernel function is indicated to have a performance bottleneck and is set as a kernel function to be optimized. Examples of the roof line model have been described above and are not repeated here.
The user's viewing needs may be a comparison of performance of different kernel functions running on the same graphics processor, or a comparison of performance of different graphics processors for the same kernel function, etc. Since the performance presentation data has been structured, the corresponding data can be conveniently found and presented. Fig. 5-10 illustrate examples of presentation effects of performance presentation data according to embodiments of the present disclosure.
Fig. 5 and 6 are the roof line models when graphics processor 1 and graphics processor 2 run the same model, respectively. The kernel functions represented by the points in the circles are the same. It can be seen that when these kernel functions are run by graphics processor 1, they fall in the bandwidth bottleneck region and some of them lie below the second fold line, so optimizing the memory access functions may be considered to improve their performance; when run by graphics processor 2, they lie above the second fold line, indicating that these kernel functions have already achieved good performance.
Fig. 7 and 8 are a schematic diagram of the kernel calculation efficiency and a roof line model, respectively, when the graphics processor 2 runs the model. The kernel function corresponding to the computational efficiency in the block of fig. 7 is the same as the kernel function represented by the points in the block of fig. 8. As can be seen from fig. 7, this portion of the kernel function is relatively computationally inefficient. But as seen in connection with fig. 8, the performance of this portion of the kernel function is above the second fold line, and thus better performance has been achieved.
Fig. 9 and 10 are a schematic diagram and a roof line model, respectively, of kernel execution time consumption when the graphics processor 2 runs the model. The kernel function corresponding to the execution time consumption in the block of fig. 9 is the same as the kernel function represented by the point in the block of fig. 10. As can be seen from fig. 9, this part of the kernel function is relatively time-consuming to execute. But as seen in connection with fig. 10, the performance of this portion of the kernel function is above the second fold line, and thus better performance has been achieved.
After the user determines the manner of optimizing the model performance, the configuration information may be re-given and output to the computing device through the cloud platform. In this case, the computing device may re-execute steps S41-S44 until the user determines from the performance exposure data that the performance of the model meets the requirements.
Before the resources required for running the model are downloaded from the cloud platform in step S43, it may be checked whether the container and the graphics processor have running environments meeting the conditions; if so, the resources required for running the model are downloaded from the cloud platform. The resources required to run the model may be one or more of a model file, a library file, a dataset file, and the like. The model file may include the code of the model; the library file may include functions necessary for running any model, parameters that do not require user definition, and so on; the dataset file may include the data input to the model. For example, when the model is a neural network model for face recognition, the dataset may include the input data fed to the kernel function of interest. It will be appreciated that the resources required to run the model may also include more data, which is not limited by the present disclosure.
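The checking and downloading phases might look like the following sketch; the environment probes and the cloud helper calls are assumptions made purely for illustration.

import shutil

def check_and_download(cloud, config):
    # Check that a container runtime is available on the computing device.
    if shutil.which("docker") is None:
        raise RuntimeError("container runtime is not available")
    # Hypothetical probe: does the graphics processor environment meet the conditions?
    if not cloud.gpu_environment_ok(config):
        raise RuntimeError("graphics processor running environment does not meet the conditions")
    # Download the model file, library file, dataset file, and so on.
    return cloud.download_resources(config)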
In step S43, before the model is run using the graphics processor based on the resources, the deep neural network library file and the framework file related to the model may be copied from the cloud platform, the deep neural network library file compiled, and the framework file installed; after the deep neural network library file has been compiled and the framework file installed, the model is run using the graphics processor based on the resources.
Multiple levels of the graphics processor software stack may include a driver layer, a compute layer, a framework layer, a model layer, an application layer, and the like. It will be appreciated that in step S43, the computing device may also obtain other kinds of performance raw data besides kernel performance data, such as trace data, model end-to-end performance data, and so on, and analyze and obtain other kinds of performance presentation data besides computation intensity and actual computation force maximum, such as trace data, model end-to-end performance data, running environment data, and so on, which conform to a preset data structure. The trace data may indicate the performance of the framework layer and the model end-to-end performance data may indicate the performance of the model layer such that a user may observe the performance of the graphics processor running model from multiple dimensions while the performance presentation data is presented. The present disclosure is not limited to a specific type of performance raw data and performance presentation data, as long as both are related to the performance of the model.
According to the performance analysis method described above: configuration information is received from the cloud platform, the configuration information being uploaded by the user equipment and indicating the model to be analyzed and the operating parameters of the model, so that the performance analysis requirements of the user can be determined. A code repository file is copied from the cloud platform, the code repository file indicating the manner in which the computing device analyzes the performance of the model running on the graphics processor and a preset data structure, so that the computing device has the capability of analyzing that performance. In the manner indicated by the code repository file, the resources required for running the model are downloaded from the cloud platform, the model is run using the graphics processor based on the resources, and the performance raw data of the model run is acquired using the container and analyzed to obtain performance display data; the performance display data can thus be a performance analysis result meeting the user's performance analysis requirements, and since it conforms to the preset data structure, standardized and structured data processing is realized. By uploading the performance display data to the cloud platform, from which the user equipment downloads and displays it, the user can view the performance display data at any time.
A kernel function called by the model to be analyzed when the model is run is taken as the objective kernel function; the performance raw data includes kernel function performance data when the graphics processor runs the objective kernel function, and the performance display data includes the calculation intensity and the actual calculated force maximum when the graphics processor runs the objective kernel function, where the calculation intensity represents the number of floating point operations completed per unit of memory exchange and the actual calculated force represents the throughput of the objective kernel function. The performance display data obtained by the computing device can therefore indicate performance at the kernel function level, with higher accuracy. The performance display data is displayed as a point in the roof line model representing the performance of the graphics processor running the objective kernel function; when the position of the point falls in a performance bottleneck region in the roof line model, the objective kernel function is indicated to have a performance bottleneck and is set as a kernel function to be optimized. Whether the objective kernel function has a performance bottleneck can thus be displayed intuitively, helping to determine whether it needs to be optimized. In this way, the performance analysis method of the embodiments of the present disclosure ensures the accuracy of the performance analysis result and makes it convenient to determine the performance optimization mode from that result. For the user, one-click, automated performance analysis can be realized simply by submitting configuration information, improving the user experience.
Moreover, since the performance analysis result is structured and standardized, it is more convenient to determine the performance optimization mode based on it.
Fig. 11 shows a schematic diagram of the structure of a performance analysis system according to an embodiment of the present disclosure.
In one possible implementation, the disclosure further proposes a performance analysis system, as shown in fig. 11, for analyzing whether an objective kernel function executed by a graphics processor needs to be optimized, where the system includes:
a first obtaining module 101, configured to obtain kernel performance data when the graphics processor runs the target kernel;
a first determining module 102, configured to determine, according to the kernel performance data, a computation strength and an actual computation power maximum value when the graphics processor runs the target kernel, where the computation strength represents a number of floating point operations completed per unit memory swap when the target kernel runs, and the actual computation power represents a throughput of the target kernel;
a second determining module 103, configured to add a point representing the performance of the graphics processor running the objective kernel function in a roof line model according to the calculated intensity and the actual calculated force maximum value, indicate that the objective kernel function has a performance bottleneck when the position of the point falls in a performance bottleneck area in the roof line model, and set the objective kernel function as a kernel function to be optimized.
The first obtaining module and the first determining module may be arranged on the same device; the second determining module may be arranged on the same device as, or on a different device from, the first obtaining module and the first determining module. When they are on the same device, that device may be the computing device described above. When they are on different devices, the first obtaining module and the first determining module may be arranged on the computing device, and the second determining module on the user equipment.
In one possible implementation, the horizontal axis of the roof line model is the computational intensity, and the vertical axis is the computational force, the system further comprising: a third determining module, configured to determine a first fold line in the roof line model according to a theoretical calculation force maximum and a theoretical bandwidth maximum of the graphics processor; and determining the area between the first fold line and the transverse axis as the performance bottleneck area.
The setting position of the third determination module may be the same as that of the second determination module.
In one possible implementation, the performance bottleneck region includes a bandwidth bottleneck region and a computation bottleneck region, and the system further includes:
a fourth determining module, configured to determine a second fold line in the roof line model according to a computation strength and an actual computation force maximum value when the graphics processor runs a reference kernel function, where the second fold line is below the first fold line;
The optimizing module is used for optimizing the memory function called by the function to be optimized when the position of the point falls in the bandwidth bottleneck area and is located below the second folding line; when the position of the point falls in the calculation bottleneck area and is positioned below the second folding line, optimizing the calculation function called by the function to be optimized; and when the position of the point is in the bandwidth bottleneck area and above the second folding line, optimizing a calculation method used when the function to be optimized runs, the transmission time of input data when the function to be optimized runs and the transmission time of output data when the function to be optimized runs.
The setting positions of the fourth determining module and the optimizing module may be the same as those of the second determining module.
In one possible implementation, the kernel function performance data includes one or more of the index, name, running time consumption, memory access amount, calculation amount and bandwidth of the objective kernel function.
In one possible implementation, the first obtaining module is specifically configured to: acquire the input matrix size of the objective kernel function and the block size of parallel computation in the graphics processor; determine the memory access amount and the calculation amount according to the input matrix size and the block size; determine the bandwidth according to the memory access amount and the running time consumption; and store one or more of the index, the name, the running time consumption, the calculation amount, the memory access amount and the bandwidth as the kernel function performance data.
In one possible implementation, the calculation amount represents the number of floating point operations completed when the graphics processor runs the objective kernel function, and the memory access amount represents the amount of memory exchange completed for a single input sample when the graphics processor runs the objective kernel function. The first determining module is specifically configured to: determine the calculation intensity of the graphics processor when running the objective kernel function according to the ratio of the calculation amount to the memory access amount; and determine the actual calculated force maximum when the graphics processor runs the objective kernel function according to the ratio of the calculation amount to the running time consumption.
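A worked sketch of these two ratios is given below, using a matrix multiply C = A @ B with A of size (M, K) and B of size (K, N) as the illustrative kernel function; the byte accounting assumes float32 operands and is an illustration, not the accounting mandated by this disclosure.

def kernel_metrics(m, k, n, runtime_s, dtype_bytes=4):
    calculation_amount = 2.0 * m * k * n                         # floating point operations completed
    memory_access = (m * k + k * n + m * n) * dtype_bytes        # bytes exchanged with memory
    calculation_intensity = calculation_amount / memory_access   # FLOPs per byte of memory exchange
    actual_calculated_force = calculation_amount / runtime_s     # throughput, FLOP/s
    bandwidth = memory_access / runtime_s                        # bytes per second
    return calculation_intensity, actual_calculated_force, bandwidth

# Example: a 1024 x 1024 x 1024 matrix multiply finishing in 2 ms.
print(kernel_metrics(1024, 1024, 1024, 2e-3))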
Fig. 12 is a schematic view showing the structure of a performance analysis apparatus according to an embodiment of the present disclosure.
In one possible implementation, the disclosure further proposes a performance analysis apparatus applied to a computing device on which a physical graphics processor and a virtual container are disposed, the apparatus comprising:
a first receiving unit 111, configured to receive configuration information from a cloud platform, where the configuration information is uploaded by a user device, and the configuration information indicates a model to be analyzed and an operation parameter of the model;
A first copying unit 112 for copying a code repository file from the cloud platform, the code repository file indicating a manner in which the computing device analyzes the performance of the model running on the graphics processor and a preset data structure;
a first downloading unit 113, configured to download, from the cloud platform, resources required for running the model according to the manner indicated by the code repository file, run the model using the graphics processor based on the resources, obtain performance raw data run by the model using the container, and analyze the performance raw data to obtain performance display data, where the performance display data conforms to the preset data structure;
a first uploading unit 114, configured to upload the performance display data to the cloud platform, where the performance display data is downloaded from the cloud platform by the user device and displayed to a user;
wherein a kernel function called by the model to be analyzed when the model is run is taken as an objective kernel function; the performance raw data includes kernel function performance data when the graphics processor runs the objective kernel function; the performance display data includes the calculation intensity and the actual calculated force maximum when the graphics processor runs the objective kernel function, the calculation intensity representing the number of floating point operations completed per unit of memory exchange when the objective kernel function runs, and the actual calculated force representing the throughput of the objective kernel function; the performance display data is displayed as a point in a roof line model, the point representing the performance of the graphics processor running the objective kernel function; when the position of the point falls in a performance bottleneck region in the roof line model, the objective kernel function is indicated to have a performance bottleneck and is set as a kernel function to be optimized.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The disclosed embodiments also propose a computing device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions stored by the memory.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
Fig. 13 shows a schematic structural diagram of an apparatus 1900 according to an embodiment of the disclosure. For example, the apparatus 1900 may be provided as a computing device or user device or storage device as described above. Referring to fig. 13, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The apparatus 1900 may further comprise a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output interface 1958 (I/O interface). The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of apparatus 1900 to perform the above-described methods.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanical coding devices such as punch cards or in-groove structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present disclosure can be assembly instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of computer readable program instructions, which can execute the computer readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A performance analysis method for analyzing whether an objective kernel function executed by a graphics processor needs to be optimized, the method comprising:
obtaining kernel function performance data when the graphics processor runs the target kernel function;
determining the maximum value of calculation intensity and actual calculation force when the graphics processor runs the target kernel function according to the kernel function performance data, wherein the calculation intensity represents the number of floating point operations completed by each unit memory exchange when the target kernel function runs, and the actual calculation force represents the throughput of the target kernel function;
And adding a point representing the performance of the graphics processor running the objective kernel function in a roof line model according to the calculated intensity and the actual calculated force maximum value, indicating that the objective kernel function has a performance bottleneck when the position of the point falls in a performance bottleneck area in the roof line model, and setting the objective kernel function as a kernel function to be optimized.
2. The method of claim 1, wherein the roof line model has a calculated intensity on the horizontal axis and a calculated force on the vertical axis,
before adding the point representing the performance of the graphics processor running the objective kernel in the roof line model, the method further comprises:
determining a first fold line in the roof line model according to the theoretical calculated force maximum and the theoretical bandwidth maximum of the graphics processor;
and determining the area between the first fold line and the transverse axis as the performance bottleneck area.
3. The method of claim 2, wherein the performance bottleneck region comprises a bandwidth bottleneck region and a computation bottleneck region, the method further comprising:
determining a second fold line in the roof line model according to the calculation intensity and the actual calculated force maximum when the graphics processor runs a reference kernel function, wherein the second fold line is below the first fold line;
When the position of the point falls in the bandwidth bottleneck area and is positioned below the second folding line, optimizing the memory access function called by the function to be optimized;
when the position of the point falls in the calculation bottleneck area and is positioned below the second folding line, optimizing the calculation function called by the function to be optimized;
and when the position of the point is in the bandwidth bottleneck area and above the second folding line, optimizing a calculation method used when the function to be optimized runs, the transmission time of input data when the function to be optimized runs and the transmission time of output data when the function to be optimized runs.
4. A method according to any of claims 1-3, wherein the kernel performance data comprises one or more of an index, a name, a run time, a memory size, a computation size, a bandwidth of the objective kernel.
5. The method of claim 4, wherein the obtaining kernel performance data of the graphics processor while running the target kernel comprises:
acquiring the input matrix size of the objective kernel function and the block size of parallel computation in the graphics processor;
Determining the access quantity and the calculated quantity according to the input matrix size and the partition size;
determining the bandwidth according to the access quantity and the operation time consumption;
one or more of the index, the name, the runtime consumption, the computational effort, the memory amount, the bandwidth are stored as the kernel function performance data.
6. The method of claim 4 or 5, wherein the calculated amount represents a number of floating point operations completed when the graphics processor runs the target kernel, the accessed amount represents an amount of memory swap completed for a single input sample when the graphics processor runs the target kernel,
the determining the computing intensity and the actual computing force maximum value when the graphics processor runs the target kernel function according to the kernel function performance data comprises the following steps:
determining the calculation intensity of the graphics processor when running the objective kernel function according to the ratio of the calculation amount to the access amount;
and determining the actual calculated force maximum value when the graphics processor runs the objective kernel function according to the ratio of the calculated quantity to the running time consumption.
7. A performance analysis method, the method being applied to a computing device having a physical graphics processor and a virtual container disposed thereon, the method comprising:
Receiving configuration information from a cloud platform, wherein the configuration information is uploaded by user equipment and indicates a model to be analyzed and operation parameters of the model;
copying a code repository file from the cloud platform, the code repository file indicating a manner in which the computing device analyzes performance of the model running on the graphics processor and a preset data structure;
downloading resources required for running the model from the cloud platform according to the mode indicated by the code warehouse file, running the model by using the graphic processor based on the resources, acquiring performance original data run by the model by using the container, and analyzing the performance original data to obtain performance display data, wherein the performance display data accords with the preset data structure;
uploading the performance display data to the cloud platform, wherein the performance display data is downloaded from the cloud platform by the user equipment and displayed to a user;
wherein a kernel function called by the model to be analyzed when the model is run is taken as an objective kernel function; the performance raw data includes kernel function performance data when the graphics processor runs the objective kernel function; the performance display data includes the calculation intensity and the actual calculated force maximum when the graphics processor runs the objective kernel function, the calculation intensity representing the number of floating point operations completed per unit of memory exchange when the objective kernel function runs, and the actual calculated force representing the throughput of the objective kernel function; the performance display data is displayed as a point in a roof line model, the point representing the performance of the graphics processor running the objective kernel function; when the position of the point falls in a performance bottleneck region in the roof line model, the objective kernel function is indicated to have a performance bottleneck and is set as a kernel function to be optimized.
8. The method of claim 7, wherein obtaining performance raw data for the model run using the container and analyzing the performance raw data to obtain performance presentation data comprises:
the performance display data is obtained based on the method of any one of claims 1-6.
9. A performance analysis system for analyzing whether an objective kernel function run by a graphics processor needs to be optimized, the system comprising:
the first acquisition module is used for acquiring kernel function performance data when the graphics processor runs the target kernel function;
the first determining module is used for determining the maximum value of calculation intensity and actual calculation force when the graphics processor runs the target kernel function according to the kernel function performance data, wherein the calculation intensity represents the number of floating point operations completed by per unit memory exchange when the target kernel function runs, and the actual calculation force represents the throughput of the target kernel function;
and the second determining module is used for adding a point representing the performance of the graphics processor running the target kernel function into a roof line model according to the calculated intensity and the actual calculated force maximum value, indicating that the target kernel function has a performance bottleneck when the position of the point falls in a performance bottleneck area in the roof line model, and setting the target kernel function as the kernel function to be optimized.
10. A performance analysis apparatus for application to a computing device having a physical graphics processor and a virtual container disposed thereon, the apparatus comprising:
the first receiving unit is used for receiving configuration information from the cloud platform, wherein the configuration information is uploaded by the user equipment and indicates a model to be analyzed and operation parameters of the model;
a first copying unit for copying a code repository file from the cloud platform, the code repository file indicating a manner in which the computing device analyzes performance of the model running on the graphics processor and a preset data structure;
the first downloading unit is used for downloading resources required by running the model from the cloud platform according to the mode indicated by the code warehouse file, running the model by using the graphic processor based on the resources, acquiring performance original data of the model running by using the container and analyzing the performance original data to obtain performance display data, wherein the performance display data accords with the preset data structure;
the first uploading unit is used for uploading the performance display data to the cloud platform, and the performance display data is downloaded from the cloud platform by the user equipment and displayed to a user;
wherein a kernel function called by the model to be analyzed when the model is run is taken as an objective kernel function; the performance raw data includes kernel function performance data when the graphics processor runs the objective kernel function; the performance display data includes the calculation intensity and the actual calculated force maximum when the graphics processor runs the objective kernel function, the calculation intensity representing the number of floating point operations completed per unit of memory exchange when the objective kernel function runs, and the actual calculated force representing the throughput of the objective kernel function; the performance display data is displayed as a point in a roof line model, the point representing the performance of the graphics processor running the objective kernel function; when the position of the point falls in a performance bottleneck region in the roof line model, the objective kernel function is indicated to have a performance bottleneck and is set as a kernel function to be optimized.
11. A computing device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any one of claims 1 to 9 when executing the instructions stored by the memory.
12. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 9.
CN202310671146.8A 2023-06-07 2023-06-07 Performance analysis method, device, system, computing equipment and storage medium Pending CN116701143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310671146.8A CN116701143A (en) 2023-06-07 2023-06-07 Performance analysis method, device, system, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310671146.8A CN116701143A (en) 2023-06-07 2023-06-07 Performance analysis method, device, system, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116701143A true CN116701143A (en) 2023-09-05

Family

ID=87827130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310671146.8A Pending CN116701143A (en) 2023-06-07 2023-06-07 Performance analysis method, device, system, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116701143A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117785492A (en) * 2024-02-28 2024-03-29 上海燧原智能科技有限公司 Operator segmentation method determining method, device, equipment and medium
CN117785492B (en) * 2024-02-28 2024-05-17 上海燧原智能科技有限公司 Operator segmentation method determining method, device, equipment and medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination