CN112000472A - Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium - Google Patents


Info

Publication number
CN112000472A
CN112000472A (application CN202010804248.9A)
Authority
CN
China
Prior art keywords
gpu
performance
server
file
processing function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010804248.9A
Other languages
Chinese (zh)
Other versions
CN112000472B (en)
Inventor
赵阳阳
段谊海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010804248.9A priority Critical patent/CN112000472B/en
Publication of CN112000472A publication Critical patent/CN112000472A/en
Application granted granted Critical
Publication of CN112000472B publication Critical patent/CN112000472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 9/5027: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 11/3409: Recording or statistical evaluation of computer activity, e.g. of down time or of input/output operations, for performance assessment
    • G06T 1/20: Processor architectures; processor configuration, e.g. pipelining
    • G06F 2209/501: Performance criteria
    • G06F 2209/5018: Thread allocation
    • G06F 2209/508: Monitor
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method, a device and a storage medium for tuning the performance bottleneck of a high-performance server GPU (graphics processing unit). A target computing performance is configured for the GPU and thresholds are configured for the server indexes. While the GPU executes a test case, calls to processing functions are traced, each processing function and its start and end times are recorded, and the GPU's actual computing performance is calculated from them. During execution, data for the GPU-related server indexes is collected and recorded. The server index data is compared against the index thresholds to judge whether the GPU parameter adjustment matches the server, and the actual computing performance is compared against the target to judge whether the GPU parameters need further adjustment: if the actual performance is below the target, the GPU parameters are adjusted through a parameter optimization algorithm; once the target is reached, GPU tuning ends. The method captures the influence of the server indexes on GPU computing performance and completes GPU tuning more effectively.

Description

Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium
Technical Field
The invention relates to the field of server GPU optimization, and in particular to a method, a device and a storage medium for tuning the performance bottleneck of a high-performance server GPU.
Background
With the rapid development of large-scale games, 3D technology, AI vision and the like in the computer field, GPUs have become increasingly powerful, gradually acquiring features such as programmable pipelines and high-density parallel processing; the floating-point computing capability of many GPUs now exceeds that of CPUs. It is therefore particularly important, when training or inference runs on a GPU, to find the GPU's performance bottleneck and optimize its parameters so that the hardware's performance is fully exploited.
In the prior art, a driver corresponding to the GPU is generally installed in the GPU server, and the real-time situation under GPU load is monitored through the driver to determine whether the GPU has reached a performance bottleneck while an application runs. In practice, however, this way of monitoring GPU-related indexes cannot take into account the influence of other server indexes on GPU performance, so during tuning the limiting factor behind a GPU bottleneck is often identified inaccurately.
Disclosure of Invention
The invention provides a method for tuning the performance bottleneck of a high-performance server GPU. It detects changes in GPU performance index data, automatically finds the GPU's performance bottleneck, and adjusts and optimizes the GPU parameters, so as to achieve the GPU's best operating performance, shorten GPU development and tuning time, and improve the quality of GPU applications.
The invention provides a high-performance server GPU performance bottleneck tuning method, comprising the following steps:
configuring a target computing performance for the GPU and storing it in a third file, and configuring thresholds for the server indexes and storing them in a fourth file;
executing a test case on the GPU, tracing calls to processing functions during execution, recording each processing function and its start and end times, calculating the GPU's actual computing performance from the processing functions and their start and end times, and recording it in a first file;
during execution of the test case, collecting data for the GPU-related server indexes and recording it in a second file;
based on the data in the first, second, third and fourth files, comparing the server index data against the server index thresholds and judging whether the current GPU parameters match the server;
comparing the GPU's actual computing performance with the target computing performance, and, if the actual performance is below the target, adjusting the GPU parameters through a parameter optimization algorithm.
Preferably, calculating the GPU's actual computing performance from the processing functions and their start and end times and recording it in the first file includes: obtaining the processing functions running at a given moment, querying the number of thread blocks launched by each processing function and the number of threads in each block, and computing the thread count of each processing function; then summing the thread counts of all processing functions to obtain the GPU's thread count, calculating the GPU's actual computing capacity from the thread count at that moment, and recording it in the first file.
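A minimal sketch of this per-function thread count follows; the launch-record fields and kernel names are illustrative assumptions, not part of the patent (in CUDA terms, a function's thread count is blocks launched times threads per block).

```python
from dataclasses import dataclass

@dataclass
class KernelLaunch:
    """One traced processing-function launch (fields are illustrative)."""
    name: str
    blocks: int              # thread blocks launched by this function
    threads_per_block: int   # threads in each launched block
    start_ms: float          # start time taken from the trace log
    end_ms: float            # end time taken from the trace log

def active_thread_count(launches, t_ms):
    """Sum the threads of every processing function running at time t_ms."""
    return sum(k.blocks * k.threads_per_block
               for k in launches
               if k.start_ms <= t_ms <= k.end_ms)

launches = [
    KernelLaunch("conv_fwd", 1024, 256, start_ms=0.0, end_ms=5.0),
    KernelLaunch("relu",      512, 128, start_ms=2.0, end_ms=6.0),
]
print(active_thread_count(launches, 3.0))  # 1024*256 + 512*128 = 327680
```

The per-moment sum is what would be written to the first file as the GPU's thread count at that moment.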
Preferably, collecting the GPU-related server index data and recording it in the second file includes: periodically reading the server index data stored in the corresponding files or registers, and recording both the index data and the time at which it was acquired in the second file.
Preferably, the server index data includes server CPU utilization, server memory utilization, PCIe bandwidth, and NVLink transmit/receive rate.
Preferably, a GPU parameter configuration interface is provided between the server and the GPU, and the GPU parameters are adjusted by a parameter optimization algorithm: the algorithm outputs GPU parameters to the configuration interface, and the interface applies them to the GPU.
Preferably, the GPU parameters include memory frequency, core clock frequency, maximum power limit, and compute mode.
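On NVIDIA GPUs, these four parameters map naturally onto standard `nvidia-smi` controls; the sketch below is one possible way to apply them. This is an assumption for illustration (the patent does not name a tool): the values are examples, the commands require root privileges, and flag support varies by driver and GPU model.

```shell
# Apply the four GPU parameters via nvidia-smi (values are examples only).
nvidia-smi -i 0 -lmc 5001               # lock memory clock (MHz)
nvidia-smi -i 0 -lgc 1410               # lock graphics (core) clock (MHz)
nvidia-smi -i 0 -pl 250                 # set maximum power limit (watts)
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS    # set compute mode
```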
Preferably, the parameter optimization algorithm comprises a memory frequency optimization algorithm, a core clock frequency optimization algorithm and a maximum power limit optimization algorithm; the memory frequency optimization algorithm outputs a memory frequency parameter, the core clock frequency optimization algorithm outputs a core clock frequency parameter, and the maximum power limit optimization algorithm outputs a maximum power limit parameter.
Preferably, the memory frequency optimization algorithm adjusts the memory frequency in the direction that improves GPU performance and, when GPU performance drops during this process, uses a bisection algorithm to determine the optimal memory frequency; the core clock frequency optimization algorithm adjusts the core clock frequency in the direction that improves GPU performance and, when GPU performance drops, uses a bisection algorithm to determine the optimal core clock frequency; and the maximum power limit optimization algorithm adjusts the maximum power limit in the direction that improves GPU performance and, when GPU performance drops, uses a bisection algorithm to determine the optimal maximum power limit.
The invention also provides a high-performance server GPU performance bottleneck tuning device comprising a processing unit, a storage unit, a bus unit and an interface unit, with the processing unit, the storage unit and the interface unit connected to the bus unit. The storage unit stores at least one instruction implementing GPU computing-performance acquisition and judgment, server index data acquisition and judgment, tuning of GPU performance parameters, and configuration of GPU parameters. The processing unit invokes the instruction to perform the acquisition, judgment and tuning of GPU performance parameters, and the interface unit invokes the instruction to configure the GPU parameters to the GPU.
The invention also provides a storage medium storing at least one instruction which implements GPU computing-performance acquisition and judgment, server index data acquisition and judgment, tuning of GPU performance parameters, and configuration of GPU parameters.
The method, the device and the storage medium for tuning the performance bottleneck of the GPU of the high-performance server have the following beneficial effects:
according to the method for tuning the performance bottleneck of the GPU of the high-performance server, the actual computing performance and the ideal computing performance of the GPU are analyzed by acquiring the computing performance of the GPU when the GPU runs an example, and whether a continuous optimization space exists is judged; obtaining data of relevant indexes of a server influencing GPU computing performance; analyzing the server indexes and the corresponding index threshold values, mastering the influence of the server indexes on the GPU computing performance, adjusting the tuning strategy, eliminating the influence of the server indexes, and better completing the tuning of the GPU.
Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a performance bottleneck tuning method of a GPU of a high-performance server according to an embodiment of the present invention;
FIG. 2 is an architecture diagram of a performance bottleneck tuning device of a GPU of a high-performance server according to an embodiment of the present invention;
FIG. 3 is a diagram of a system architecture configured in the apparatus of FIG. 2 in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart of optimizing GPU parameters in an embodiment of the present invention;
FIG. 5 is a schematic flow chart of eliminating the influence of server indexes according to an embodiment of the present invention.
Reference numerals in the drawings: 701, processing unit; 702, storage unit; 703, bus unit; 704, interface unit.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 2, an embodiment of the present invention provides a high-performance server GPU performance bottleneck tuning device comprising a processing unit 701, a storage unit 702, a bus unit 703 and an interface unit 704, with the processing unit 701, the storage unit 702 and the interface unit 704 connected to the bus unit 703. The storage unit 702 stores at least one instruction implementing GPU computing-performance acquisition and judgment, server index data acquisition and judgment, tuning of GPU performance parameters, and configuration of GPU parameters. The processing unit invokes the instruction to perform the acquisition, judgment and tuning of GPU performance parameters, and the interface unit invokes the instruction to configure the GPU parameters to the GPU.
An embodiment of the invention provides a storage medium storing at least one instruction which implements GPU computing-performance acquisition and judgment, server index data acquisition and judgment, tuning of GPU performance parameters, and configuration of GPU parameters.
The invention provides a high-performance server GPU performance bottleneck tuning method, comprising the following steps:
configuring a target computing performance for the GPU and storing it in a third file, and configuring thresholds for the server indexes and storing them in a fourth file;
executing a test case on the GPU, tracing calls to processing functions during execution, recording each processing function and its start and end times, calculating the GPU's actual computing performance from the processing functions and their start and end times, and recording it in a first file;
during execution of the test case, collecting data for the GPU-related server indexes and recording it in a second file;
based on the data in the first, second, third and fourth files, comparing the server index data against the server index thresholds and judging whether the current GPU parameters match the server;
comparing the GPU's actual computing performance with the target computing performance, and, if the actual performance is below the target, adjusting the GPU parameters through a parameter optimization algorithm.
Referring to fig. 3, the high-performance server GPU performance bottleneck tuning method is implemented by configuring, in the tuning device, a system comprising an execution module, a monitoring module, an analysis module and a parameter optimization module.
Referring to fig. 1, in S100, a target computing performance is configured for the GPU and stored in the third file, and thresholds for the server indexes are configured and stored in the fourth file;
S200, the execution module cooperates with the GPU to execute the test case, tracing calls to processing functions during execution, recording each processing function and its start and end times, calculating the GPU's actual computing performance from them, and recording it in the first file;
S300, during execution of the test case, the monitoring module collects data for the GPU-related server indexes and records it in the second file;
S400, the analysis module acquires the contents of the first, second, third and fourth files;
S500, the analysis module analyzes and compares the server index data with the server index thresholds and judges whether the GPU parameter adjustment matches the server. Referring to fig. 5, after the GPU parameters are adjusted, the module checks whether the change in each server index stays within its threshold range. If it does, the server indexes are consistent, meaning GPU computing performance is not limited by them, and GPU tuning continues; if a threshold is exceeded, GPU computing performance is being limited by a server index, so the test case is readjusted before tuning proceeds.
S600, referring to fig. 4, the analysis module analyzes and compares the GPU's actual computing performance with the target computing performance and judges whether the GPU parameters need further adjustment. If the actual performance is below the target, the parameter optimization module adjusts the GPU parameters through a parameter optimization algorithm; once the actual performance reaches the target, GPU tuning ends.
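Steps S100-S600 amount to a feedback loop, which can be sketched as follows. Every helper name here (`run_case`, `collect_indexes`, `optimize_params`) is a placeholder standing in for the modules described above, not an API defined by the patent.

```python
def tune(target_perf, thresholds, run_case, collect_indexes, optimize_params,
         params, max_iters=20):
    """Sketch of the S100-S600 loop: run, check indexes, check target, adjust."""
    for _ in range(max_iters):
        actual_perf = run_case(params)        # S200: run the test case
        indexes = collect_indexes()           # S300: sample server indexes
        # S500: a server index beyond its threshold means the server, not the
        # GPU parameters, is limiting performance -> readjust the test case.
        if any(indexes[k] > thresholds[k] for k in thresholds):
            return params, "server-limited"
        if actual_perf >= target_perf:        # S600: target reached, tuning ends
            return params, "done"
        params = optimize_params(params, actual_perf)
    return params, "max-iterations"

# Toy stand-ins: performance scales with core clock; each pass adds 100 MHz.
params, status = tune(
    target_perf=120, thresholds={"cpu_util": 90},
    run_case=lambda p: p["core_clock"] / 10,
    collect_indexes=lambda: {"cpu_util": 50},
    optimize_params=lambda p, perf: {**p, "core_clock": p["core_clock"] + 100},
    params={"core_clock": 1000})
print(status, params["core_clock"])  # done 1200
```

The "server-limited" exit corresponds to fig. 5: when a server index exceeds its threshold, the test case is readjusted rather than the GPU parameters.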
Referring to fig. 3, in a specific implementation, the execution module configures a cuDNN deep neural network acceleration structure to process GPU network-layer data, optimizing that processing with convolution operations and shortening image processing time. When the GPU executes the test case, the execution module configures a tracing program that monitors the storage addresses of the processing functions by instruction, thereby tracing their invocation: when a processing function at a storage address is invoked, the tracing program records that address, determines the attributes of the invoked function from it, records the function's start and end times, and stores them in a log file. The execution module reads the log file to determine which processing functions are executing at any moment, queries through a first instruction the number of thread blocks launched by each processing function and the number of threads in each block, and computes the total thread count of each function. It then sums the thread counts of all processing functions to obtain the GPU's thread count at a given moment, calculates the GPU's actual computing capacity at that moment, and records it in the first file, which is stored in a file directory specified by the server. The processing functions may be CUDA functions.
The monitoring module collects data for the GPU-related server indexes and records it in the second file. In a specific implementation, the monitoring module periodically reads, through a second instruction, the server index data stored in the corresponding files or registers, executing the second instruction cyclically with a period of 20 ms. The second instruction comprises commands for reading, from those files or registers, the CPU utilization data, the server memory utilization data, the PCIe bandwidth data and the NVLink transmit/receive rate data; the obtained CPU utilization, memory utilization, PCIe bandwidth and NVLink rate data, together with the time at which they were acquired, are recorded in the second file. The second file is stored in a file directory specified by the server.
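A minimal sketch of such a periodic sampling loop is shown below. The metric reader here is a stand-in returning fixed values, since the real second instruction reads CPU utilization, memory utilization, PCIe bandwidth and NVLink rates from driver files or registers; the record layout and names are assumptions for illustration.

```python
import json
import tempfile
import time

def sample_indexes(read_metrics, out_path, period=0.02, samples=3):
    """Append `samples` timestamped records to the 'second file', one per period."""
    records = []
    with open(out_path, "a") as f:
        for _ in range(samples):
            rec = {"time": time.time(), **read_metrics()}
            records.append(rec)
            f.write(json.dumps(rec) + "\n")   # one JSON line per sample
            time.sleep(period)                # 20 ms cycle by default
    return records

# Stand-in reader with fixed values; a real one would read the corresponding
# files/registers for these four server indexes.
fake_reader = lambda: {"cpu_util": 37.5, "mem_util": 62.1,
                       "pcie_bw_gbps": 11.8, "nvlink_rate_gbps": 46.0}
second_file = tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False).name
recs = sample_indexes(fake_reader, second_file, period=0.02, samples=3)
print(len(recs))  # 3
```

Recording the acquisition time alongside each sample is what lets the analysis module correlate server index changes with the GPU parameter adjustments.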
The third and fourth files are saved in a file directory specified by the server: the third file holds the manually set target computing performance data for the GPU, and the fourth file holds the manually set threshold data for the server indexes. In a specific implementation, interfaces for modifying the third and fourth files are provided, and the files are modified by manually entering the target computing performance data and the server index threshold data through those interfaces. The analysis module reads the contents of the first, second, third and fourth files from their directories with a read instruction. It compares the data in the second and fourth files: the actual server CPU utilization against the configured CPU utilization threshold, the actual server memory utilization against the memory utilization threshold, the actual PCIe bandwidth against the PCIe bandwidth threshold, and the actual NVLink transmit/receive rate against the NVLink rate threshold. By keeping the server's CPU utilization, memory utilization, PCIe bandwidth and NVLink rate under control, the influence of the server indexes on GPU tuning is eliminated. The analysis module also compares the data in the first and third files, comparing the GPU's actual computing performance against the target computing performance, and judges from the difference whether the GPU parameters need further adjustment. The analysis module is thus provided with the CPU utilization threshold, the PCIe bandwidth threshold and the other index thresholds.
The GPU parameters are initialized, and the GPU runs the test case in the initialized state; the GPU parameters include memory frequency, core clock frequency, maximum power limit and compute mode. When the GPU's performance in the initialized state is not fully realized, the difference between the actual computing performance analyzed by the analysis module and the target computing performance falls outside the allowed range, and the GPU passes its initialized parameters to the parameter optimization module. The parameter optimization module provides a GPU parameter configuration interface and is configured with a parameter optimization algorithm: the algorithm outputs the adjusted GPU parameters to the configuration interface, and the interface configures them to the GPU through a third instruction. The parameter optimization algorithm comprises a memory frequency optimization algorithm, a core clock frequency optimization algorithm and a maximum power limit optimization algorithm, which output the memory frequency, core clock frequency and maximum power limit parameters respectively.
Specifically, the parameter optimization module takes the GPU's initialized memory frequency as the starting value of the memory frequency optimization algorithm and sets a memory frequency adjustment step (which may be positive or negative). It makes two adjustments by this step, configures each result to the GPU in turn, and has the GPU rerun the same test case, then compares the starting value with the GPU's computing performance after each adjustment. If both adjustments perform worse than the initialized state, with the second worse than the first, the adjustment direction is wrong, and the algorithm flips the sign of the step so that the memory frequency moves in the tuning direction. Comparing the GPU's computing performance after successive adjustments: if performance drops, the optimal memory frequency lies between the last and second-to-last settings, and a bisection search finds it, after which the optimal memory frequency is configured to the GPU through the GPU parameter configuration interface; if the first of the two adjustments performs best, the optimal memory frequency lies between the starting value and the second setting, and bisection again finds it; and if performance keeps improving over the two adjustments, the algorithm continues adjusting by the original step until performance deteriorates, indicating that the optimal memory frequency lies between the last and second-to-last settings, and then finds it by bisection.
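The adjust-then-bisect search just described can be sketched as a small routine; `measure` stands in for rerunning the test case at a candidate memory frequency, and all names and values are illustrative rather than taken from the patent. The same routine applies unchanged to the core clock frequency and the maximum power limit.

```python
def find_best_setting(measure, start, step, min_step=1.0):
    """Hill-climb by `step`; on a performance drop, flip and halve the step."""
    best_x, best_y = start, measure(start)
    if measure(start + step) < best_y:   # first move made things worse:
        step = -step                     # the adjustment direction was wrong
    x = start + step
    y = measure(x)
    while abs(step) >= min_step:
        if y >= best_y:                  # improved: accept and keep stepping
            best_x, best_y = x, y
        else:                            # dropped: optimum was passed, so
            step = -step / 2             # search between the last two settings
        x = best_x + step
        y = measure(x)
    return best_x

# Toy unimodal performance curve peaking at 1500 MHz.
print(find_best_setting(lambda f: -(f - 1500) ** 2, 1000, 200, min_step=50))  # 1500.0
```

Halving and reversing the step once performance drops is one simple reading of the "bisection" refinement between the last two frequency settings; `min_step` bounds how finely the frequency is resolved.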
Similarly, the parameter optimization module takes the GPU's initialized core clock frequency as the starting value of the core clock frequency optimization algorithm and sets a core clock frequency adjustment step (which may be positive or negative). It makes two adjustments by this step, configures each result to the GPU in turn, and has the GPU rerun the same test case, then compares the starting value with the GPU's computing performance after each adjustment. If both adjustments perform worse than the initialized state, with the second worse than the first, the adjustment direction is wrong, and the algorithm flips the sign of the step so that the core clock frequency moves in the tuning direction. Comparing the GPU's computing performance after successive adjustments: if performance drops, the optimal core clock frequency lies between the last and second-to-last settings, and a bisection search finds it, after which the optimal core clock frequency is configured to the GPU through the GPU parameter configuration interface; if the first of the two adjustments performs best, the optimal core clock frequency lies between the starting value and the second setting, and bisection again finds it; and if performance keeps improving over the two adjustments, the algorithm continues adjusting by the original step until performance deteriorates, indicating that the optimal core clock frequency lies between the last and second-to-last settings, and then finds it by bisection.
Similarly, the parameter optimization module takes the GPU maximum power limit initialization parameter as the initial value of the GPU maximum power limit optimization algorithm and sets a maximum power limit adjustment amount (positive or negative). It performs two adjustments by this adjustment amount, configures the result of each adjustment to the GPU in turn, and the GPU re-runs the same calculation example; the initial value is then compared with the GPU computing performance after each of the two maximum power limit adjustments. If both adjusted results are worse than the initialized state and the second is worse than the first, the adjustment direction is wrong; the GPU maximum power limit optimization algorithm changes the sign of the adjustment amount so that the maximum power limit is adjusted in the tuning direction, and again compares the GPU computing performance after two adjustments. If the computing performance becomes worse, the optimal GPU maximum power limit lies between the last maximum power limit and the penultimate maximum power limit; the optimal maximum power limit is then found by the bisection method and configured to the GPU through the GPU parameter configuration interface. If the first of the two adjustments yields the best computing performance, the optimal GPU maximum power limit lies between the initial value and the maximum power limit set in the second adjustment, and it is likewise found by the bisection method. If the GPU computing performance keeps improving over the two adjustments, adjustment continues by the original maximum power limit adjustment amount until the computing performance deteriorates, indicating that the optimal GPU maximum power limit lies between the last maximum power limit and the penultimate maximum power limit; it is then found by the bisection method.
The GPU working mode is then switched: the calculation example is run in each working mode, and the analysis module selects a suitable GPU working mode by comparing the actual computing performance in each mode with the ideal computing performance.
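The working-mode selection step can be sketched as below. The mode names and per-mode throughput figures are fabricated for illustration only; the sketch simply keeps the mode whose measured performance comes closest to the ideal value, which is one plausible reading of "selecting a proper GPU working mode" from the comparison.

```python
# Illustrative working-mode selection: run the same calculation example in
# each candidate mode and keep the mode closest to the ideal performance.
def pick_working_mode(run_example, modes, ideal_perf):
    """Return (best_mode, measured_perf) minimizing the gap to ideal_perf."""
    results = {mode: run_example(mode) for mode in modes}
    best = min(results, key=lambda m: abs(ideal_perf - results[m]))
    return best, results[best]

# Fabricated per-mode throughput figures, for demonstration only.
fake_perf = {"default": 70.0, "exclusive_process": 92.0, "prohibited": 0.0}
mode, perf = pick_working_mode(fake_perf.get, fake_perf, ideal_perf=100.0)
```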
According to the high-performance server GPU performance bottleneck tuning method, the execution module obtains the computing performance of the GPU while a calculation example runs; the analysis module compares the actual computing performance of the GPU with the ideal computing performance and judges whether room for further optimization remains; the monitoring module collects data on the server indexes that affect GPU computing performance; and the analysis module compares the server indexes with the corresponding index thresholds, determines the influence of each server index on GPU computing performance, and adjusts the tuning strategy accordingly. If a calculation example that is not limited by the server indexes is selected for tuning, the tuning strategy excludes the influence of the server indexes and GPU tuning can be completed properly.
Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program of the above method for tuning the performance bottleneck of a GPU of a high-performance server can be stored in a computer readable storage medium, and, when executed, can include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. Which when executed by a processor performs the above-described functions defined in the methods disclosed in embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a storage medium as one or more instructions or code. Storage media includes computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of instructions or data structures and which can be accessed by a general-purpose or special-purpose computer or a general-purpose or special-purpose processor. Also, any connection is properly termed a storage medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of storage media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments disclosed in the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to suggest that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the concept of the embodiments of the invention, technical features of the above embodiment or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments exist as described above; they are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A performance bottleneck tuning method for a GPU (graphics processing unit) of a high-performance server is characterized by comprising the following steps:
configuring a set computing performance of the GPU and storing the set computing performance in a third file, and configuring thresholds of server indexes and storing the thresholds in a fourth file;
executing, by the GPU, a calculation example, tracking calls of processing functions during execution, recording each processing function and its start and end times, calculating the actual computing performance of the GPU from the processing functions and their start and end times, and recording the actual computing performance of the GPU in a first file;
collecting, during execution of the calculation example, data of the server indexes related to the GPU, and recording the data of the server indexes in a second file;
comparing, according to the first file, the second file, the third file and the fourth file, the data of the server indexes with the thresholds of the server indexes, and judging whether current GPU parameters match the server; and
comparing the actual computing performance of the GPU with the set computing performance, and, if the actual computing performance is less than the set computing performance, adjusting the GPU parameters through a parameter optimization algorithm.
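By way of illustration only (not part of the claims), the comparison and decision logic recited in claim 1 — server indexes against their thresholds, and actual against set computing performance — can be sketched as follows. The field names and the simple greater-than threshold test are assumptions of this sketch.

```python
# Illustrative claim-1 decision step: flag server indexes that exceed their
# configured thresholds, and decide whether GPU parameter tuning should run.
def needs_tuning(actual_perf, set_perf, index_data, index_thresholds):
    """Return (server_limited, tune) flags per the claim-1 comparison."""
    # An index over its threshold suggests the bottleneck is the server,
    # not the GPU parameters, so that index is flagged.
    exceeded = [k for k, v in index_data.items()
                if v > index_thresholds.get(k, float("inf"))]
    server_limited = bool(exceeded)
    # Tuning triggers only when actual performance falls short of target.
    tune = actual_perf < set_perf
    return server_limited, tune
```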
2. The method of claim 1, wherein calculating the actual computing performance of the GPU from the processing functions and their start and end times and recording it in the first file comprises: acquiring the processing functions running at a specific moment, calling the number of launched operation blocks of each processing function and the number of threads in each operation block, and calculating the thread count of each processing function; summing the thread counts of the processing functions to obtain the thread count of the GPU; calculating the actual computing capacity of the GPU from the thread count of the GPU at the specific moment; and recording the actual computing capacity of the GPU in the first file.
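As an illustrative sketch (not part of the claims) of the computation recited in claim 2: at a sampling instant, sum blocks × threads-per-block over every running processing function (kernel) to get the GPU's active thread count, then scale by a per-thread rate to estimate actual computing capacity. The linear per-thread model and its rate are assumptions of the sketch.

```python
# Claim-2 sketch: thread count per kernel, summed to a GPU thread count,
# then scaled to an estimated computing capacity.
def gpu_thread_count(kernels):
    """kernels: iterable of (num_blocks, threads_per_block) per function."""
    return sum(blocks * tpb for blocks, tpb in kernels)

def actual_capacity(kernels, flops_per_thread):
    # Hypothetical linear model: capacity grows with concurrently
    # resident threads at an assumed per-thread throughput.
    return gpu_thread_count(kernels) * flops_per_thread
```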
3. The method according to claim 1, wherein collecting and recording the data of the server indexes related to the GPU in the second file comprises: periodically calling the server index data stored in a system file or register, and recording the acquired server index data and the time of acquisition in the second file.
4. The method of claim 3, wherein the server index data includes a server CPU utilization rate, a server memory utilization rate, a PCIE bandwidth, and an NVLINK transceiving rate.
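As an illustrative sketch (not part of the claims) of the periodic collection recited in claims 3 and 4, the loop below timestamps each reading of the server indexes and appends it to the "second file". The reader callables stand in for system-file or register reads, and the JSON-lines layout is an assumption of this sketch.

```python
import json
import time

# Periodically sample each server index (e.g. CPU utilization, memory
# utilization, PCIE bandwidth, NVLINK rate), timestamp the sample, and
# append it to the second file.
def sample_indices(readers, out_path, samples=3, period=0.01):
    """Append `samples` timestamped readings of every index to out_path."""
    with open(out_path, "a") as f:
        for _ in range(samples):
            row = {"t": time.time()}
            row.update({name: read() for name, read in readers.items()})
            f.write(json.dumps(row) + "\n")
            time.sleep(period)
```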
5. The method as claimed in claim 1, wherein the GPU parameters are adjusted by a parameter optimization algorithm by setting a GPU parameter configuration interface between the server and the GPU, the parameter optimization algorithm outputs GPU parameters to the GPU parameter configuration interface, and the GPU parameter configuration interface configures the GPU parameters to the GPU.
6. The method of claim 5, wherein the GPU parameters comprise memory frequency, core clock frequency, maximum power limit, and compute mode.
7. The method according to claim 6, wherein the parameter optimization algorithm comprises a memory frequency optimization algorithm, a core clock frequency optimization algorithm, and a maximum power limit optimization algorithm; the memory frequency optimization algorithm outputs a memory frequency parameter, the core clock frequency optimization algorithm outputs a core clock frequency parameter, and the maximum power limit optimization algorithm outputs a maximum power limit parameter.
8. The method according to claim 7, wherein the memory frequency optimization algorithm adjusts the memory frequency in the direction that improves GPU performance and, when GPU performance drops during the process, determines the optimal memory frequency by a bisection algorithm; the core clock frequency optimization algorithm adjusts the core clock frequency in the direction that improves GPU performance and, when GPU performance drops during the process, determines the optimal core clock frequency by a bisection algorithm; and the maximum power limit optimization algorithm adjusts the maximum power limit in the direction that improves GPU performance and, when GPU performance drops during the process, determines the optimal maximum power limit by a bisection algorithm.
9. A device for tuning the performance bottleneck of a GPU of a high-performance server, comprising a processing unit (701), a storage unit (702), a bus unit (703) and an interface unit (704), wherein the processing unit (701), the storage unit (702) and the interface unit (704) are connected to the bus unit (703); the storage unit (702) stores at least one instruction capable of implementing acquisition and judgment of GPU computing performance, acquisition and judgment of server index data, tuning of GPU performance parameters, and configuration of GPU parameters; the processing unit calls and executes the instruction to implement the acquisition and judgment of GPU computing performance, the acquisition and judgment of server index data, and the tuning of GPU performance parameters; and the interface unit calls and executes the instruction to configure the GPU parameters to the GPU.
10. A storage medium storing at least one instruction that enables GPU computing performance acquisition and determination, server index data acquisition and determination, tuning of GPU performance parameters, and configuration of GPU parameters.
CN202010804248.9A 2020-08-11 2020-08-11 Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium Active CN112000472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010804248.9A CN112000472B (en) 2020-08-11 2020-08-11 Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010804248.9A CN112000472B (en) 2020-08-11 2020-08-11 Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium

Publications (2)

Publication Number Publication Date
CN112000472A true CN112000472A (en) 2020-11-27
CN112000472B CN112000472B (en) 2022-07-08

Family

ID=73463791

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010804248.9A Active CN112000472B (en) 2020-08-11 2020-08-11 Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium

Country Status (1)

Country Link
CN (1) CN112000472B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8410994B1 (en) * 2010-08-23 2013-04-02 Matrox Graphics Inc. System and method for remote graphics display
CN107832177A (en) * 2017-11-20 2018-03-23 郑州云海信息技术有限公司 A kind of EDP method of testings, system, equipment and the storage medium of more GPU systems
CN109558264A (en) * 2018-12-12 2019-04-02 浪潮(北京)电子信息产业有限公司 A kind of volume information method of calibration, system and the associated component of virtual volume


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535407A (en) * 2021-07-30 2021-10-22 济南浪潮数据技术有限公司 Server optimization method, system, equipment and storage medium
CN113535407B (en) * 2021-07-30 2024-03-19 济南浪潮数据技术有限公司 Optimization method, system, equipment and storage medium of server
CN113868105A (en) * 2021-08-20 2021-12-31 苏州浪潮智能科技有限公司 Self-optimization method and device for Java performance benchmark test of server
CN113868105B (en) * 2021-08-20 2023-08-08 苏州浪潮智能科技有限公司 Self-optimizing method and device for server Java performance benchmark test
CN113672468A (en) * 2021-08-24 2021-11-19 北京字节跳动网络技术有限公司 Load monitoring method and device
CN114021733A (en) * 2021-09-30 2022-02-08 苏州浪潮智能科技有限公司 Model training optimization method and device, computer equipment and storage medium
CN114021733B (en) * 2021-09-30 2023-11-14 苏州浪潮智能科技有限公司 Model training optimization method, device, computer equipment and storage medium
CN116225311A (en) * 2022-12-12 2023-06-06 荣耀终端有限公司 Configuration method, device and server for terminal equipment storage system parameters
CN116225311B (en) * 2022-12-12 2023-11-21 荣耀终端有限公司 Configuration method, device and server for terminal equipment storage system parameters

Also Published As

Publication number Publication date
CN112000472B (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN112000472B (en) Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium
CN110413255B (en) Artificial neural network adjusting method and device
CN110555450B (en) Face recognition neural network adjusting method and device
Bateni et al. Predjoule: A timing-predictable energy optimization framework for deep neural networks
CN110289994B (en) Cluster capacity adjusting method and device
CN110727685B (en) Data compression method, equipment and storage medium based on Cassandra database
WO2022166316A1 (en) Light supplementing method and apparatus for facial recognition, and facial recognition device and system therefor
CN103268204A (en) Adjusting and optimizing method and device of solid-state disk
CN113316794A (en) Data management device for supporting high-speed artificial neural network operation by data cache based on data position of artificial neural network
WO2021152849A1 (en) Data processing device and data processing program
CN117251391A (en) Link equalization method, device, equipment and storage medium
CN112469059A (en) Back-to-first service communication system, transmitting end device, medium, and signal processing method
US20170063955A1 (en) Communication method, communication device, and recording medium
WO2023272432A1 (en) Image processing method and image processing apparatus
CN112433682B (en) Method for acquiring control parameters in solid state disk, storage medium and electronic device
US11455533B2 (en) Information processing apparatus, control method, and non-transitory computer-readable storage medium for storing information processing program
CN114418059A (en) Information processing method and device
CN113128682A (en) Automatic neural network model adaptation method and device
TW202201284A (en) Automatic machine learning system performance tuning method, device, electronic device and storage medium
CN115858418B (en) Data caching method and system
KR102585838B1 (en) Method for lightweighting neural network model and electronic apparatus for performing the same
US20230095268A1 (en) Storage medium, machine learning method, and information processing apparatus
CN115696405B (en) Computing task unloading optimization method and system considering fairness
US20080205220A1 (en) Recording apparatus and recording method
US11604717B2 (en) Processor performance measurement apparatus and processor performance measurement method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant