CN112000472A - Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium - Google Patents


Info

Publication number
CN112000472A
CN112000472A (application CN202010804248.9A)
Authority
CN
China
Prior art keywords
gpu
performance
server
file
processing function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010804248.9A
Other languages
Chinese (zh)
Other versions
CN112000472B (en)
Inventor
赵阳阳
段谊海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010804248.9A priority Critical patent/CN112000472B/en
Publication of CN112000472A publication Critical patent/CN112000472A/en
Application granted granted Critical
Publication of CN112000472B publication Critical patent/CN112000472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 9/5027: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 11/3409: Recording or statistical evaluation of computer activity, e.g. of down time or of input/output operations, for performance assessment
    • G06T 1/20: Processor architectures; processor configuration, e.g. pipelining
    • G06F 2209/501: Performance criteria
    • G06F 2209/5018: Thread allocation
    • G06F 2209/508: Monitor
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method, a device and a storage medium for tuning the performance bottleneck of a high-performance server GPU (graphics processing unit). A target computing performance is configured for the GPU and thresholds are configured for the server indexes. While the GPU executes a test case, calls to processing functions are traced, each processing function and its start and end times are recorded, and the GPU's actual computing performance is calculated from them. During execution, data for the GPU-related server indexes is collected and recorded. The server index data is compared against the index thresholds to judge whether the GPU parameter adjustment matches the server, and the actual computing performance is compared against the target to judge whether the GPU parameters need further adjustment: if the actual performance is below the target, the GPU parameters are adjusted through a parameter optimization algorithm; once the target is reached, GPU tuning ends. The method captures the influence of the server indexes on GPU computing performance and completes GPU tuning more effectively.

Description

Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium
Technical Field
The invention relates to the field of server GPU optimization, and in particular to a method, a device and a storage medium for tuning the performance bottleneck of a high-performance server GPU.
Background
With the rapid development of large-scale games, 3D technology, AI vision and the like in the computer field, GPUs have become increasingly powerful, gradually acquiring features such as programmable pipelines and high-density parallel processing; the floating-point computing capability of many GPUs now exceeds that of CPUs. It is therefore particularly important, when training or inference runs on a GPU, to find the GPU's performance bottleneck and optimize its parameters so that the hardware's performance is fully exploited.
In the prior art, a driver corresponding to the GPU is generally installed in the GPU server, and the real-time situation under GPU load is monitored through the driver to determine whether the GPU has reached a performance bottleneck while an application runs. In practice, however, this way of monitoring GPU-related indexes cannot take into account the influence of other server indexes on GPU performance, so during tuning the limiting factor behind a GPU bottleneck is often identified inaccurately.
Disclosure of Invention
The invention provides a method for tuning the performance bottleneck of a high-performance server GPU. It detects changes in GPU performance index data, automatically finds the GPU's performance bottleneck, and adjusts and optimizes the GPU parameters, so as to achieve the GPU's best operating performance, shorten GPU development and tuning time, and improve the quality of GPU applications.
The invention provides a high-performance server GPU performance bottleneck tuning method, comprising the following steps:
configuring a target computing performance for the GPU and storing it in a third file, and configuring thresholds for the server indexes and storing them in a fourth file;
executing a test case on the GPU, tracing calls to processing functions during execution, recording each processing function and its start and end times, calculating the GPU's actual computing performance from the processing functions and their start and end times, and recording it in a first file;
during execution of the test case, collecting data for the GPU-related server indexes and recording it in a second file;
based on the data in the first, second, third and fourth files, comparing the server index data against the server index thresholds and judging whether the current GPU parameters match the server;
comparing the GPU's actual computing performance with the target computing performance, and, if the actual performance is below the target, adjusting the GPU parameters through a parameter optimization algorithm.
Preferably, calculating the GPU's actual computing performance from the processing functions and their start and end times and recording it in the first file includes: obtaining the processing functions running at a given moment, querying the number of thread blocks launched by each processing function and the number of threads in each block, and computing the thread count of each processing function; then summing the thread counts of all processing functions to obtain the GPU's thread count, calculating the GPU's actual computing capacity from the thread count at that moment, and recording it in the first file.
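A minimal sketch of this per-function thread count follows; the launch-record fields and kernel names are illustrative assumptions, not part of the patent (in CUDA terms, a function's thread count is blocks launched times threads per block).

```python
from dataclasses import dataclass

@dataclass
class KernelLaunch:
    """One traced processing-function launch (fields are illustrative)."""
    name: str
    blocks: int              # thread blocks launched by this function
    threads_per_block: int   # threads in each launched block
    start_ms: float          # start time taken from the trace log
    end_ms: float            # end time taken from the trace log

def active_thread_count(launches, t_ms):
    """Sum the threads of every processing function running at time t_ms."""
    return sum(k.blocks * k.threads_per_block
               for k in launches
               if k.start_ms <= t_ms <= k.end_ms)

launches = [
    KernelLaunch("conv_fwd", 1024, 256, start_ms=0.0, end_ms=5.0),
    KernelLaunch("relu",      512, 128, start_ms=2.0, end_ms=6.0),
]
print(active_thread_count(launches, 3.0))  # 1024*256 + 512*128 = 327680
```

The per-moment sum is what would be written to the first file as the GPU's thread count at that moment.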
Preferably, collecting the GPU-related server index data and recording it in the second file includes: periodically reading the server index data stored in the corresponding files or registers, and recording both the index data and the time at which it was acquired in the second file.
Preferably, the server index data includes server CPU utilization, server memory utilization, PCIe bandwidth, and NVLink transmit/receive rate.
Preferably, a GPU parameter configuration interface is provided between the server and the GPU, and the GPU parameters are adjusted by a parameter optimization algorithm: the algorithm outputs GPU parameters to the configuration interface, and the interface applies them to the GPU.
Preferably, the GPU parameters include memory frequency, core clock frequency, maximum power limit, and compute mode.
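On NVIDIA GPUs, these four parameters map naturally onto standard `nvidia-smi` controls; the sketch below is one possible way to apply them. This is an assumption for illustration (the patent does not name a tool): the values are examples, the commands require root privileges, and flag support varies by driver and GPU model.

```shell
# Apply the four GPU parameters via nvidia-smi (values are examples only).
nvidia-smi -i 0 -lmc 5001               # lock memory clock (MHz)
nvidia-smi -i 0 -lgc 1410               # lock graphics (core) clock (MHz)
nvidia-smi -i 0 -pl 250                 # set maximum power limit (watts)
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS    # set compute mode
```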
Preferably, the parameter optimization algorithm comprises a memory frequency optimization algorithm, a core clock frequency optimization algorithm and a maximum power limit optimization algorithm; the memory frequency optimization algorithm outputs a memory frequency parameter, the core clock frequency optimization algorithm outputs a core clock frequency parameter, and the maximum power limit optimization algorithm outputs a maximum power limit parameter.
Preferably, the memory frequency optimization algorithm adjusts the memory frequency in the direction that improves GPU performance and, when GPU performance drops during this process, uses a bisection algorithm to determine the optimal memory frequency; the core clock frequency optimization algorithm adjusts the core clock frequency in the direction that improves GPU performance and, when GPU performance drops, uses a bisection algorithm to determine the optimal core clock frequency; and the maximum power limit optimization algorithm adjusts the maximum power limit in the direction that improves GPU performance and, when GPU performance drops, uses a bisection algorithm to determine the optimal maximum power limit.
The invention also provides a high-performance server GPU performance bottleneck tuning device comprising a processing unit, a storage unit, a bus unit and an interface unit, with the processing unit, the storage unit and the interface unit connected to the bus unit. The storage unit stores at least one instruction implementing GPU computing-performance acquisition and judgment, server index data acquisition and judgment, tuning of GPU performance parameters, and configuration of GPU parameters. The processing unit invokes the instruction to perform the acquisition, judgment and tuning of GPU performance parameters, and the interface unit invokes the instruction to configure the GPU parameters to the GPU.
The invention also provides a storage medium storing at least one instruction which implements GPU computing-performance acquisition and judgment, server index data acquisition and judgment, tuning of GPU performance parameters, and configuration of GPU parameters.
The method, the device and the storage medium for tuning the performance bottleneck of the GPU of the high-performance server have the following beneficial effects:
according to the method for tuning the performance bottleneck of the GPU of the high-performance server, the actual computing performance and the ideal computing performance of the GPU are analyzed by acquiring the computing performance of the GPU when the GPU runs an example, and whether a continuous optimization space exists is judged; obtaining data of relevant indexes of a server influencing GPU computing performance; analyzing the server indexes and the corresponding index threshold values, mastering the influence of the server indexes on the GPU computing performance, adjusting the tuning strategy, eliminating the influence of the server indexes, and better completing the tuning of the GPU.
Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a performance bottleneck tuning method of a GPU of a high-performance server according to an embodiment of the present invention;
FIG. 2 is an architecture diagram of a performance bottleneck tuning device of a GPU of a high-performance server according to an embodiment of the present invention;
FIG. 3 is a diagram of a system architecture configured in the apparatus of FIG. 2 in accordance with an embodiment of the present invention;
FIG. 4 is a flow chart of optimizing GPU parameters in an embodiment of the present invention;
FIG. 5 is a schematic flow chart of eliminating the influence of server indexes according to an embodiment of the present invention.
Reference numerals in the drawings: 701, processing unit; 702, storage unit; 703, bus unit; 704, interface unit.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 2, an embodiment of the present invention provides a high-performance server GPU performance bottleneck tuning device comprising a processing unit 701, a storage unit 702, a bus unit 703 and an interface unit 704, with the processing unit 701, the storage unit 702 and the interface unit 704 connected to the bus unit 703. The storage unit 702 stores at least one instruction implementing GPU computing-performance acquisition and judgment, server index data acquisition and judgment, tuning of GPU performance parameters, and configuration of GPU parameters. The processing unit invokes the instruction to perform the acquisition, judgment and tuning of GPU performance parameters, and the interface unit invokes the instruction to configure the GPU parameters to the GPU.
An embodiment of the invention provides a storage medium storing at least one instruction which implements GPU computing-performance acquisition and judgment, server index data acquisition and judgment, tuning of GPU performance parameters, and configuration of GPU parameters.
The invention provides a high-performance server GPU performance bottleneck tuning method, comprising the following steps:
configuring a target computing performance for the GPU and storing it in a third file, and configuring thresholds for the server indexes and storing them in a fourth file;
executing a test case on the GPU, tracing calls to processing functions during execution, recording each processing function and its start and end times, calculating the GPU's actual computing performance from the processing functions and their start and end times, and recording it in a first file;
during execution of the test case, collecting data for the GPU-related server indexes and recording it in a second file;
based on the data in the first, second, third and fourth files, comparing the server index data against the server index thresholds and judging whether the current GPU parameters match the server;
comparing the GPU's actual computing performance with the target computing performance, and, if the actual performance is below the target, adjusting the GPU parameters through a parameter optimization algorithm.
Referring to fig. 3, the high-performance server GPU performance bottleneck tuning method is implemented by configuring, in the tuning device, a system comprising an execution module, a monitoring module, an analysis module and a parameter optimization module.
Referring to fig. 1, in S100, a target computing performance is configured for the GPU and stored in the third file, and thresholds for the server indexes are configured and stored in the fourth file;
S200, the execution module cooperates with the GPU to execute the test case, tracing calls to processing functions during execution, recording each processing function and its start and end times, calculating the GPU's actual computing performance from them, and recording it in the first file;
S300, during execution of the test case, the monitoring module collects data for the GPU-related server indexes and records it in the second file;
S400, the analysis module acquires the contents of the first, second, third and fourth files;
S500, the analysis module analyzes and compares the server index data with the server index thresholds and judges whether the GPU parameter adjustment matches the server. Referring to fig. 5, after the GPU parameters are adjusted, the module checks whether the change in each server index stays within its threshold range. If it does, the server indexes are consistent, meaning GPU computing performance is not limited by them, and GPU tuning continues; if a threshold is exceeded, GPU computing performance is being limited by a server index, so the test case is readjusted before tuning proceeds.
S600, referring to fig. 4, the analysis module analyzes and compares the GPU's actual computing performance with the target computing performance and judges whether the GPU parameters need further adjustment. If the actual performance is below the target, the parameter optimization module adjusts the GPU parameters through a parameter optimization algorithm; once the actual performance reaches the target, GPU tuning ends.
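Steps S100-S600 amount to a feedback loop, which can be sketched as follows. Every helper name here (`run_case`, `collect_indexes`, `optimize_params`) is a placeholder standing in for the modules described above, not an API defined by the patent.

```python
def tune(target_perf, thresholds, run_case, collect_indexes, optimize_params,
         params, max_iters=20):
    """Sketch of the S100-S600 loop: run, check indexes, check target, adjust."""
    for _ in range(max_iters):
        actual_perf = run_case(params)        # S200: run the test case
        indexes = collect_indexes()           # S300: sample server indexes
        # S500: a server index beyond its threshold means the server, not the
        # GPU parameters, is limiting performance -> readjust the test case.
        if any(indexes[k] > thresholds[k] for k in thresholds):
            return params, "server-limited"
        if actual_perf >= target_perf:        # S600: target reached, tuning ends
            return params, "done"
        params = optimize_params(params, actual_perf)
    return params, "max-iterations"

# Toy stand-ins: performance scales with core clock; each pass adds 100 MHz.
params, status = tune(
    target_perf=120, thresholds={"cpu_util": 90},
    run_case=lambda p: p["core_clock"] / 10,
    collect_indexes=lambda: {"cpu_util": 50},
    optimize_params=lambda p, perf: {**p, "core_clock": p["core_clock"] + 100},
    params={"core_clock": 1000})
print(status, params["core_clock"])  # done 1200
```

The "server-limited" exit corresponds to fig. 5: when a server index exceeds its threshold, the test case is readjusted rather than the GPU parameters.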
Referring to fig. 3, in a specific implementation, the execution module configures a cuDNN deep neural network acceleration structure to process GPU network-layer data, optimizing that processing with convolution operations and shortening image processing time. When the GPU executes the test case, the execution module configures a tracing program that monitors the storage addresses of the processing functions by instruction, thereby tracing their invocation: when a processing function at a storage address is invoked, the tracing program records that address, determines the attributes of the invoked function from it, records the function's start and end times, and stores them in a log file. The execution module reads the log file to determine which processing functions are executing at any moment, queries through a first instruction the number of thread blocks launched by each processing function and the number of threads in each block, and computes the total thread count of each function. It then sums the thread counts of all processing functions to obtain the GPU's thread count at a given moment, calculates the GPU's actual computing capacity at that moment, and records it in the first file, which is stored in a file directory specified by the server. The processing functions may be CUDA functions.
The monitoring module collects data for the GPU-related server indexes and records it in the second file. In a specific implementation, the monitoring module periodically reads, through a second instruction, the server index data stored in the corresponding files or registers, executing the second instruction cyclically with a period of 20 ms. The second instruction comprises commands for reading, from those files or registers, the CPU utilization data, the server memory utilization data, the PCIe bandwidth data and the NVLink transmit/receive rate data; the obtained CPU utilization, memory utilization, PCIe bandwidth and NVLink rate data, together with the time at which they were acquired, are recorded in the second file. The second file is stored in a file directory specified by the server.
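A minimal sketch of such a periodic sampling loop is shown below. The metric reader here is a stand-in returning fixed values, since the real second instruction reads CPU utilization, memory utilization, PCIe bandwidth and NVLink rates from driver files or registers; the record layout and names are assumptions for illustration.

```python
import json
import tempfile
import time

def sample_indexes(read_metrics, out_path, period=0.02, samples=3):
    """Append `samples` timestamped records to the 'second file', one per period."""
    records = []
    with open(out_path, "a") as f:
        for _ in range(samples):
            rec = {"time": time.time(), **read_metrics()}
            records.append(rec)
            f.write(json.dumps(rec) + "\n")   # one JSON line per sample
            time.sleep(period)                # 20 ms cycle by default
    return records

# Stand-in reader with fixed values; a real one would read the corresponding
# files/registers for these four server indexes.
fake_reader = lambda: {"cpu_util": 37.5, "mem_util": 62.1,
                       "pcie_bw_gbps": 11.8, "nvlink_rate_gbps": 46.0}
second_file = tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False).name
recs = sample_indexes(fake_reader, second_file, period=0.02, samples=3)
print(len(recs))  # 3
```

Recording the acquisition time alongside each sample is what lets the analysis module correlate server index changes with the GPU parameter adjustments.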
The third and fourth files are saved in a file directory specified by the server: the third file holds the manually set target computing performance data for the GPU, and the fourth file holds the manually set threshold data for the server indexes. In a specific implementation, interfaces for modifying the third and fourth files are provided, and the files are modified by manually entering the target computing performance data and the server index threshold data through those interfaces. The analysis module reads the contents of the first, second, third and fourth files from their directories with a read instruction. It compares the data in the second and fourth files: the actual server CPU utilization against the configured CPU utilization threshold, the actual server memory utilization against the memory utilization threshold, the actual PCIe bandwidth against the PCIe bandwidth threshold, and the actual NVLink transmit/receive rate against the NVLink rate threshold. By keeping the server's CPU utilization, memory utilization, PCIe bandwidth and NVLink rate under control, the influence of the server indexes on GPU tuning is eliminated. The analysis module also compares the data in the first and third files, comparing the GPU's actual computing performance against the target computing performance, and judges from the difference whether the GPU parameters need further adjustment. The analysis module is thus provided with the CPU utilization threshold, the PCIe bandwidth threshold and the other index thresholds.
The GPU parameters are initialized, and the GPU runs the test case in the initialized state; the GPU parameters include memory frequency, core clock frequency, maximum power limit and compute mode. When the GPU's performance in the initialized state is not fully realized, the difference between the actual computing performance analyzed by the analysis module and the target computing performance falls outside the allowed range, and the GPU passes its initialized parameters to the parameter optimization module. The parameter optimization module provides a GPU parameter configuration interface and is configured with a parameter optimization algorithm: the algorithm outputs the adjusted GPU parameters to the configuration interface, and the interface configures them to the GPU through a third instruction. The parameter optimization algorithm comprises a memory frequency optimization algorithm, a core clock frequency optimization algorithm and a maximum power limit optimization algorithm, which output the memory frequency, core clock frequency and maximum power limit parameters respectively.
Specifically, the parameter optimization module takes the GPU's initialized memory frequency as the starting value of the memory frequency optimization algorithm and sets a memory frequency adjustment step (which may be positive or negative). It makes two adjustments by this step, configures each result to the GPU in turn, and has the GPU rerun the same test case, then compares the starting value with the GPU's computing performance after each adjustment. If both adjustments perform worse than the initialized state, with the second worse than the first, the adjustment direction is wrong, and the algorithm flips the sign of the step so that the memory frequency moves in the tuning direction. Comparing the GPU's computing performance after successive adjustments: if performance drops, the optimal memory frequency lies between the last and second-to-last settings, and a bisection search finds it, after which the optimal memory frequency is configured to the GPU through the GPU parameter configuration interface; if the first of the two adjustments performs best, the optimal memory frequency lies between the starting value and the second setting, and bisection again finds it; and if performance keeps improving over the two adjustments, the algorithm continues adjusting by the original step until performance deteriorates, indicating that the optimal memory frequency lies between the last and second-to-last settings, and then finds it by bisection.
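The adjust-then-bisect search just described can be sketched as a small routine; `measure` stands in for rerunning the test case at a candidate memory frequency, and all names and values are illustrative rather than taken from the patent. The same routine applies unchanged to the core clock frequency and the maximum power limit.

```python
def find_best_setting(measure, start, step, min_step=1.0):
    """Hill-climb by `step`; on a performance drop, flip and halve the step."""
    best_x, best_y = start, measure(start)
    if measure(start + step) < best_y:   # first move made things worse:
        step = -step                     # the adjustment direction was wrong
    x = start + step
    y = measure(x)
    while abs(step) >= min_step:
        if y >= best_y:                  # improved: accept and keep stepping
            best_x, best_y = x, y
        else:                            # dropped: optimum was passed, so
            step = -step / 2             # search between the last two settings
        x = best_x + step
        y = measure(x)
    return best_x

# Toy unimodal performance curve peaking at 1500 MHz.
print(find_best_setting(lambda f: -(f - 1500) ** 2, 1000, 200, min_step=50))  # 1500.0
```

Halving and reversing the step once performance drops is one simple reading of the "bisection" refinement between the last two frequency settings; `min_step` bounds how finely the frequency is resolved.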
Similarly, the parameter optimization module takes the GPU's initialized core clock frequency as the starting value of the core clock frequency optimization algorithm and sets a core clock frequency adjustment step (which may be positive or negative). It makes two adjustments by this step, configures each result to the GPU in turn, and has the GPU rerun the same test case, then compares the starting value with the GPU's computing performance after each adjustment. If both adjustments perform worse than the initialized state, with the second worse than the first, the adjustment direction is wrong, and the algorithm flips the sign of the step so that the core clock frequency moves in the tuning direction. Comparing the GPU's computing performance after successive adjustments: if performance drops, the optimal core clock frequency lies between the last and second-to-last settings, and a bisection search finds it, after which the optimal core clock frequency is configured to the GPU through the GPU parameter configuration interface; if the first of the two adjustments performs best, the optimal core clock frequency lies between the starting value and the second setting, and bisection again finds it; and if performance keeps improving over the two adjustments, the algorithm continues adjusting by the original step until performance deteriorates, indicating that the optimal core clock frequency lies between the last and second-to-last settings, and then finds it by bisection.
Similarly, the parameter optimization module takes the GPU maximum power limit initialization parameter as the initial value of the GPU maximum power limit optimization algorithm and sets a maximum power limit adjustment amount (positive or negative). It performs two adjustments by this adjustment amount, configures the result of each adjustment to the GPU in turn, and the GPU re-runs the same calculation example; the initial value is then compared with the GPU computing performance after each of the two maximum power limit adjustments. If both adjusted results are worse than the initialized state and the second is worse than the first, the adjustment direction is wrong; the GPU maximum power limit optimization algorithm changes the sign of the adjustment amount so that the maximum power limit is adjusted in the tuning direction, and again compares the GPU computing performance after two adjustments. If the computing performance becomes worse, the optimal GPU maximum power limit lies between the last maximum power limit and the penultimate maximum power limit; the optimal maximum power limit is then found by the bisection method and configured to the GPU through the GPU parameter configuration interface. If the first of the two adjustments yields the best computing performance, the optimal GPU maximum power limit lies between the initial value and the maximum power limit set in the second adjustment, and it is likewise found by the bisection method. If the GPU computing performance keeps improving over the two adjustments, adjustment continues by the original maximum power limit adjustment amount until the computing performance deteriorates, indicating that the optimal GPU maximum power limit lies between the last maximum power limit and the penultimate maximum power limit; it is then found by the bisection method.
The GPU working mode is then switched: the calculation example is run in each working mode, and the analysis module selects a suitable GPU working mode by comparing the actual computing performance in each mode with the ideal computing performance.
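The working-mode selection step can be sketched as below. The mode names and per-mode throughput figures are fabricated for illustration only; the sketch simply keeps the mode whose measured performance comes closest to the ideal value, which is one plausible reading of "selecting a proper GPU working mode" from the comparison.

```python
# Illustrative working-mode selection: run the same calculation example in
# each candidate mode and keep the mode closest to the ideal performance.
def pick_working_mode(run_example, modes, ideal_perf):
    """Return (best_mode, measured_perf) minimizing the gap to ideal_perf."""
    results = {mode: run_example(mode) for mode in modes}
    best = min(results, key=lambda m: abs(ideal_perf - results[m]))
    return best, results[best]

# Fabricated per-mode throughput figures, for demonstration only.
fake_perf = {"default": 70.0, "exclusive_process": 92.0, "prohibited": 0.0}
mode, perf = pick_working_mode(fake_perf.get, fake_perf, ideal_perf=100.0)
```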
According to the high-performance server GPU performance bottleneck tuning method, the execution module obtains the computing performance of the GPU while a calculation example runs; the analysis module compares the actual computing performance of the GPU with the ideal computing performance and judges whether room for further optimization remains; the monitoring module collects data on the server indexes that affect GPU computing performance; and the analysis module compares the server indexes with the corresponding index thresholds, determines the influence of each server index on GPU computing performance, and adjusts the tuning strategy accordingly. If a calculation example that is not limited by the server indexes is selected for tuning, the tuning strategy excludes the influence of the server indexes and GPU tuning can be completed properly.
Finally, it should be noted that, as one of ordinary skill in the art can appreciate, all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware. The program of the above method for tuning the performance bottleneck of a GPU of a high-performance server can be stored in a computer readable storage medium, and, when executed, can include the processes of the embodiments of the methods described above. The storage medium of the program may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like. The embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the methods disclosed according to embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. Which when executed by a processor performs the above-described functions defined in the methods disclosed in embodiments of the invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on, or transmitted over, a storage medium as one or more instructions or code. Storage media includes computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of instructions or data structures and which can be accessed by a general-purpose or special-purpose computer or a general-purpose or special-purpose processor. Also, any connection is properly termed a storage medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of storage media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the embodiments disclosed in the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to suggest that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the concept of the embodiments of the invention, technical features of the above embodiment or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments exist as described above; they are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A performance bottleneck tuning method for a GPU (graphics processing unit) of a high-performance server is characterized by comprising the following steps:
configuring a set computing performance of the GPU and storing the set computing performance in a third file, and configuring thresholds of server indexes and storing the thresholds in a fourth file;
executing, by the GPU, a calculation example, tracking calls of processing functions during execution, recording each processing function and its start and end times, calculating the actual computing performance of the GPU from the processing functions and their start and end times, and recording the actual computing performance of the GPU in a first file;
collecting, during execution of the calculation example, data of the server indexes related to the GPU, and recording the data of the server indexes in a second file;
comparing, according to the first file, the second file, the third file and the fourth file, the data of the server indexes with the thresholds of the server indexes, and judging whether current GPU parameters match the server; and
comparing the actual computing performance of the GPU with the set computing performance, and, if the actual computing performance is less than the set computing performance, adjusting the GPU parameters through a parameter optimization algorithm.
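By way of illustration only (not part of the claims), the comparison and decision logic recited in claim 1 — server indexes against their thresholds, and actual against set computing performance — can be sketched as follows. The field names and the simple greater-than threshold test are assumptions of this sketch.

```python
# Illustrative claim-1 decision step: flag server indexes that exceed their
# configured thresholds, and decide whether GPU parameter tuning should run.
def needs_tuning(actual_perf, set_perf, index_data, index_thresholds):
    """Return (server_limited, tune) flags per the claim-1 comparison."""
    # An index over its threshold suggests the bottleneck is the server,
    # not the GPU parameters, so that index is flagged.
    exceeded = [k for k, v in index_data.items()
                if v > index_thresholds.get(k, float("inf"))]
    server_limited = bool(exceeded)
    # Tuning triggers only when actual performance falls short of target.
    tune = actual_perf < set_perf
    return server_limited, tune
```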
2. The method of claim 1, wherein calculating the actual computing performance of the GPU from the processing functions and their start and end times and recording it in the first file comprises: acquiring the processing functions running at a specific moment, calling the number of launched operation blocks of each processing function and the number of threads in each operation block, and calculating the thread count of each processing function; summing the thread counts of the processing functions to obtain the thread count of the GPU; calculating the actual computing capacity of the GPU from the thread count of the GPU at the specific moment; and recording the actual computing capacity of the GPU in the first file.
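As an illustrative sketch (not part of the claims) of the computation recited in claim 2: at a sampling instant, sum blocks × threads-per-block over every running processing function (kernel) to get the GPU's active thread count, then scale by a per-thread rate to estimate actual computing capacity. The linear per-thread model and its rate are assumptions of the sketch.

```python
# Claim-2 sketch: thread count per kernel, summed to a GPU thread count,
# then scaled to an estimated computing capacity.
def gpu_thread_count(kernels):
    """kernels: iterable of (num_blocks, threads_per_block) per function."""
    return sum(blocks * tpb for blocks, tpb in kernels)

def actual_capacity(kernels, flops_per_thread):
    # Hypothetical linear model: capacity grows with concurrently
    # resident threads at an assumed per-thread throughput.
    return gpu_thread_count(kernels) * flops_per_thread
```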
3. The method according to claim 1, wherein collecting and recording the data of the server indexes related to the GPU in the second file comprises: periodically calling the server index data stored in a system file or register, and recording the acquired server index data and the time of acquisition in the second file.
4. The method of claim 3, wherein the server index data includes a server CPU utilization rate, a server memory utilization rate, a PCIE bandwidth, and an NVLINK transceiving rate.
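As an illustrative sketch (not part of the claims) of the periodic collection recited in claims 3 and 4, the loop below timestamps each reading of the server indexes and appends it to the "second file". The reader callables stand in for system-file or register reads, and the JSON-lines layout is an assumption of this sketch.

```python
import json
import time

# Periodically sample each server index (e.g. CPU utilization, memory
# utilization, PCIE bandwidth, NVLINK rate), timestamp the sample, and
# append it to the second file.
def sample_indices(readers, out_path, samples=3, period=0.01):
    """Append `samples` timestamped readings of every index to out_path."""
    with open(out_path, "a") as f:
        for _ in range(samples):
            row = {"t": time.time()}
            row.update({name: read() for name, read in readers.items()})
            f.write(json.dumps(row) + "\n")
            time.sleep(period)
```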
5. The method as claimed in claim 1, wherein the GPU parameters are adjusted by a parameter optimization algorithm by setting a GPU parameter configuration interface between the server and the GPU, the parameter optimization algorithm outputs GPU parameters to the GPU parameter configuration interface, and the GPU parameter configuration interface configures the GPU parameters to the GPU.
6. The method of claim 5, wherein the GPU parameters comprise memory frequency, core clock frequency, maximum power limit, and compute mode.
7. The method according to claim 6, wherein the parameter optimization algorithm comprises a memory frequency optimization algorithm, a core clock frequency optimization algorithm, and a maximum power limit optimization algorithm; the memory frequency optimization algorithm outputs a memory frequency parameter, the core clock frequency optimization algorithm outputs a core clock frequency parameter, and the maximum power limit optimization algorithm outputs a maximum power limit parameter.
8. The method according to claim 7, wherein the memory frequency optimization algorithm adjusts the memory frequency in the direction that improves GPU performance and, when GPU performance drops during the process, determines the optimal memory frequency by a bisection algorithm; the core clock frequency optimization algorithm adjusts the core clock frequency in the direction that improves GPU performance and, when GPU performance drops during the process, determines the optimal core clock frequency by a bisection algorithm; and the maximum power limit optimization algorithm adjusts the maximum power limit in the direction that improves GPU performance and, when GPU performance drops during the process, determines the optimal maximum power limit by a bisection algorithm.
9. A device for tuning the performance bottleneck of a GPU of a high-performance server, comprising a processing unit (701), a storage unit (702), a bus unit (703) and an interface unit (704), wherein the processing unit (701), the storage unit (702) and the interface unit (704) are connected to the bus unit (703); the storage unit (702) stores at least one instruction capable of implementing acquisition and judgment of GPU computing performance, acquisition and judgment of server index data, tuning of GPU performance parameters, and configuration of GPU parameters; the processing unit calls and executes the instruction to implement the acquisition and judgment of GPU computing performance, the acquisition and judgment of server index data, and the tuning of GPU performance parameters; and the interface unit calls and executes the instruction to configure the GPU parameters to the GPU.
10. A storage medium storing at least one instruction that enables GPU computing performance acquisition and determination, server index data acquisition and determination, tuning of GPU performance parameters, and configuration of GPU parameters.
CN202010804248.9A 2020-08-11 2020-08-11 Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium Active CN112000472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010804248.9A CN112000472B (en) 2020-08-11 2020-08-11 Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010804248.9A CN112000472B (en) 2020-08-11 2020-08-11 Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium

Publications (2)

Publication Number Publication Date
CN112000472A true CN112000472A (en) 2020-11-27
CN112000472B CN112000472B (en) 2022-07-08

Family

ID=73463791

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010804248.9A Active CN112000472B (en) 2020-08-11 2020-08-11 Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium

Country Status (1)

Country Link
CN (1) CN112000472B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8410994B1 (en) * 2010-08-23 2013-04-02 Matrox Graphics Inc. System and method for remote graphics display
CN107832177A (en) * 2017-11-20 2018-03-23 郑州云海信息技术有限公司 A kind of EDP method of testings, system, equipment and the storage medium of more GPU systems
CN109558264A (en) * 2018-12-12 2019-04-02 浪潮(北京)电子信息产业有限公司 A kind of volume information method of calibration, system and the associated component of virtual volume


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535407A (en) * 2021-07-30 2021-10-22 济南浪潮数据技术有限公司 Server optimization method, system, equipment and storage medium
CN113535407B (en) * 2021-07-30 2024-03-19 济南浪潮数据技术有限公司 Optimization method, system, equipment and storage medium of server
CN113868105A (en) * 2021-08-20 2021-12-31 苏州浪潮智能科技有限公司 Self-optimization method and device for Java performance benchmark test of server
CN113868105B (en) * 2021-08-20 2023-08-08 苏州浪潮智能科技有限公司 Self-optimizing method and device for server Java performance benchmark test
CN113672468A (en) * 2021-08-24 2021-11-19 北京字节跳动网络技术有限公司 Load monitoring method and device
CN114021733A (en) * 2021-09-30 2022-02-08 苏州浪潮智能科技有限公司 Model training optimization method and device, computer equipment and storage medium
CN114021733B (en) * 2021-09-30 2023-11-14 苏州浪潮智能科技有限公司 Model training optimization method, device, computer equipment and storage medium
CN116225311A (en) * 2022-12-12 2023-06-06 荣耀终端有限公司 Configuration method, device and server for terminal equipment storage system parameters
CN116225311B (en) * 2022-12-12 2023-11-21 荣耀终端有限公司 Configuration method, device and server for terminal equipment storage system parameters

Also Published As

Publication number Publication date
CN112000472B (en) 2022-07-08

Similar Documents

Publication Publication Date Title
CN112000472B (en) Method and device for tuning performance bottleneck of GPU (graphics processing Unit) of high-performance server and storage medium
CN110413255B (en) Artificial neural network adjusting method and device
CN110555450B (en) Face recognition neural network adjusting method and device
Bateni et al. Predjoule: A timing-predictable energy optimization framework for deep neural networks
CN110289994B (en) Cluster capacity adjusting method and device
CN110727685B (en) Data compression method, equipment and storage medium based on Cassandra database
WO2022166316A1 (en) Light supplementing method and apparatus for facial recognition, and facial recognition device and system therefor
CN103268204A (en) Adjusting and optimizing method and device of solid-state disk
CN113316794A (en) Data management device for supporting high-speed artificial neural network operation by data cache based on data position of artificial neural network
WO2021152849A1 (en) Data processing device and data processing program
CN117251391A (en) Link equalization method, device, equipment and storage medium
CN112469059A (en) Back-to-first service communication system, transmitting end device, medium, and signal processing method
US20170063955A1 (en) Communication method, communication device, and recording medium
WO2023272432A1 (en) Image processing method and image processing apparatus
CN112433682B (en) Method for acquiring control parameters in solid state disk, storage medium and electronic device
US11455533B2 (en) Information processing apparatus, control method, and non-transitory computer-readable storage medium for storing information processing program
CN114418059A (en) Information processing method and device
CN113128682A (en) Automatic neural network model adaptation method and device
TW202201284A (en) Automatic machine learning system performance tuning method, device, electronic device and storage medium
CN115858418B (en) Data caching method and system
KR102585838B1 (en) Method for lightweighting neural network model and electronic apparatus for performing the same
US20230095268A1 (en) Storage medium, machine learning method, and information processing apparatus
CN115696405B (en) Computing task unloading optimization method and system considering fairness
US20080205220A1 (en) Recording apparatus and recording method
US11604717B2 (en) Processor performance measurement apparatus and processor performance measurement method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant