CN115543746A - Graphics processor monitoring method, system and device and electronic equipment - Google Patents

Graphics processor monitoring method, system and device and electronic equipment

Info

Publication number
CN115543746A
Authority
CN
China
Prior art keywords
target
parameter
graphics processor
processor
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211228713.4A
Other languages
Chinese (zh)
Inventor
石晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202211228713.4A
Publication of CN115543746A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available

Abstract

The embodiments of the application provide a graphics processor monitoring method, system and device, and electronic equipment. The method comprises: acquiring a first target parameter of a target graphics processor according to a preset period; judging, according to the first target parameter, whether the target graphics processor is abnormal; when the target graphics processor is abnormal, acquiring a second target parameter in the target graphics processor through a target instruction, wherein the second target parameter is all the operating parameters of the target graphics processor detected while it is running, and the second target parameter includes the first target parameter; and determining the cause of the abnormality of the target graphics processor by analyzing the second target parameter. The application addresses the problem in the related art that monitoring of the graphics processor is not comprehensive and specific enough, and thereby helps ensure that the graphics processor works normally.

Description

Graphics processor monitoring method, system and device and electronic equipment
Technical Field
The embodiment of the application relates to the technical field of monitoring, in particular to a method, a system and a device for monitoring a graphics processor and electronic equipment.
Background
In the era of big data and informatization, servers must perform a large amount of computation. To meet the demand for fast data operations, a server is usually equipped with a GPU (Graphics Processing Unit). The GPU mainly handles the data operations of image processing, freeing the server's CPU from image processing tasks so that it can execute more of the other system tasks and improve the overall performance of the server.
With the development of video encoding and decoding, scientific computing, artificial intelligence, and other fields, the demand for GPU servers has increased, so monitoring and managing the GPU as a core component has become more important. In the related art, a maintainer installs the GPU driver and related monitoring software and operates the software manually to monitor the GPU; the process is cumbersome and inefficient, and the resulting monitoring is often neither comprehensive nor specific.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the application provide a graphics processor monitoring method, system and device, and electronic equipment, which at least solve the problem in the related art that monitoring of the graphics processor is not comprehensive and specific enough.
According to an embodiment of the present application, there is provided a graphics processor monitoring method including: acquiring a first target parameter of a target graphics processor according to a preset period, wherein the first target parameter comprises at least one of the following: a threshold type parameter, a state type parameter, and an asset information parameter; judging, according to the first target parameter, whether the target graphics processor is abnormal; when the target graphics processor is abnormal, acquiring a second target parameter in the target graphics processor through a target instruction, wherein the second target parameter is all the operating parameters of the target graphics processor detected while it is running, and the second target parameter includes the first target parameter; and determining the cause of the abnormality of the target graphics processor by analyzing the second target parameter.
In one exemplary embodiment, the obtaining, by the target instruction, the second target parameter in the target graphics processor comprises: determining a target interface corresponding to a target instruction, wherein the target instruction is an instruction in a preset format, and the preset format comprises at least one of the following: a first instruction format and a second instruction format; and acquiring a second target parameter in the target graphic processor according to the target interface.
In an exemplary embodiment, obtaining the second target parameter in the target graphics processor according to the target interface comprises: setting the read state of the parameters in the target graphics processor to read-enabled; and reading all the operating parameters in the target graphics processor and encapsulating them through a target protocol to obtain the second target parameter, wherein the target protocol is a protocol supported by the target graphics processor.
In one exemplary embodiment, determining whether the target graphics processor is abnormal based on the first target parameter includes: determining, according to the asset information parameter, a preset threshold corresponding to the threshold type parameter in the target graphics processor, wherein the asset information parameter comprises at least one of the following: the model of the target graphics processor, the manufacturer of the target graphics processor, and the serial number of the target graphics processor; comparing the value of the threshold type parameter with the preset threshold, wherein the threshold type parameter comprises at least one of the following: a temperature parameter, a voltage parameter, and a current parameter in the target graphics processor; and judging that the target graphics processor is abnormal when the threshold type parameter exceeds the preset threshold.
In an exemplary embodiment, determining whether the target graphics processor is anomalous based on the first target parameter further comprises: determining state information of the state type parameter, wherein the state information comprises: normal and abnormal; and when the state information is abnormal, judging that the target graphics processor has an abnormality.
In an exemplary embodiment, before obtaining the second target parameter in the target graphics processor through the target instruction in the preset format, the method further includes: sending alarm information when the target graphics processor is abnormal, and sending the abnormal first target parameter to a front-end interface for display.
According to another embodiment of the present application, there is provided a graphics processor monitoring system including a baseboard management controller and a target graphics processor, wherein the modules running in the baseboard management controller include a functional application layer, a component management layer, and a protocol encapsulation layer. The protocol encapsulation layer exchanges data with the target graphics processor, encapsulates the parameter data in the target graphics processor, and generates a target interface to be called by the component management layer; the component management layer receives a target instruction issued by the functional application layer and, according to the target instruction, calls the target interface to acquire the parameter data in the target graphics processor; the functional application layer collects the parameter data for analysis by sending the target instruction.
In one exemplary embodiment, the functional application layer includes: the real-time monitoring module is used for collecting a first target parameter of the target graphic processor according to a preset period and judging whether the target graphic processor is abnormal or not according to the first target parameter, wherein the first target parameter comprises: a threshold type parameter, a state type parameter and an asset information parameter; the first diagnosis module is used for acquiring a second target parameter in the target graphic processor by sending a target instruction in a first instruction format, and analyzing the second target parameter to obtain a specific reason for the abnormality in the target graphic processor; and the second diagnosis module is used for acquiring a second target parameter in the target graphic processor by sending the target instruction in the second instruction format, and analyzing the second target parameter to obtain a specific reason of the abnormality in the target graphic processor.
According to another embodiment of the present application, there is provided a graphics processor monitoring apparatus including: a period monitoring module, configured to acquire a first target parameter of the target graphics processor according to a preset period, wherein the first target parameter comprises at least one of the following: a threshold type parameter, a state type parameter, and an asset information parameter; an abnormality judgment module, configured to judge, according to the first target parameter, whether the target graphics processor is abnormal; a parameter acquisition module, configured to acquire a second target parameter in the target graphics processor through a target instruction when the target graphics processor is abnormal, wherein the second target parameter is all the operating parameters of the target graphics processor detected while it is running, and the second target parameter includes the first target parameter; and an abnormality analysis module, configured to determine the cause of the abnormality of the target graphics processor by analyzing the second target parameter.
According to a further embodiment of the application, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
According to yet another embodiment of the present application, there is also provided an electronic device, comprising a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the method embodiments described above.
The application uses a GPU server monitoring model based on the BMC (Baseboard Management Controller): a set of key GPU parameters is monitored in real time, and when a key parameter is abnormal, all the operating parameters of the GPU are obtained for diagnosis. This solves the problem in the related art that monitoring of the graphics processor is not comprehensive and specific enough, and helps ensure that the graphics processor works normally.
Drawings
FIG. 1 is a flow chart of a graphics processor monitoring method according to an embodiment of the present application;
FIG. 2 is a block diagram of a hardware structure of a mobile terminal according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a graphics processor monitoring system according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a graphics processor monitoring apparatus according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
In order to facilitate the understanding of the embodiments of the present application by those skilled in the art, some technical terms or terms related to the embodiments of the present application will be explained as follows:
Graphics Processing Unit (GPU): also called a display core, visual processor, or display chip; a microprocessor dedicated to image- and graphics-related operations on personal computers, workstations, game consoles, and some mobile devices (e.g., tablet computers and smartphones).
Baseboard Management Controller (BMC): a controller that performs remote management of the server; it can carry out operations such as firmware upgrades and equipment checks even when the machine is not powered on.
Intelligent Platform Management Interface (IPMI): a set of computer interface specifications defined for an autonomous computer subsystem, providing management and monitoring functions that are independent of the host system's CPU, firmware (BIOS or UEFI), and operating system. The core of IPMI is a dedicated chip/controller (the BMC), which does not depend on the operating system, BIOS, or processor and is therefore an out-of-band management device. Developers can interact with the BMC through IPMI OEM commands (IPMI OEM CMD).
Platform Level Data Model (PLDM): an internally facing, low-level data model intended to be an efficient data/control source for mapping under the Common Information Model (CIM).
RESTful: REST is an architectural style, that is, a set of specifications, constraints, and principles; an architecture that conforms to them is a RESTful architecture.
Uniform Resource Identifier (URI): a string used to identify the name of an Internet resource.
Management Component Transport Protocol (MCTP): a protocol for two-way communication between intelligent devices within the platform management subsystem of a computer system, carried over one or more buses.
SMBus (System Management Bus): a low-speed communication bus used in mobile PC and desktop PC systems. Devices on the motherboard are controlled, and the corresponding information is collected, over an inexpensive yet capable two-wire bus.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking the example of the method running on the mobile terminal, fig. 2 is a block diagram of a hardware structure of the mobile terminal of a graphics processor monitoring method according to an embodiment of the present application. As shown in fig. 2, the mobile terminal may comprise one or more processors 202 (only one is shown in fig. 2) (the processor 202 may comprise, but is not limited to, a processing means such as a microprocessor MCU or a programmable logic device FPGA), and a memory 204 for storing data, wherein the mobile terminal may further comprise a transmission device 206 for communication functions and an input-output device 208. It will be understood by those of ordinary skill in the art that the structure shown in fig. 2 is only an illustration and is not intended to limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 2, or have a different configuration than shown in FIG. 2.
The memory 204 may be used to store a computer program, for example, a software program and a module of an application software, such as a computer program corresponding to the graphics processor monitoring method in the embodiment of the present application, and the processor 202 executes various functional applications and data processing by running the computer program stored in the memory 204, so as to implement the method described above. Memory 204 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 204 may further include memory located remotely from the processor 202, which may be connected to the mobile terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 206 is used to receive or transmit data via a network. Specific examples of such a network include a wireless network provided by the communication provider of the mobile terminal. In one example, the transmission device 206 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station in order to communicate with the Internet. In another example, the transmission device 206 may be a Radio Frequency (RF) module, which communicates with the Internet wirelessly.
In this embodiment, a graphics processor monitoring method running on the above mobile terminal is provided. FIG. 1 is a flowchart of a graphics processor monitoring method according to an embodiment of the present application. As shown in FIG. 1, the flow includes the following steps:
Step S102, collecting a first target parameter of a target graphics processor according to a preset period, wherein the first target parameter comprises at least one of the following: a threshold type parameter, a state type parameter, and an asset information parameter;
In this embodiment, the target graphics processor is a GPU and the first target parameters are key parameters of the GPU. The threshold type parameters are mainly numerical parameters of the GPU, including but not limited to a temperature parameter, a voltage parameter, and a current parameter in the target graphics processor. The state type parameter is an identifier-type parameter indicating whether the overall operating state of the GPU is normal. The asset information parameters describe the GPU's production information, including but not limited to the model, the manufacturer, and the serial number of the target graphics processor.
In this embodiment, the preset period may be adjusted according to actual requirements.
Step S104, judging whether the target graphic processor is abnormal or not according to the first target parameter;
In this embodiment, whether the target graphics processor is abnormal is judged according to the threshold type parameter. Specifically, judging whether the target graphics processor is abnormal according to the first target parameter includes the following steps: determining, according to the asset information parameter, a preset threshold corresponding to the threshold type parameter in the target graphics processor, wherein the asset information parameter comprises at least one of the following: the model of the target graphics processor, the manufacturer of the target graphics processor, and the serial number of the target graphics processor; comparing the value of the threshold type parameter with the preset threshold, wherein the threshold type parameter comprises at least one of the following: a temperature parameter, a voltage parameter, and a current parameter in the target graphics processor; and judging that the target graphics processor is abnormal when the threshold type parameter exceeds the preset threshold.
As an optional implementation, the BMC sends a read instruction to the GPU at every preset interval. After receiving the read instruction, the GPU starts collecting its own current GPU information, which reflects both the current performance of the GPU and other, non-performance information. Once collection is complete, the GPU sends the current GPU information to the BMC, and the BMC monitors the GPU according to the information it receives. Because the BMC sends the read instruction periodically, it monitors the GPU periodically, so GPU problems can be found promptly and abnormal conditions can be handled in time.
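The periodic collection described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the application: `poll_gpu`, `read_gpu_info`, `on_sample`, and the parameter name `temperature_c` are hypothetical, and the exchange between BMC and GPU is simulated by a plain callable.

```python
import time
from typing import Callable, Dict


def poll_gpu(read_gpu_info: Callable[[], Dict[str, float]],
             on_sample: Callable[[Dict[str, float]], None],
             period_s: float,
             cycles: int,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Read the GPU once per period for `cycles` cycles; return the number of samples handled."""
    handled = 0
    for _ in range(cycles):
        sample = read_gpu_info()  # stands in for: BMC sends a read instruction, GPU replies
        on_sample(sample)         # BMC monitors the GPU using the received information
        handled += 1
        sleep(period_s)           # wait for the next preset period
    return handled
```

Passing `sleep` as a parameter keeps the loop testable without real delays; an actual BMC service would more likely run this from a timer or a dedicated task.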
For example, the preset temperature threshold of a model A GPU is 100 °C. If the temperature detected by the GPU's internal temperature sensor is 120 °C, the preset temperature threshold is exceeded and the GPU is judged to be abnormal.
For example, when the power detected by the internal power sensor of a model A GPU exceeds the preset power threshold corresponding to the model A GPU, the GPU is judged to be abnormal.
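A minimal sketch of the threshold comparison in the two examples above, assuming the preset thresholds are looked up by GPU model from the asset information. The table `PRESET_THRESHOLDS`, the key names, and the 300 W power limit are hypothetical; only the 100 °C temperature limit comes from the example in the text.

```python
from typing import Dict, List

# Hypothetical per-model limits; real values would come from the GPU vendor's specification.
PRESET_THRESHOLDS: Dict[str, Dict[str, float]] = {
    "model-A": {"temperature_c": 100.0, "power_w": 300.0},
}


def exceeded_thresholds(asset_info: Dict[str, str],
                        readings: Dict[str, float]) -> List[str]:
    """Return the names of threshold type parameters whose value exceeds the preset threshold."""
    limits = PRESET_THRESHOLDS.get(asset_info["model"], {})
    return [name for name, value in readings.items()
            if name in limits and value > limits[name]]
```

A non-empty result means the GPU is judged abnormal; the returned names identify which key parameters triggered the judgment.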
In this embodiment, whether the target graphics processor is abnormal is also judged according to the state type parameter. Specifically, judging whether the target graphics processor is abnormal according to the first target parameter further includes: determining the state information of the state type parameter, wherein the state information is either normal or abnormal; and judging that the target graphics processor is abnormal when the state information is abnormal.
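The state-based judgment can be illustrated in a few lines. The enum `SensorState` and the function name are hypothetical; the text only specifies that the state information is either normal or abnormal.

```python
from enum import Enum
from typing import Dict


class SensorState(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"


def has_state_abnormality(states: Dict[str, SensorState]) -> bool:
    """The GPU is judged abnormal as soon as any state type parameter reports ABNORMAL."""
    return any(state is SensorState.ABNORMAL for state in states.values())
```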
Step S106, when the target graphics processor is abnormal, acquiring a second target parameter in the target graphics processor through a target instruction, wherein the second target parameter is all the operating parameters of the target graphics processor detected while it is running, and the second target parameter includes the first target parameter;
In this embodiment, the second target parameter is all the parameters obtained by detecting the GPU while it is running, and the target protocol is a protocol type supported by the target graphics processor.
In some embodiments of the present application, obtaining, by the target instruction, the second target parameter in the target graphics processor comprises: determining a target interface corresponding to a target instruction, wherein the target instruction is an instruction in a preset format, and the preset format comprises at least one of the following: a first instruction format and a second instruction format; and acquiring a second target parameter in the target graphic processor according to the target interface.
In this embodiment, the first instruction format is the IPMI OEM command format and the second instruction format is the RESTful URL command format. Either format can be used to acquire all the parameters in the GPU, and a suitable format can be selected according to actual requirements. It should be noted that the two command formats are only examples, and the present solution is not limited to these two formats for obtaining all the parameters of the GPU.
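One way to picture "determining the target interface corresponding to a target instruction" is a pair of routing tables, one per instruction format. Everything concrete here is an assumption made for illustration: the (NetFn, command) byte pair `(0x3A, 0x01)`, the URL path `/gpu/0/parameters`, and the function names are hypothetical and are not taken from the application or from any specification.

```python
from typing import Callable, Dict, Tuple, Union

ParamReader = Callable[[], Dict[str, float]]


def get_all_gpu_parameters() -> Dict[str, float]:
    """Stand-in for the target interface; returns placeholder data."""
    return {"temperature_c": 65.0, "voltage_v": 12.0}


# Hypothetical routing tables: the command bytes and the URL path are invented for this example.
IPMI_OEM_ROUTES: Dict[Tuple[int, int], ParamReader] = {
    (0x3A, 0x01): get_all_gpu_parameters,
}
REST_ROUTES: Dict[str, ParamReader] = {
    "/gpu/0/parameters": get_all_gpu_parameters,
}


def resolve_target_interface(instruction: Union[Tuple[int, int], str]) -> ParamReader:
    """Map an instruction in either format to the interface that serves it."""
    if isinstance(instruction, tuple):   # first format: IPMI OEM style (NetFn, command)
        return IPMI_OEM_ROUTES[instruction]
    return REST_ROUTES[instruction]      # second format: RESTful URL path
```

Both formats resolve to the same underlying interface, which matches the statement that either one can acquire all the GPU parameters.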
In some embodiments of the present application, obtaining the second target parameter in the target graphics processor according to the target interface comprises: setting the reading state of the parameter in the target graphic processor as allowing to read; and reading all the operating parameters in the target graphic processor, and packaging through a target protocol to obtain a second target parameter, wherein the target protocol is a protocol supported by the target graphic processor.
As an optional implementation, parameter acquisition is completed by calling interfaces. Specifically, the SetNumericSensorEnable interface is called to enable the threshold type parameters inside the GPU, that is, to set the read state of the threshold type parameters in the target graphics processor to read-enabled; the SetStateSensorEnables interface is called to enable the state type parameters inside the GPU, that is, to set the read state of the state type parameters in the target graphics processor to read-enabled; the raw data of the state type parameters and the threshold type parameters are read by calling the GetStateSensorReadings interface and the GetSensorReading interface, respectively; and finally the GetPDR interface is called to obtain the specific semantics of the threshold type and state type parameters, including the unit, the method for calculating the real value, the coefficients, and so on.
As an alternative embodiment, the asset information parameters are obtained by calling the GetFRURecordTable interface.
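The enable, read, and interpret sequence can be sketched against an in-memory stand-in for the GPU. This is a sketch under assumptions: the class `FakeGpuSensors`, its method names, and the sample values are hypothetical, and the linear conversion `(raw * resolution + offset) * 10 ** unit_modifier` is assumed here as a typical form of the "real value calculation method" carried by a sensor descriptor; the application does not spell out the formula.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Descriptor:
    unit: str
    resolution: float
    offset: float
    unit_modifier: int  # power-of-ten exponent applied to the converted value


class FakeGpuSensors:
    """In-memory stand-in for the GPU; all values are made up."""

    def __init__(self) -> None:
        self._enabled = set()
        self._raw = {"temperature": 650}
        self._desc = {"temperature": Descriptor("degC", 1.0, 0.0, -1)}

    def enable(self, name: str) -> None:
        self._enabled.add(name)          # set the read state to read-enabled

    def read_raw(self, name: str) -> int:
        if name not in self._enabled:
            raise PermissionError(f"{name} is not read-enabled")
        return self._raw[name]           # raw data, not yet a physical value

    def descriptor(self, name: str) -> Descriptor:
        return self._desc[name]          # unit and conversion coefficients


def read_real_value(gpu: FakeGpuSensors, name: str) -> Tuple[float, str]:
    """Enable the sensor, read its raw data, then convert using the descriptor."""
    gpu.enable(name)
    raw = gpu.read_raw(name)
    d = gpu.descriptor(name)
    value = (raw * d.resolution + d.offset) * 10 ** d.unit_modifier
    return value, d.unit
```

The ordering matters: reading before enabling fails in this sketch, mirroring the text's requirement that the read state be set to read-enabled first.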
To improve the real-time performance and comprehensiveness of GPU monitoring, before acquiring the second target parameter in the target graphics processor through the target instruction in the preset format, the method further includes: sending alarm information when the target graphics processor is abnormal, and sending the abnormal first target parameter to a front-end interface for display.
Specifically, some key parameters of the GPU are monitored in real time and exposed externally through sensors or IPMI OEM commands; when an abnormality is detected, alarm information is sent to avoid data processing errors and similar consequences of the GPU abnormality.
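The alarm step can be sketched as a small function that does nothing when no parameter is abnormal. The function name, the message text, and the two callbacks standing in for the alarm channel and the front-end interface are all hypothetical.

```python
from typing import Callable, Dict


def raise_alarm(abnormal_params: Dict[str, float],
                send_alarm: Callable[[str], None],
                push_to_frontend: Callable[[Dict[str, float]], None]) -> bool:
    """Send an alarm and forward the abnormal parameters for display; return whether an alarm was sent."""
    if not abnormal_params:
        return False
    send_alarm("GPU abnormal: " + ", ".join(sorted(abnormal_params)))
    push_to_frontend(abnormal_params)
    return True
```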
Step S108, determining the cause of the abnormality of the target graphics processor by analyzing the second target parameter.
Specifically, the second target parameters, that is, all the parameters of the GPU, include component state parameters in addition to the key parameters, where the component state parameters are used to characterize the current state of each component in the GPU.
As an alternative embodiment, the abnormal component in the GPU may be determined by detecting the component state parameter, and the component information is fed back to the front-end interface.
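Locating the abnormal component from the component state parameters might look like the following. The component names and the string encoding of the states are hypothetical; the text only says that each component has a state parameter describing its current state.

```python
from typing import Dict, List


def find_abnormal_components(component_states: Dict[str, str]) -> List[str]:
    """Return the names of components whose state is not 'normal', sorted for stable output."""
    return sorted(name for name, state in component_states.items()
                  if state != "normal")
```

The resulting list is what would be fed back to the front-end interface as the component information.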
With this scheme, a maintainer does not need to install the GPU driver and related monitoring software, which simplifies the monitoring process and improves monitoring efficiency. In addition, the BMC automatically acquires the GPU's key parameter information in real time, so the state of the GPU is monitored automatically, and when a key parameter is abnormal, all the parameters of the GPU are acquired to further analyze the cause of the abnormality.
Through the above steps, the problem in the related art that monitoring of the graphics processor is not comprehensive and specific enough is solved and the normal operation of the graphics processor is ensured; the method can also assist in fault localization and reduce operation and maintenance costs.
The main body for executing the above steps may be a controller, a terminal, etc., but is not limited thereto.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, and an optical disk), and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the method described in the embodiments of the present application.
Embodiments of the present application also provide a graphics processor monitoring system, which may be used to perform the steps of any of the above method embodiments.
Fig. 3 is a schematic structural diagram of a graphics processor monitoring system according to an embodiment of the present application, where as shown in fig. 3, the graphics processor monitoring system includes: baseboard management controller 30 and target graphics processor 32, wherein the modules running in baseboard management controller 30 include: a functional application layer 302, a component management layer 304, and a protocol encapsulation layer 306, and,
the protocol encapsulation layer 306 performs data interaction with the target graphics processor 32, and is configured to encapsulate parameter data in the target graphics processor 32 and generate a target interface, where the target interface is used for being called by the component management layer 304;
In this embodiment, the protocol encapsulation layer includes a PLDM protocol layer and an MCTP protocol layer.
Specifically, the PLDM protocol layer is a concrete implementation of the PLDM Base, PLDM Control and Monitoring, and PLDM FRU protocols; the MCTP protocol layer is a concrete implementation of the MCTP Base/Binding protocols. The application adopts a layered design: each layer provides an API to the layer above it and calls the API of the layer below it.
The component management layer 304 is configured to receive a target instruction issued by the function application layer 302, and call a target interface to acquire parameter data in the target graphics processor 32 according to the target instruction;
Specifically, the component management layer provides a unified interface to the functional application layer and wraps a layer of API internally. It interacts with the PLDM protocol layer in a dual-FIFO (first in, first out) manner, which ensures stable communication with, and decoupling from, the PLDM and MCTP protocol layers.
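The dual-FIFO interaction can be pictured as two queues, one carrying requests down to the protocol layer and one carrying responses back up. This is a minimal sketch under assumptions: the class and function names are hypothetical, the protocol layer is replaced by a worker thread that echoes a placeholder payload, and a real BMC implementation would not necessarily use threads or Python's `queue` module.

```python
import queue
import threading


def protocol_layer_worker(requests: "queue.Queue", responses: "queue.Queue") -> None:
    """Stand-in for the PLDM protocol layer: consume requests, produce responses."""
    while True:
        req = requests.get()
        if req is None:  # sentinel: stop the worker
            break
        responses.put({"request": req, "data": f"payload-for-{req}"})


class ComponentManager:
    """Talks to the protocol layer only through the two FIFOs, never by direct calls."""

    def __init__(self) -> None:
        self.requests: "queue.Queue" = queue.Queue()   # downward FIFO
        self.responses: "queue.Queue" = queue.Queue()  # upward FIFO
        self._worker = threading.Thread(
            target=protocol_layer_worker,
            args=(self.requests, self.responses),
            daemon=True,
        )
        self._worker.start()

    def get_parameter(self, name: str, timeout: float = 1.0) -> str:
        self.requests.put(name)
        return self.responses.get(timeout=timeout)["data"]

    def close(self) -> None:
        self.requests.put(None)
        self._worker.join()
```

Because the two layers share only the queues, either side can be replaced or restarted without the other needing to know how it is implemented, which is the decoupling the paragraph above refers to.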
The functional application layer 302 collects the parameter data for analysis by sending a target instruction.
In some embodiments of the present application, the functional application layer 302 includes: a real-time monitoring module 3022, a first diagnostic module 3024, a second diagnostic module 3026, wherein,
the real-time monitoring module 3022 is configured to collect a first target parameter of the target graphics processor 32 according to a preset period, and determine whether the target graphics processor 32 is abnormal according to the first target parameter, where the first target parameter includes: a threshold type parameter, a state type parameter and an asset information parameter;
specifically, the real-time monitoring module 3022 is configured to monitor key parameters of the GPU in real time and may complete parameter acquisition by calling interfaces: calling the SetNumericSensorEnable interface to enable the threshold type parameters inside the GPU, that is, setting the read state of the threshold type parameters in the target graphics processor to read-enabled; calling the SetStateSensorEnables interface to enable the state type parameters inside the GPU, that is, setting the read state of the state type parameters in the target graphics processor to read-enabled; reading the raw data of the state type parameters and the threshold type parameters by calling the GetStateSensorReadings interface and the GetSensorReading interface, respectively; and finally, calling the GetPDR interface to obtain the specific semantics of the threshold type and state type parameters, including the unit, the method for calculating the real value, the coefficients, and the like.
The GPU in the server is a microprocessor dedicated to image computation. To enable the BMC to monitor the GPU, the BMC sends a read instruction to the GPU at every preset interval. After receiving the read instruction, the GPU starts to collect its current GPU information, which reflects not only the current performance of the GPU but also other, non-performance information. Once collection is complete, the GPU sends the current GPU information to the BMC, and the BMC monitors the GPU according to the information received. Because the BMC sends the read instruction periodically, it monitors the GPU periodically, so that GPU problems can be found in time and abnormal conditions can be handled promptly.
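As an illustration of how the semantics obtained from the descriptor record might be applied to a raw reading, the following sketch assumes a linear conversion of the form (raw × resolution + offset) × 10^unitModifier. This formula is our reading of the PLDM numeric sensor descriptor and is not stated in the present application; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumericSensorPdr:
    """Subset of descriptor fields assumed to be needed for the conversion."""
    unit: str            # for example "degC", "V", "A"
    resolution: float    # multiplier applied to the raw reading
    offset: float        # constant added after scaling
    unit_modifier: int   # power of ten applied to the result


def to_real_value(raw: int, pdr: NumericSensorPdr) -> float:
    """Convert a raw sensor reading to a real value using the descriptor."""
    return (raw * pdr.resolution + pdr.offset) * 10 ** pdr.unit_modifier
```

For instance, with a resolution of 0.5, an offset of 0 and a unit modifier of 0, a raw reading of 200 corresponds to a real value of 100.0 in the descriptor's unit.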
For example, the preset temperature threshold of the type a GPU is 100 ℃, and the temperature parameter detected by the internal temperature sensor of the GPU is 120 ℃, which exceeds the preset temperature threshold, at this time, it is determined that the GPU is abnormal.
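The periodic check in this example can be sketched as follows. The function name and parameters are hypothetical; `read_temperature` stands in for whatever interface returns the converted temperature, and the sleep function is injectable so that the loop can be exercised without waiting.

```python
import time
from typing import Callable, List, Tuple


def monitor(read_temperature: Callable[[], float],
            threshold_c: float,
            period_s: float,
            cycles: int,
            sleep: Callable[[float], None] = time.sleep) -> List[Tuple[int, float]]:
    """Poll the temperature once per period and record every over-threshold reading."""
    abnormal: List[Tuple[int, float]] = []
    for cycle in range(cycles):
        value = read_temperature()
        if value > threshold_c:            # reading exceeds the preset threshold
            abnormal.append((cycle, value))
        if cycle + 1 < cycles:             # no need to wait after the last poll
            sleep(period_s)
    return abnormal
```

With a threshold of 100 °C, a polled reading of 120 °C is recorded as abnormal, while readings at or below the threshold are ignored.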
As an alternative implementation, the current temperature of the GPU and the temperature comparison result may be sent to the front-end interactive interface for presentation.
The first diagnostic module 3024 is configured to obtain a second target parameter in the target graphics processor 32 by sending the target instruction in the first instruction format, and analyze the second target parameter to obtain a specific reason for the occurrence of the abnormality in the target graphics processor 32;
specifically, the first diagnostic module 3024 is configured to diagnose the GPU by sending instructions in the IPMI OEM command format.
The second diagnosing module 3026 is configured to obtain a second target parameter in the target graphics processor 32 by sending a target instruction in a second instruction format, and analyze the second target parameter to obtain a specific reason for the occurrence of the abnormality in the target graphics processor 32.
Specifically, the second diagnostic module 3026 is used to diagnose the GPU by sending instructions in the RESTful URL command format.
In this embodiment, the first diagnostic module and the second diagnostic module may both achieve acquisition of all parameters in the GPU, and may select a suitable command format for use according to actual requirements. It should be noted that the two command formats are only examples, and the present solution is not limited to the two command formats for obtaining all the parameters of the GPU.
In this embodiment, the baseboard management controller 30 is a BMC, and the target graphics processor 32 is a PVC OAM GPU, which is a GPU from Intel Corporation whose out-of-band management physical channel is SMBus. The application layer implements a subset of the PLDM over MCTP protocol stack, and the baseboard management controller (BMC) can manage the PVC OAM GPU through PLDM over MCTP over SMBus.
Specifically, the BMC is a management controller specific to the server, and may automatically monitor a current operating state of the GPU in the server, such as monitoring a state of a sensor in the GPU server, accessing a BIOS configuration or a system operation console, and may timely regulate and control the GPU server according to the current operating state.
In addition to monitoring some key parameters of the GPU in real time and presenting them externally as sensors or through IPMI OEM commands, the solution of the present application also provides an interface function for diagnosing the GPU, for example an IPMI OEM command or a RESTful URL. When the GPU is abnormal, all parameters of the GPU may be acquired through the interface function to assist in locating the problem and to reduce operation and maintenance costs. This solves the problem in the related art that monitoring of the graphics processor is not sufficiently comprehensive and specific, and thereby helps ensure normal operation of the graphics processor.
In this embodiment, a graphics processor monitoring apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, and the description of which has been already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a schematic structural diagram of a graphics processor monitoring apparatus according to an embodiment of the present application, where as shown in fig. 4, the graphics processor monitoring apparatus includes:
the period monitoring module 40 is configured to acquire a first target parameter of the target graphics processor according to a preset period, where the first target parameter includes at least one of: a threshold type parameter, a state type parameter and an asset information parameter;
an exception determining module 42, configured to determine whether the target graphics processor is abnormal according to the first target parameter;
the parameter obtaining module 44 is configured to obtain, through the target instruction, a second target parameter in the target graphics processor under the condition that the target graphics processor is abnormal, where the second target parameter is all operation parameters of the target graphics processor detected in an operation process of the target graphics processor, and the second target parameter includes the first target parameter;
and an exception analysis module 46, configured to determine a cause of the exception of the target graphics processor by analyzing the second target parameter.
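The cooperation of the four modules can be sketched as a single monitoring cycle. The function and parameter names are hypothetical, and the analysis rule in the example (a stopped fan) is invented purely for illustration.

```python
from typing import Callable, Dict, Optional

Params = Dict[str, float]


def run_cycle(collect_key: Callable[[], Params],
              is_abnormal: Callable[[Params], bool],
              collect_all: Callable[[], Params],
              analyse: Callable[[Params], str]) -> Optional[str]:
    """One cycle: key parameters first, full parameters only on abnormality."""
    first = collect_key()            # period monitoring module
    if not is_abnormal(first):       # exception determining module
        return None
    second = collect_all()           # parameter obtaining module
    return analyse(second)           # exception analysis module


def example_analyse(all_params: Params) -> str:
    """Hypothetical rule: blame the fan if it reports zero speed."""
    if all_params.get("fan_rpm", 1.0) == 0.0:
        return "fan stopped"
    return "unknown"
```

The full parameter set is requested only when the key parameters indicate an abnormality, so the more expensive collection runs only when it is needed.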
In this embodiment, whether the target graphics processor is abnormal may be determined according to the threshold type parameter. Specifically, determining whether the target graphics processor is abnormal according to the first target parameter includes the following steps: determining a preset threshold corresponding to the threshold type parameter in the target graphics processor according to the asset information parameter, wherein the asset information parameter comprises at least one of the following: the model of the target graphics processor, the manufacturer of the target graphics processor, and the serial number of the target graphics processor; comparing the parameter value of the threshold type parameter with the preset threshold, wherein the threshold type parameter comprises at least one of the following: a temperature parameter in the target graphics processor, a voltage parameter in the target graphics processor, and a current parameter in the target graphics processor; and determining that the target graphics processor is abnormal when the parameter value of the threshold type parameter exceeds the preset threshold.
In this embodiment, whether the target graphics processor is abnormal may also be determined according to the state type parameter. Specifically, determining whether the target graphics processor is abnormal according to the first target parameter further includes: determining state information of the state type parameter, wherein the state information is either normal or abnormal; and determining that the target graphics processor is abnormal when the state information is abnormal.
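The threshold lookup and comparison can be sketched as below. The table keyed by model is a hypothetical data structure; the 100 °C temperature limit comes from the type A example earlier in this description, while the voltage and current limits are invented placeholders.

```python
from typing import Dict, List

# Hypothetical per-model thresholds; only the temperature value is from the text.
PRESET_THRESHOLDS: Dict[str, Dict[str, float]] = {
    "type-A": {"temperature_c": 100.0, "voltage_v": 13.0, "current_a": 50.0},
}


def exceeded(asset: Dict[str, str], readings: Dict[str, float]) -> List[str]:
    """Return the names of threshold type parameters that exceed their limits."""
    limits = PRESET_THRESHOLDS[asset["model"]]   # threshold chosen by asset info
    return sorted(name for name, value in readings.items()
                  if name in limits and value > limits[name])


def is_abnormal(asset: Dict[str, str], readings: Dict[str, float]) -> bool:
    """The GPU is judged abnormal if any threshold type parameter is exceeded."""
    return bool(exceeded(asset, readings))
```

Keying the thresholds on the asset information lets the same comparison logic serve different GPU models without code changes.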
In some embodiments of the present application, obtaining, by the target instruction, the second target parameter in the target graphics processor comprises: determining a target interface corresponding to a target instruction, wherein the target instruction is an instruction in a preset format, and the preset format comprises at least one of the following: a first instruction format and a second instruction format; and acquiring a second target parameter in the target graphic processor according to the target interface.
In this embodiment, the first instruction format is an IPMI OEM command format, and the second instruction format is a restful URL command format, both of which can achieve acquisition of all parameters in the GPU, and a suitable command format can be selected for use according to actual requirements. It should be noted that the two command formats are only examples, and the present solution is not limited to the two command formats for obtaining all the parameters of the GPU.
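The idea that both command formats resolve to the same target interface can be sketched as follows. The OEM network function, the command number, and the URL layout are all hypothetical, since the application does not specify them.

```python
from typing import Tuple
from urllib.parse import urlparse

# Hypothetical values; the application does not define the actual encodings.
OEM_NETFN = 0x3A
OEM_CMD_GET_ALL = 0x50

TargetCall = Tuple[str, int]   # (target interface name, GPU index)


def from_ipmi_oem(netfn: int, cmd: int, data: bytes) -> TargetCall:
    """First instruction format: an IPMI OEM command with a one-byte GPU index."""
    if netfn != OEM_NETFN or cmd != OEM_CMD_GET_ALL or len(data) != 1:
        raise ValueError("unsupported OEM command")
    return ("get_all_parameters", data[0])


def from_rest_url(url: str) -> TargetCall:
    """Second instruction format: a URL shaped like /api/gpu/<index>/diagnostics."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 4 or parts[:2] != ["api", "gpu"] or parts[3] != "diagnostics":
        raise ValueError("unsupported URL")
    return ("get_all_parameters", int(parts[2]))
```

Both parsers return the same pair, so the layers below need only one implementation of the parameter acquisition regardless of which format the operator used.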
In some embodiments of the present application, obtaining the second target parameter in the target graphics processor according to the target interface comprises: setting the reading state of the parameter in the target graphic processor as allowing to read; and reading all the operating parameters in the target graphic processor, and packaging through a target protocol to obtain a second target parameter, wherein the target protocol is a protocol supported by the target graphic processor.
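A minimal sketch of the enable-then-read sequence, using an in-memory stand-in for the GPU. The `FakeGpu` class and its methods are hypothetical, and the protocol encapsulation step is left out of the sketch.

```python
from typing import Dict, List


class FakeGpu:
    """In-memory stand-in: a parameter can be read only after it is enabled."""

    def __init__(self, values: Dict[str, float]) -> None:
        self._values = dict(values)
        self._enabled = set()

    def names(self) -> List[str]:
        return sorted(self._values)

    def enable(self, name: str) -> None:
        if name not in self._values:
            raise KeyError(name)
        self._enabled.add(name)

    def read(self, name: str) -> float:
        if name not in self._enabled:
            raise PermissionError(f"{name} is not read-enabled")
        return self._values[name]


def read_all(gpu: FakeGpu) -> Dict[str, float]:
    """Set every parameter to read-enabled, then read all of them."""
    snapshot: Dict[str, float] = {}
    for name in gpu.names():
        gpu.enable(name)              # step 1: allow reading
        snapshot[name] = gpu.read(name)   # step 2: read the value
    return snapshot
```

Reading before enabling raises an error in this stand-in, which makes the ordering of the two steps explicit.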
In order to improve the real-time performance and the comprehensiveness of GPU monitoring, the method also comprises the following steps before acquiring a second target parameter in a target graphic processor through a target instruction with a preset format: and sending alarm information under the condition that the target graphic processor is abnormal, and sending the abnormal first target parameter to a front-end interface for displaying.
With this scheme, maintenance personnel do not need to install the GPU driver or related monitoring software, which simplifies the monitoring process and improves monitoring efficiency. In addition, the BMC can automatically acquire key GPU parameter information in real time, so that the state of the GPU is monitored automatically; furthermore, when the key parameters are abnormal, all parameters of the GPU are acquired to further analyze the cause of the abnormality.
In addition to monitoring some key parameters of the GPU in real time and presenting them externally as sensors or through IPMI OEM commands, the solution of the present application also provides an interface function for diagnosing the GPU, for example an IPMI OEM command or a RESTful URL. When the GPU is abnormal, all parameters of the GPU may be acquired through the interface function to assist in locating the problem and to reduce operation and maintenance costs.
Through the steps, the problem that monitoring on the graphics processor is not comprehensive and specific enough in the related technology is solved, the effect of ensuring normal work of the graphics processor is further achieved, the method can be used for assisting in positioning, and the operation and maintenance cost is reduced.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present application further provide a computer-readable storage medium having a computer program stored therein, where the computer program is configured to perform the steps in any of the above method embodiments when executed: acquiring first target parameters of a target graphic processor according to a preset period, wherein the first target parameters comprise at least one of the following parameters: a threshold type parameter, a state type parameter and an asset information parameter; judging whether the target graphic processor is abnormal or not according to the first target parameter; under the condition that the target graphics processor is abnormal, acquiring a second target parameter in the target graphics processor through a target instruction, wherein the second target parameter is all operation parameters of the target graphics processor detected in the operation process of the target graphics processor, and the second target parameter comprises the first target parameter; and determining the abnormal reason of the target graphic processor by analyzing the second target parameter.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present application further provide an electronic device, including a memory and a processor, where the memory stores a computer program, and the computer terminal may include one or more processors (the processor may include but is not limited to a processing device such as a microprocessor MCU or a programmable logic device FPGA), and the processor is configured to execute the computer program to perform the steps of the graphics processor monitoring method in any one of the above method embodiments: acquiring a first target parameter of a target graphic processor according to a preset period, wherein the first target parameter comprises at least one of the following parameters: a threshold type parameter, a state type parameter and an asset information parameter; judging whether the target graphic processor is abnormal or not according to the first target parameter; under the condition that the target graphics processor is abnormal, acquiring a second target parameter in the target graphics processor through a target instruction, wherein the second target parameter is all operation parameters of the target graphics processor detected in the operation process of the target graphics processor, and the second target parameter comprises the first target parameter; and determining the abnormal reason of the target graphic processor by analyzing the second target parameter. In this embodiment, the processor may be a BMC.
Specifically, the modules run by the baseboard management controller 3 include: a functional application layer, a component management layer, and a protocol encapsulation layer. In this embodiment, the protocol encapsulation layer includes a PLDM protocol layer and an MCTP protocol layer.
Specifically, the PLDM protocol layer is a concrete implementation of the PLDM Base, PLDM Control and Monitoring, and PLDM FRU protocols; the MCTP protocol layer is a concrete implementation of the MCTP Base and Binding protocols. The application adopts a layered design: each layer provides an API to the layer above it and calls the API of the layer below it.
Specifically, the component management layer provides a uniform interface to the functional application layer, wraps a layer of API internally, and interacts with the PLDM protocol layer through a pair of FIFO (first in, first out) queues, which ensures stable communication with, and decoupling from, the PLDM and MCTP protocol layers.
In some embodiments of the present application, the functional application layer comprises: the device comprises a real-time monitoring module, a first diagnosis module and a second diagnosis module.
Specifically, the real-time monitoring module is configured to monitor key parameters of the GPU in real time and may complete parameter acquisition by calling interfaces: calling the SetNumericSensorEnable interface to enable the threshold type parameters inside the GPU, that is, setting the read state of the threshold type parameters in the target graphics processor to read-enabled; calling the SetStateSensorEnables interface to enable the state type parameters inside the GPU, that is, setting the read state of the state type parameters in the target graphics processor to read-enabled; reading the raw data of the state type parameters and the threshold type parameters by calling the GetStateSensorReadings interface and the GetSensorReading interface, respectively; and finally, calling the GetPDR interface to obtain the specific semantics of the threshold type and state type parameters, including the unit, the method for calculating the real value, the coefficients, and the like. The first diagnostic module is used to diagnose the GPU by sending instructions in the IPMI OEM command format. The second diagnostic module is used to diagnose the GPU by sending instructions in the RESTful URL command format.
In this embodiment, both the IPMI OEM command format and the restful URL command format may be used to obtain all parameters in the GPU, and a suitable command format may be selected for use according to actual requirements. It should be noted that the two command formats are only examples, and the present solution is not limited to the two command formats for obtaining all the parameters of the GPU.
In an exemplary embodiment, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
In this embodiment, the input/output device may be a front-end interactive page;
specifically, the target object can view and modify the target instruction in the front-end interaction page according to actual requirements; in addition, the front-end interactive page can also display the acquired information such as key parameters, all parameters and the like in the target graphic processor.
As an optional implementation manner, before acquiring the second target parameter in the target graphics processor through the target instruction in the preset format, the method further includes the following steps: and sending alarm information under the condition that the target graphic processor is abnormal, and sending the abnormal first target parameter to a front-end interface for displaying.
With this scheme, maintenance personnel do not need to install the GPU driver or related monitoring software, which simplifies the monitoring process and improves monitoring efficiency. In addition, the BMC can automatically acquire key GPU parameter information in real time, so that the state of the GPU is monitored automatically; furthermore, when the key parameters are abnormal, all parameters of the GPU are acquired to further analyze the cause of the abnormality.
In addition to monitoring some key parameters of the GPU in real time and presenting them externally as sensors or through IPMI OEM commands, the solution of the present application also provides an interface function for diagnosing the GPU, for example an IPMI OEM command or a RESTful URL. When the GPU is abnormal, all parameters of the GPU may be acquired through the interface function to assist in locating the problem and to reduce operation and maintenance costs. This solves the problem in the related art that monitoring of the graphics processor is not sufficiently comprehensive and specific, and thereby helps ensure normal operation of the graphics processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the various modules or steps of the present application described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing devices, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into separate integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the principle of the present application shall be included in the protection scope of the present application.

Claims (11)

1. A graphics processor monitoring method, comprising:
acquiring first target parameters of a target graphic processor according to a preset period, wherein the first target parameters comprise at least one of the following parameters: a threshold type parameter, a state type parameter and an asset information parameter;
judging whether the target graphic processor is abnormal or not according to the first target parameter;
under the condition that the target graphics processor is abnormal, acquiring a second target parameter in the target graphics processor through a target instruction, wherein the second target parameter is all operation parameters of the target graphics processor detected in the operation process of the target graphics processor, and the second target parameter comprises the first target parameter;
and determining the abnormal reason of the target graphic processor by analyzing the second target parameter.
2. The method of claim 1, wherein obtaining, via a target instruction, a second target parameter in the target graphics processor comprises:
determining a target interface corresponding to the target instruction, wherein the target instruction is an instruction in a preset format, and the preset format includes at least one of the following: a first instruction format and a second instruction format;
and acquiring a second target parameter in the target graphic processor according to the target interface.
3. The method of claim 2, wherein obtaining second target parameters in the target graphics processor in accordance with the target interface comprises:
setting the reading state of the parameters in the target graphics processor as reading permission;
and reading all the operating parameters in the target graphic processor, and packaging through the target protocol to obtain the second target parameter, wherein the target protocol is a protocol supported by the target graphic processor.
4. The method of claim 1, wherein determining whether the target graphics processor is anomalous based on the first target parameter comprises:
determining a preset threshold corresponding to the threshold type parameter in the target graphic processor according to the asset information parameter, wherein the asset information parameter comprises at least one of the following parameters: the model of the target graphics processor, the manufacturer of the target graphics processor, and the serial number of the target graphics processor;
comparing the parameter value of the threshold type parameter with the preset threshold, wherein the threshold type parameter comprises at least one of the following: a temperature parameter in the target graphics processor, a voltage parameter in the target graphics processor, a current parameter in the target graphics processor;
and under the condition that the threshold type parameter exceeds the preset threshold, judging that the target graphics processor is abnormal.
5. The method of claim 1, wherein determining whether the target graphics processor is anomalous based on the first target parameter further comprises:
determining state information of the state type parameter, wherein the state information comprises: normal and abnormal;
and under the condition that the state information is abnormal, judging that the target graphics processor has an abnormality.
6. The method of claim 1, wherein, before obtaining the second target parameter in the target graphics processor via the target instruction in the preset format, the method further comprises:
and sending alarm information under the condition that the target graphic processor is abnormal, and sending the abnormal first target parameter to a front-end interface for displaying.
7. A graphics processor monitoring system, comprising: the system comprises a baseboard management controller and a target graphic processor, wherein modules running in the baseboard management controller comprise: a functional application layer, a component management layer, and a protocol encapsulation layer, and,
the protocol encapsulation layer performs data interaction with the target graphics processor, and is used for encapsulating parameter data in the target graphics processor and generating a target interface, wherein the target interface is used for being called by the component management layer;
the component management layer is used for receiving a target instruction issued by the function application layer and calling the target interface to acquire parameter data in the target graphic processor according to the target instruction;
and the functional application layer acquires the parameter data for analysis by sending a target instruction.
8. The graphics processor monitoring system of claim 7, wherein the functional application layer comprises: a real-time monitoring module, a first diagnostic module, a second diagnostic module, wherein,
the real-time monitoring module is configured to collect a first target parameter of a target graphics processor according to a preset period, and determine whether the target graphics processor is abnormal according to the first target parameter, where the first target parameter includes: a threshold type parameter, a state type parameter and an asset information parameter;
the first diagnosis module is used for acquiring a second target parameter in the target graphic processor by sending a target instruction in a first instruction format, and analyzing the second target parameter to obtain a specific reason for the occurrence of the abnormality in the target graphic processor;
the second diagnosis module is used for acquiring a second target parameter in the target graphics processor by sending a target instruction in a second instruction format, and analyzing the second target parameter to acquire a specific reason for the abnormality in the target graphics processor.
9. A graphics processor monitoring apparatus, comprising:
the period monitoring module is used for acquiring a first target parameter of the target graphic processor according to a preset period, wherein the first target parameter comprises at least one of the following parameters: a threshold type parameter, a state type parameter and an asset information parameter;
the abnormity judging module is used for judging whether the target graphic processor is abnormal or not according to the first target parameter;
a parameter obtaining module, configured to obtain, through a target instruction, a second target parameter in the target graphics processor when the target graphics processor is abnormal, where the second target parameter is all operation parameters of the target graphics processor detected in an operation process of the target graphics processor, and the second target parameter includes the first target parameter;
and the abnormality analysis module is used for determining the reason of the abnormality of the target graphics processor by analyzing the second target parameter.
10. A computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method of graphics processor monitoring as claimed in any one of the claims 1 to 6.
11. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the steps of the graphics processor monitoring method as claimed in any one of claims 1 to 6 are implemented when the computer program is executed by the processor.
CN202211228713.4A 2022-10-09 2022-10-09 Graphics processor monitoring method, system and device and electronic equipment Pending CN115543746A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211228713.4A CN115543746A (en) 2022-10-09 2022-10-09 Graphics processor monitoring method, system and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211228713.4A CN115543746A (en) 2022-10-09 2022-10-09 Graphics processor monitoring method, system and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115543746A true CN115543746A (en) 2022-12-30

Family

ID=84733855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211228713.4A Pending CN115543746A (en) 2022-10-09 2022-10-09 Graphics processor monitoring method, system and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115543746A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117314728A (en) * 2023-11-29 2023-12-29 深圳市七彩虹禹贡科技发展有限公司 GPU operation regulation and control method and system
CN117314728B (en) * 2023-11-29 2024-03-12 深圳市七彩虹禹贡科技发展有限公司 GPU operation regulation and control method and system

Similar Documents

Publication Publication Date Title
CN110908879B (en) Reporting method, reporting device, reporting terminal and recording medium of buried point data
WO2023115999A1 (en) Device state monitoring method, apparatus, and device, and computer-readable storage medium
CN103995500B (en) Controller and information processor
CN109240966A (en) A kind of accelerator card based on CPLD, collecting method and device
CN112286709A (en) Diagnosis method, diagnosis device and diagnosis equipment for server hardware faults
CN106484459B (en) Flow control method and device applied to JavaScript
CN112506755A (en) Log collection method and device, computer equipment and storage medium
CN115543746A (en) Graphics processor monitoring method, system and device and electronic equipment
US11023335B2 (en) Computer and control method thereof for diagnosing abnormality
CN110943865A (en) Method and device for diagnosing equipment fault time and related equipment
CN116560586B (en) Determination method and device of attribute value, storage medium and electronic equipment
CN103778024A (en) Server system and message processing method thereof
CN115543872A (en) Equipment management method and device and computer storage medium
CN115599617B (en) Bus detection method and device, server and electronic equipment
CN116521478A (en) Board state monitoring method, system, electronic equipment and medium
CN112463574A (en) Software testing method, device, system, equipment and storage medium
CN115525392A (en) Container monitoring method and device, electronic equipment and storage medium
CN115543759A (en) Log lookup method and device for operating system, electronic device and storage medium
CN107342916A (en) Monitoring method, device and the server of server info
CN113608982A (en) Function execution performance monitoring method and device, computer equipment and storage medium
CN113965447A (en) Online cloud diagnosis method, device, system, equipment and storage medium
CN109120422B (en) Remote server system capable of obtaining hardware information and management method thereof
CN111524053B (en) Information acquisition method, device, equipment and medium of air quality prediction system
CN109144798B (en) Intelligent management system with machine learning function
CN116149941A (en) Monitoring method and device of server component, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination