CN115576774A - Method, system, computer device and medium for managing graphic processor device - Google Patents


Info

Publication number
CN115576774A
Authority
CN
China
Prior art keywords
chip
graphics processor
information
pcie
remote management
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211234208.0A
Other languages
Chinese (zh)
Inventor
李秀艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202211234208.0A
Publication of CN115576774A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3031 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a motherboard or an expansion card
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • G06F11/3062 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations where the monitored property is the power consumption
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G06F11/3093 Configuration details thereof, e.g. installation, enabling, spatial arrangement of the probes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method, a system, a computer device and a medium for managing a graphics processor device. The graphics processor device comprises at least a graphics processor chip and a remote management chip, and the graphics processor chip is further connected to PCIe (Peripheral Component Interconnect Express). The method comprises the following steps: acquiring firmware information of the graphics processor chip based on PCIe; acquiring state information of the graphics processor chip based on the remote management chip; and matching and executing a target management strategy based on the firmware information of the graphics processor chip, the state information of the graphics processor chip and a preset management rule. A remote management chip is added to the graphics processor device so that the device can be managed through instructions, and the remote management chip scheme provides convenient out-of-band management of the graphics processor.

Description

Graphics processor device management method, system, computer device, and medium
Technical Field
The present invention relates to the field of graphics processors, and in particular, to a method, system, computer device, and medium for managing a graphics processor device.
Background
With the development of big data and artificial intelligence, data services keep expanding and the performance demands on servers keep rising, which has led to PCIe cards offering heterogeneous acceleration, graphics rendering, model inference and model training functions. Inserting a GPU card into a PCIe slot of the server through the PCIe bus interface has become the mainstream way for server manufacturers to enhance server performance.
As AI servers use more and more GPU cards, the GPU card itself must be easier to manage out of band. GPU cards currently on the market implement partial out-of-band management, but as AI servers are applied more deeply, power consumption rises, power supply design becomes more complex, and servers place more requirements on managing the GPU cards they use.
Disclosure of Invention
The object of the invention is to provide a graphics processor device management method, system, computer device, and medium.
The technical solution of the invention is as follows. In a first aspect, the present invention provides a method for managing a graphics processor device, where the graphics processor device comprises at least a graphics processor chip and a remote management chip, and the graphics processor chip is further connected to PCIe; the method comprises the following steps:
acquiring firmware information of the graphics processor chip based on the PCIe;
acquiring state information of the graphics processor chip based on the remote management chip;
and matching and executing a target management strategy based on the firmware information of the graphics processor chip, the state information of the graphics processor chip and a preset management rule.
In a preferred embodiment, the PCIe establishes a connection with the host server through a gold finger, and a system management bus is provided in the gold finger;
the graphics processor chip is provided with at least a field replaceable unit, an electrically erasable programmable read-only memory and a temperature chip;
the acquiring the firmware information of the graphics processor chip based on the PCIe comprises:
reading the field replaceable unit information, the electrically erasable programmable read-only memory information, and the temperature chip information based on the PCIe and the system management bus.
In a preferred embodiment, the method further comprises:
transmitting a PCIe x16 signal sent by the host server to the graphics processor chip through the gold finger based on the PCIe;
and acquiring data calculated by the graphics processor chip based on the PCIe x16 signal and transmitting the data to the host server.
In a preferred embodiment, the graphics processor chip is powered through the gold finger and a power connector; the method further comprises the following steps:
reading the remote management chip state information and the graphics processor chip voltage state information based on the PCIe and the system management bus.
In a preferred embodiment, the obtaining the state information of the graphics processor chip based on the remote management chip includes:
and acquiring power supply information and out-of-band information of the graphics processor chip based on the remote management chip.
In a preferred embodiment, a microcontroller unit is arranged in the remote management chip; the acquiring power supply information and out-of-band information of the graphics processor chip based on the remote management chip comprises:
monitoring and acquiring power supply power data and power supply state signal data based on the remote management chip, wherein the power supply power data at least comprises peak power supply power data and power supply power data of each component in the graphics processor device;
and acquiring internal temperature data, whole-card power data, video memory temperature data, clock frequency, memory error checking and correcting data, power supply power data, a power supply state signal, a graphics processor state signal and a remote management chip firmware version number of the graphics processor chip based on the microcontroller unit and the bidirectional two-wire synchronous serial bus (I2C).
In a preferred embodiment, the obtaining the power information and the out-of-band information of the graphics processor chip based on the remote management chip further comprises:
and detecting and acquiring the fault information and PCIe interface information of the graphics processor chip based on the GPIO resource of the remote management chip.
In a second aspect, the present invention further provides a graphics processor device management system, where the graphics processor device at least includes a graphics processor chip and a remote management chip, and the graphics processor chip is further connected to PCIe; the system comprises:
the first acquisition module is used for acquiring the firmware information of the graphics processor chip based on the PCIe;
the second acquisition module is used for acquiring the state information of the graphics processor chip based on the remote management chip;
and the matching execution module is used for matching and executing a target management strategy based on the firmware information of the graphics processor chip, the state information of the graphics processor chip and a preset management rule.
In a third aspect, the present invention also provides a computer apparatus, the apparatus comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the graphics processor device management method according to any implementation of the first aspect.
In a fourth aspect, the invention also provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method according to any implementation of the first aspect.
The advantages of the invention are as follows. A method, system, computer device and medium for managing a graphics processor device are provided. The graphics processor device comprises at least a graphics processor chip and a remote management chip, and the graphics processor chip is further connected to PCIe. The method comprises the following steps: acquiring firmware information of the graphics processor chip based on PCIe; acquiring state information of the graphics processor chip based on the remote management chip; and matching and executing a target management strategy based on the firmware information of the graphics processor chip, the state information of the graphics processor chip and a preset management rule. A remote management chip is added to the graphics processor device so that the device can be managed through instructions, and the remote management chip scheme provides convenient out-of-band management of the graphics processor.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a block diagram illustrating an architecture for graphics processor device management as provided herein;
FIG. 2 is a schematic diagram of an I2C topology of an architecture for graphics processor device management as provided in the application;
FIG. 3 is a flow chart of a method for managing a graphics processor device provided herein;
FIG. 4 is a diagram of a graphics processor device management system architecture as provided herein;
FIG. 5 is a diagram of a computer device architecture provided herein.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As described in the background, as AI servers use more and more GPU cards, the GPU card itself must be easier to manage out of band. GPU cards currently on the market implement partial out-of-band management, but as AI servers are applied more deeply, power consumption rises, power supply design becomes more complex, servers place more requirements on GPU card management, and existing out-of-band management methods increasingly fail to meet these requirements.
To solve the above problems, the present application creatively provides a method, a system, a computer device and a computer-readable storage medium for managing a graphics processor device, in which a remote management chip is added to the graphics processor device so that the device can be managed through instructions. The remote management chip scheme provides convenient out-of-band management of the graphics processor and meets current GPU out-of-band management requirements.
The embodiments of the present application will be described in detail below with reference to the drawings and various embodiments.
The first embodiment: this embodiment describes the architecture used for graphics processor device management in the present application.
Referring to FIG. 1, the architecture includes: a main control chip, which in one example is an AMD GPU main control chip;
a video memory, which in one example uses 16 Gb GDDR6 (Graphics Double Data Rate 6) memory chips, for a total video memory of 32 GB;
a power connector, where in one example power is supplied through 2 PCIe power connectors and the maximum power consumption of the whole graphics processor device is 300 W;
a PCIe interface, where in one example the GPU PCIe interface is PCIe x16 Gen4;
a remote management chip, whose main function is to aggregate additional I2C and sideband signals and which communicates with the motherboard BMC (Baseboard Management Controller) through the gold-finger I2C, improving the manageability of the GPU;
a power management chip, which in one example comprises IR35217M/INA3221/XDPE132G5D power management chips;
the power management chip is connected with a PCA9555 (I2C sideband I/O expander) and a temperature sensor;
the main control chip is connected with the host server through the gold finger.
The PCIe interface receives the PCIe x16 signal from the host server through the gold finger and passes it to the GPU card; after computation by the chip, the data are sent back to the host server. The graphics processor device is powered through the gold finger and two +12V power connectors. The remote management chip controls the switching-on of each power supply chip and, after power conversion, receives the state of each power conversion chip. I2C links exist among the chips so that they can communicate with one another and update firmware. The host server can read the FRU (Field Replaceable Unit), EEPROM (Electrically Erasable Programmable Read-Only Memory) and temperature chip information on the board through the SMBus (System Management Bus) on the gold finger, and can also read the state of the remote management chip through the SMBus to learn the on-board voltage state and the state of the GPU card.
The remote management chip collects GPU and video memory information through the SMBus, and the BMC accesses the remote management chip through the SMBus to manage the GPU card. The GPU card adds an EMC1413 to monitor the inlet and outlet temperatures, a PCA9555 (I/O expander) to obtain the Board ID (card identifier), PCB Version ID (card circuit board version identifier) and BOM ID (bill of materials identifier), and an FRU to store the FRU information of the board.
FIG. 2 shows the I2C topology of this graphics processor device management architecture.
The second embodiment: based on the architecture described in the first embodiment, this embodiment describes the graphics processor device management process of the present application with reference to FIG. 3.
Specifically, this embodiment provides a method for managing a graphics processor device, where the graphics processor device at least includes a graphics processor chip and a remote management chip, and the graphics processor chip is further connected to PCIe; the method comprises the following steps:
S310, acquiring firmware information of the graphics processor chip based on PCIe.
In one embodiment, the PCIe establishes a connection with the host server via a gold finger, and a system management bus is provided in the gold finger; the graphics processor chip is provided with at least a field replaceable unit, an electrically erasable programmable read-only memory and a temperature chip;
the step comprises the following:
S311, reading the field replaceable unit information, the electrically erasable programmable read-only memory information and the temperature chip information based on the PCIe and the system management bus.
Specifically, the host server reads the information of the field replaceable unit, the electrically erasable programmable read-only memory and the temperature chip on the graphics processor chip through the system management bus on the gold finger.
Preferably, the method further comprises:
S312, transmitting a PCIe x16 signal sent by the host server to the graphics processor chip through the gold finger based on the PCIe.
S313, acquiring the data calculated by the graphics processor chip based on the PCIe x16 signal and transmitting the data to the host server.
Specifically, the PCIe interface receives the PCIe x16 signal from the host server and passes it to the graphics processor chip via the gold finger; after the graphics processor chip has performed its computation, the computed data are transmitted back to the host server via the gold finger.
Preferably, the graphics processor chip is powered through the gold finger and a power connector; the method further comprises the following steps:
S314, reading the state information of the remote management chip and the voltage state information of the graphics processor chip based on the PCIe and the system management bus.
Specifically, the graphics processor chip is powered through the gold finger and two +12V power connectors. The remote management chip controls the switching-on of each power supply chip and, after power conversion, receives the state of each power conversion chip. I2C links exist among the chips so that they can communicate with one another and update firmware. By reading the state of the remote management chip through the system management bus on the gold finger, the host server obtains the current voltage state of the graphics processor device and the state information of the graphics processor chip.
S320, acquiring the state information of the graphics processor chip based on the remote management chip.
Specifically, the power information and out-of-band information of the graphics processor chip are obtained based on the remote management chip.
Specifically, the power supply information is monitored through the analog-to-digital converter (ADC) and the GPIO (general-purpose input/output) of the remote management chip, and the out-of-band information of the graphics processor chip is acquired through the MCU (microcontroller unit) of the remote management chip.
In one embodiment, a microcontroller unit is arranged in the remote management chip; the acquiring the power supply information and the out-of-band information of the graphics processor chip based on the remote management chip comprises:
S321, acquiring power supply power data and power supply state signal data through monitoring by the remote management chip, wherein the power supply power data at least comprise peak power supply power data and power supply power data of each component in the graphics processor device.
Specifically, the ADC of the remote management chip monitors the following 7 power rails to obtain the power supply power data: P0V75, VDDCR_GFX, VPP, VDDCR_SOC, VDD_MEM, VDDCI_MEM, P1V8;
the GPIO of the remote management chip monitors the power-good signals, and the following 5 power supply state signals are monitored to obtain the power supply state signal data: +0.75V_PWRGD, VPP_PWRGD, +1.8V_PWRGD, GFX_SOC_PWRGD, MVDD_VDDCI_PWRGD.
S322, acquiring internal temperature data, whole-card power data, video memory temperature data, clock frequency, memory error checking and correcting data, power supply power data, power supply state signals, graphics processor state signals and the remote management chip firmware version number of the graphics processor chip based on the microcontroller unit and the bidirectional two-wire synchronous serial bus (I2C).
Specifically, (1) the host server BMC and the microcontroller unit of the remote management chip acquire the internal temperature data of the graphics processor chip through I2C (Inter-Integrated Circuit, a bidirectional two-wire synchronous serial bus) communication. A temperature sensor is arranged in the graphics processor chip, and the graphics processor chip writes the temperature value of the sensor into the corresponding register. The microcontroller unit acts as the I2C slave device and the host server BMC as the I2C master device; the microcontroller unit provides an out-of-band I2C call interface to the host server BMC, and the host server BMC acquires the internal temperature data of the graphics processor chip by calling this interface. The internal temperature of the graphics processor chip can be read normally both before and after the graphics processor driver is loaded: before the driver is loaded the temperature is read from an external sensor, and after the driver is loaded it is read from the graphics processor chip. This makes no difference to the remote management chip, which reads the same register in both cases.
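The single-register behaviour described in item (1) can be sketched as a toy model, written here in Python purely for illustration. The register offset, the class and the function names are assumptions; the source gives no register map and no firmware code.

```python
# Hypothetical register offset; the source does not give a register map.
GPU_TEMP_REG = 0x00


class RemoteManagementMcu:
    """Toy model of the MCU acting as an I2C slave with a small register file."""

    def __init__(self) -> None:
        self.registers = {GPU_TEMP_REG: 0}

    def refresh_temperature(self, driver_loaded: bool,
                            external_sensor_c: int, gpu_sensor_c: int) -> None:
        # Before driver load the value comes from an external sensor,
        # afterwards from the GPU chip; either way it lands in one register.
        value = gpu_sensor_c if driver_loaded else external_sensor_c
        self.registers[GPU_TEMP_REG] = value & 0xFF

    def i2c_read_byte(self, register: int) -> int:
        # Out-of-band read interface exposed to the I2C master (the BMC).
        return self.registers[register]


def bmc_read_gpu_temperature(mcu: RemoteManagementMcu) -> int:
    """BMC side: always reads the same register, whatever the source was."""
    return mcu.i2c_read_byte(GPU_TEMP_REG)
```

The point of the design is that the BMC code path does not change when the driver loads; only the MCU decides where the value comes from.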
(2) The host server BMC and the microcontroller unit of the remote management chip acquire the whole-card power data of the graphics processor through I2C communication:
the graphics processor chip writes the whole-card power value into the corresponding register; the microcontroller unit acts as the I2C slave device and the host server BMC as the I2C master device; the microcontroller unit provides an out-of-band I2C call interface to the host server BMC, and the host server BMC calls this interface to acquire the whole-card power data. Reading the whole-card power data requires the graphics processor driver: before the driver is loaded, a read returns a fixed register value indicating that the reading is known to be unavailable, and after the driver is loaded the real power can be read normally.
(3) The host server BMC and the microcontroller unit acquire the video memory temperature data through I2C communication:
the graphics processor chip provides a register corresponding to the GDDR (Graphics Double Data Rate) temperature value; the microcontroller unit acts as the I2C slave device and the host server BMC as the I2C master device; the microcontroller unit provides an out-of-band I2C call interface to the host server BMC, and the host server BMC calls this interface to acquire the GDDR temperature. The real GDDR temperature value can be read only after the graphics processor driver is loaded. Before the driver is loaded the value cannot be read, and when the BMC reads the GDDR temperature at that point the microcontroller unit returns a fixed value (known to mean unreadable).
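A minimal sketch of the fixed-value convention used in items (2) and (3), in Python for illustration. The source says only that a fixed value is returned before the driver is loaded; the value 0xFF and both function names below are assumptions.

```python
from typing import Optional

# Hypothetical placeholder; the source only says "a fixed value".
UNAVAILABLE = 0xFF


def mcu_gddr_temperature_reply(driver_loaded: bool, gddr_temp_c: int) -> int:
    """MCU side: answer the BMC with the real value or the fixed placeholder."""
    if not driver_loaded:
        return UNAVAILABLE
    return gddr_temp_c & 0xFF


def bmc_decode_gddr_temperature(raw: int) -> Optional[int]:
    """BMC side: map the placeholder to None so callers cannot mistake it for data."""
    return None if raw == UNAVAILABLE else raw
```

The same pattern would apply to the whole-card power reading, with its own placeholder value.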
(4) The host server BMC and the microcontroller unit acquire the clock frequency of the graphics processor through I2C communication:
the graphics processor provides a register corresponding to the graphics processor clock frequency; the microcontroller unit acts as the I2C slave device and the host server BMC as the I2C master device; the microcontroller unit provides an out-of-band I2C call interface to the host server BMC, and the host server BMC calls this interface to acquire the clock frequency of the graphics processor.
(5) The host server BMC obtains the memory error checking and correcting data through out-of-band I2C: the host server BMC obtains the total memory ECC (Error Checking and Correcting) count and the channel where the ECC error occurred through out-of-band I2C; the out-of-band read function does not need to report the per-memory ECC count. The graphics processor provides registers corresponding to the total memory ECC count and the channel where the ECC error occurred; the microcontroller unit acts as the I2C slave device and the host server BMC as the I2C master device; the microcontroller unit provides an out-of-band I2C call interface to the host server BMC, and the host server BMC calls this interface to acquire the total memory ECC count and the channel where the ECC error occurred.
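As an illustration of item (5), the following Python sketch decodes an ECC total and the affected channels from raw register bytes. The layout (a 16-bit little-endian count plus an 8-bit channel bitmask) is entirely hypothetical; the source states only that registers for the total and the channel exist.

```python
from typing import List, Tuple


def decode_ecc_registers(total_lo: int, total_hi: int,
                         channel_mask: int) -> Tuple[int, List[int]]:
    """Decode a hypothetical ECC register layout read by the BMC.

    total_lo / total_hi: low and high bytes of the total ECC count.
    channel_mask: one bit per memory channel, set if an ECC error occurred.
    """
    total = (total_hi << 8) | total_lo
    channels = [ch for ch in range(8) if (channel_mask >> ch) & 1]
    return total, channels
```

For example, bytes 0x34 and 0x12 with mask 0b00000101 would decode to a total of 0x1234 on channels 0 and 2.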
(6) The remote management chip also supports monitoring and acquiring the power supply power data:
the microcontroller unit monitors the following power rails through its internal ADC: P0V75, VDDCR_GFX, VPP, VDDCR_SOC, VDD_MEM, VDDCI_MEM, P1V8.
The microcontroller unit has a 12-bit ADC; the power supply power data are obtained by converting the data collected by the ADC, and the host server BMC obtains them by communicating with the microcontroller unit over I2C.
The default voltage value of each channel is set to 0xFFFF. If the ADC of a channel cannot acquire data, the microcontroller unit returns 0xFE for that channel (acquisition attempted but no data obtained) when the BMC requests the channel's data.
The microcontroller unit configures the ADC component in an integrated development environment, which automatically generates the ADC application programming interfaces; different interfaces are called to implement the power supply monitoring.
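The ADC readout in item (6) can be sketched as follows, in Python for illustration. The 12-bit resolution, the 0xFFFF default and the rail names come from the text; the 3.3 V reference voltage, the absence of any resistor divider and all function names are assumptions. On the wire the MCU would answer 0xFE for a channel with no sample; the sketch returns None in that case so the caller cannot confuse the marker with a measurement.

```python
from typing import Dict, Optional

ADC_BITS = 12            # from the text: 12-bit ADC
ADC_DEFAULT = 0xFFFF     # from the text: per-channel default value
VREF_MV = 3300           # assumption: 3.3 V ADC reference, no divider

RAILS = ["P0V75", "VDDCR_GFX", "VPP", "VDDCR_SOC",
         "VDD_MEM", "VDDCI_MEM", "P1V8"]


def counts_to_millivolts(counts: int, vref_mv: int = VREF_MV) -> int:
    """Convert a raw ADC count to millivolts using integer arithmetic."""
    full_scale = (1 << ADC_BITS) - 1
    if not 0 <= counts <= full_scale:
        raise ValueError(f"count {counts} is outside the 12-bit range")
    return counts * vref_mv // full_scale


def read_rail_mv(samples: Dict[str, int], rail: str) -> Optional[int]:
    """Return the rail voltage in mV, or None if the channel holds no sample."""
    raw = samples.get(rail, ADC_DEFAULT)
    if raw == ADC_DEFAULT:
        return None  # the MCU would report 0xFE for this channel
    return counts_to_millivolts(raw)
```

A mid-scale count of 2048 on VPP would read as 1650 mV under these assumptions.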
(7) The microcontroller unit monitors the power supply state signals through GPIO: +0.75V_PWRGD, VPP_PWRGD, +1.8V_PWRGD, GFX_SOC_PWRGD, VDD_VDDCI_PWRGD.
While the graphics processor device is running, the power-good signals are monitored by detecting the GPIO levels of the microcontroller unit: a high GPIO level means the corresponding power supply state signal is normal, and a low level means it is abnormal. The host server BMC acquires the power supply state signals from the microcontroller unit through SMBus communication.
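A minimal sketch of the level check in item (7), in Python for illustration. The signal names and the high-is-normal rule come from the text; packing the five GPIO levels into one integer, and the bit order, are assumptions.

```python
from typing import Dict, List

# Signal names from the text; bit positions are an assumed packing.
PWRGD_SIGNALS = ["+0.75V_PWRGD", "VPP_PWRGD", "+1.8V_PWRGD",
                 "GFX_SOC_PWRGD", "VDD_VDDCI_PWRGD"]


def decode_pwrgd(levels: int) -> Dict[str, bool]:
    """Map each power-good signal to True (high, normal) or False (low, abnormal)."""
    return {name: bool((levels >> bit) & 1)
            for bit, name in enumerate(PWRGD_SIGNALS)}


def abnormal_signals(levels: int) -> List[str]:
    """List the power-good signals whose GPIO level is low."""
    return [name for name, ok in decode_pwrgd(levels).items() if not ok]
```

With all five bits set, `abnormal_signals` returns an empty list; clearing bit 1 reports `VPP_PWRGD`.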
(8) The remote management chip also supports monitoring the graphics processor state signals:
the microcontroller unit monitors the graphics processor state signals through GPIO; there are four signals: GPU Fault, PCIe GEN4, SENSE_P12V_CONN_1 and SENSE_P12V_CONN_2.
The graphics processor state signals are monitored by detecting the GPIO levels of the microcontroller unit. The host server BMC obtains the graphics processor state signals from the microcontroller unit through I2C.
(9) The host server BMC obtains the firmware version number of the remote management chip:
the host server BMC and the microcontroller unit acquire the firmware version number of the remote management chip through I2C communication.
A register corresponding to the firmware version number of the remote management chip is added to the register-mapped region of the graphics processor chip; the microcontroller unit acts as the I2C slave device and the host server BMC as the I2C master device; the microcontroller unit provides an out-of-band I2C call interface to the host server BMC, and the host server BMC calls this interface to acquire the firmware version number of the remote management chip.
Preferably, the method further comprises the following steps:
S323, detecting and acquiring the fault information and PCIe interface information of the graphics processor chip based on the GPIO resources of the remote management chip.
Using the GPIO resources of the remote management chip, two graphics processor chip state signals, GPU Fault and PCIe GEN4, are monitored to obtain the fault information and PCIe interface information of the graphics processor chip.
S330, matching and executing a target management strategy based on the firmware information of the graphics processor chip, the state information of the graphics processor chip and a preset management rule.
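Step S330 can be read as a rule lookup followed by a dispatch. The Python sketch below shows one possible shape under stated assumptions: the rule names, thresholds, dictionary keys, and strategy names are all hypothetical, since the text does not list the preset management rules or the available strategies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Info = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    """One preset management rule: a condition and the strategy it selects."""
    name: str
    condition: Callable[[Info, Info], bool]
    strategy: str


# Hypothetical rules, ordered from most to least urgent.
RULES: List[Rule] = [
    Rule("gpu_fault", lambda fw, st: bool(st.get("gpu_fault")), "power_cycle_card"),
    Rule("over_temp", lambda fw, st: st.get("gpu_temp_c", 0) >= 95, "throttle_clock"),
    Rule("old_mcu_fw", lambda fw, st: fw.get("mcu_fw", (0, 0, 0)) < (1, 2, 0),
         "schedule_firmware_update"),
]


def match_strategy(firmware: Info, state: Info,
                   rules: List[Rule] = RULES) -> Optional[str]:
    """Return the strategy of the first rule whose condition holds, else None."""
    for rule in rules:
        if rule.condition(firmware, state):
            return rule.strategy
    return None


def match_and_execute(firmware: Info, state: Info,
                      actions: Dict[str, Callable[[], None]],
                      rules: List[Rule] = RULES) -> Optional[str]:
    """Match a target strategy and run its action, if any rule matched."""
    strategy = match_strategy(firmware, state, rules)
    if strategy is not None:
        actions[strategy]()
    return strategy
```

Versions are compared as integer tuples rather than strings so that, for example, 1.10.0 sorts after 1.2.0. Taking the first matching rule is a design choice of this sketch, not something the text prescribes.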
Example three: corresponding to the first and second embodiments, the graphics processor device management system provided by the present application is described below with reference to Fig. 4. The system may be implemented in hardware, in software, or in a combination of the two; the present application does not limit this.
In one example, the present application provides a graphics processor device management system, where the graphics processor device includes at least a graphics processor chip and a remote management chip, and the graphics processor chip is further connected with PCIe; the graphics processor device management system includes:
a first obtaining module 410, configured to obtain firmware information of the graphics processor chip based on the PCIe;
a second obtaining module 420, configured to obtain status information of the graphics processor chip based on the remote management chip;
a matching execution module 430, configured to match and execute a target management policy based on the firmware information of the graphics processor chip, the state information of the graphics processor chip, and a preset management rule.
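The three-module structure above can be sketched as a small Python class that wires the modules together. The module numbers 410, 420, and 430 come from the text; representing each module as an injected callable, and the names `GpuDeviceManagementSystem` and `run_once`, are assumptions of this sketch rather than the structure of the actual system.

```python
from typing import Callable, Dict, Optional

Info = Dict[str, object]


class GpuDeviceManagementSystem:
    """Wires together the three modules named in the text (410, 420, 430)."""

    def __init__(
        self,
        first_obtaining: Callable[[], Info],     # module 410: firmware info via PCIe
        second_obtaining: Callable[[], Info],    # module 420: state info via the remote management chip
        matching_execution: Callable[[Info, Info], Optional[str]],  # module 430
    ) -> None:
        self._first_obtaining = first_obtaining
        self._second_obtaining = second_obtaining
        self._matching_execution = matching_execution

    def run_once(self) -> Optional[str]:
        """Collect both kinds of information, then match and execute a policy."""
        firmware = self._first_obtaining()
        state = self._second_obtaining()
        return self._matching_execution(firmware, state)
```

Injecting the modules as callables allows either a hardware-backed or a software-only implementation to be plugged in, in line with the statement that the system may be implemented in hardware, software, or both.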
In one embodiment, the PCIe establishes a connection with the host server via a golden finger, and a system management bus is disposed in the golden finger; the graphics processor chip is internally provided with at least a field replaceable unit, an electrically erasable programmable read-only memory, and a temperature chip;
the first obtaining module 410 includes:
a first reading unit 411, configured to read the field replaceable unit information, the electrically erasable programmable read only memory information, and the temperature chip information based on the PCIe and the system management bus.
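Once the field replaceable unit data has been read over the system management bus, the host typically validates it before use. The Python sketch below assumes the data follows the common IPMI FRU storage layout (an 8-byte common header whose bytes sum to zero modulo 256, with area offsets stored in 8-byte units); the text does not say which FRU format the device uses, so this layout and the function names are assumptions.

```python
def fru_header_is_valid(header: bytes) -> bool:
    """Check an 8-byte IPMI-style FRU common header.

    The header is accepted when the format version (low nibble of byte 0)
    is 1 and all eight bytes sum to zero modulo 256.
    """
    if len(header) != 8:
        return False
    return (header[0] & 0x0F) == 0x01 and sum(header) % 256 == 0


def board_area_offset(header: bytes) -> int:
    """Return the board info area offset in bytes (byte 3, in 8-byte units)."""
    if not fru_header_is_valid(header):
        raise ValueError("invalid FRU common header")
    return header[3] * 8
```

A header that fails the check would normally be reported to the caller rather than parsed further, since a bad checksum usually indicates a failed bus read or an unprogrammed EEPROM.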
Preferably, the first obtaining module 410 further includes:
a transmitting unit 412, configured to transmit a PCIe x16 signal sent by the host server to the graphics processor chip via the golden finger based on the PCIe;
a first obtaining unit 413, configured to obtain data calculated by the graphics processor chip based on the PCIe x16 signal and transmit the data to the host server.
More preferably, the graphics processor chip is connected with a power connector based on the golden finger to supply power; the first obtaining module 410 further includes:
a second reading unit 414, configured to read the remote management chip state information and the graphics processor chip voltage state information based on the PCIe and the system management bus.
In an embodiment, the second obtaining module 420 is specifically configured to:
acquire power supply information and out-of-band information of the graphics processor chip based on the remote management chip.
Preferably, a micro control unit is arranged in the remote management chip; the second obtaining module 420 includes:
a second obtaining unit 421, configured to obtain power supply power data and power supply status signal data based on the monitoring of the remote management chip, where the power supply power data at least includes peak power supply power data and power supply power data of each component in the graphics processor device;
a third obtaining unit 422, configured to obtain, based on the micro control unit and the bidirectional two-wire synchronous serial bus (I2C), internal temperature data of the graphics processor chip, whole-card power data, video memory temperature data, clock frequency, memory error checking and correcting data, power supply power data, a power state signal, a graphics processor state signal, and a firmware version number of the remote management chip.
More preferably, the second obtaining module 420 further includes:
a fourth obtaining unit 423, configured to obtain the fault information and PCIe interface information of the graphics processor chip based on GPIO resource detection of the remote management chip.
Example four: corresponding to the first to third embodiments, the computer device provided by the present application is described below with reference to Fig. 5. In one example, as shown in Fig. 5, the present application provides a computer device comprising:
one or more processors;
and memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
acquiring firmware information of the graphics processor chip based on the PCIe;
acquiring state information of the graphics processor chip based on the remote management chip;
and matching and executing a target management strategy based on the firmware information of the graphics processor chip, the state information of the graphics processor chip and a preset management rule.
In one embodiment, the program instructions, when read and executed by the one or more processors, further perform the following:
reading the field replaceable unit information, the electrically erasable programmable read-only memory information, and the temperature chip information based on the PCIe and the system management bus.
The program instructions, when read and executed by the one or more processors, further perform the following:
transmitting a PCIe x16 signal sent by the main server to the graphics processor chip through the golden finger based on the PCIe;
and acquiring data calculated by the graphics processor chip based on the PCIe x16 signal and transmitting the data to the main server.
The program instructions, when read and executed by the one or more processors, further perform the following:
reading the remote management chip state information and the graphics processor chip voltage state information based on the PCIe and the system management bus.
The program instructions, when read and executed by the one or more processors, further perform the following:
and acquiring power supply information and out-of-band information of the graphics processor chip based on the remote management chip.
The program instructions, when read and executed by the one or more processors, further perform the following:
monitoring and acquiring power supply power data and power state signal data based on the remote management chip, wherein the power supply power data at least comprises peak power supply power data and power supply power data of each component in the graphics processor device;
and acquiring internal temperature data of the graphics processor chip, whole-card power data, video memory temperature data, clock frequency, memory error checking and correcting data, power supply power data, a power state signal, a graphics processor state signal, and a firmware version number of the remote management chip based on the micro control unit and the bidirectional two-wire synchronous serial bus (I2C).
The program instructions, when read and executed by the one or more processors, further perform the following:
and detecting and acquiring the fault information and PCIe interface information of the graphics processor chip based on the GPIO resource of the remote management chip.
When the program instructions are read and executed by the one or more processors, operations corresponding to the steps in the foregoing method embodiments may also be performed; refer to the foregoing description for details, which are not repeated here.
Fig. 5 schematically illustrates an architecture of the computer device, which may specifically include a processor 510, a video display adapter 511, a disk drive 512, an input/output interface 513, a network interface 514, and a memory 520. The processor 510, the video display adapter 511, the disk drive 512, the input/output interface 513, the network interface 514, and the memory 520 may be communicatively connected by a communication bus 530.
The processor 510 may be implemented by a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided in the present application.
The Memory 520 may be implemented in the form of a Read Only Memory (ROM), a Random Access Memory (RAM), a static storage device, a dynamic storage device, or the like. The memory 520 may store an operating system 521 for controlling the operation of the computer device 500, and a Basic Input Output System (BIOS) 522 for controlling low-level operations of the computer device 500. In addition, a web browser 523, a data storage manager 524, an icon font processing system 525, and the like may also be stored. The icon font processing system 525 may be an application program that implements the operations of the foregoing steps in this embodiment of the application. In summary, when the technical solution provided in the present application is implemented by software or firmware, the relevant program codes are stored in the memory 520 and called to be executed by the processor 510.
The input/output interface 513 is used for connecting an input/output module to realize information input and output. The input/output module may be configured as a component within the device (not shown) or may be external to the device to provide corresponding functionality. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 514 is used for connecting a communication module (not shown in the figure) to realize communication between the device and other devices. The communication module may communicate in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi, or Bluetooth).
Bus 530 includes a path that transfers information between the various components of the device, such as processor 510, video display adapter 511, disk drive 512, input/output interface 513, network interface 514, and memory 520.
In addition, the computer apparatus 500 may also obtain information of specific pickup conditions from the virtual resource object pickup condition information database 541 for performing condition judgment, and the like.
It should be noted that although the computer device 500 only shows the processor 510, the video display adapter 511, the disk drive 512, the input/output interface 513, the network interface 514, the memory 520, the bus 530 and the like, in a specific implementation, the computer device may also include other components necessary for normal operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
Example five: corresponding to the first to fourth embodiments, a computer-readable storage medium provided by the present application is described below. In one example, the present application provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the following steps:
acquiring firmware information of the graphics processor chip based on the PCIe;
acquiring state information of the graphics processor chip based on the remote management chip;
and matching and executing a target management strategy based on the firmware information of the graphics processor chip, the state information of the graphics processor chip, and a preset management rule.
In one embodiment, the computer program when executed by a processor further performs the steps of:
reading the field replaceable unit information, the electrically erasable programmable read-only memory information, and the temperature chip information based on the PCIe and the system management bus.
The computer program when executed by a processor further realizes the steps of:
transmitting a PCIe x16 signal sent by the main server to the graphics processor chip through the golden finger based on the PCIe;
and acquiring data calculated by the graphics processor chip based on the PCIe x16 signal and transmitting the data to the main server.
The computer program when executed by a processor further realizes the steps of:
reading the remote management chip state information and the graphics processor chip voltage state information based on the PCIe and the system management bus.
The computer program when executed by a processor further realizes the steps of:
and acquiring power supply information and out-of-band information of the graphics processor chip based on the remote management chip.
The computer program when executed by a processor further realizes the steps of:
monitoring and acquiring power supply power data and power state signal data based on the remote management chip, wherein the power supply power data at least comprises peak power supply power data and power supply power data of each component in the graphics processor device;
and acquiring internal temperature data of the graphics processor chip, whole-card power data, video memory temperature data, clock frequency, memory error checking and correcting data, power supply power data, a power state signal, a graphics processor state signal, and a firmware version number of the remote management chip based on the micro control unit and the bidirectional two-wire synchronous serial bus (I2C).
The computer program when executed by a processor further realizes the steps of:
and detecting and acquiring the fault information and PCIe interface information of the graphics processor chip based on the GPIO resource of the remote management chip.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the description of the method embodiments for relevant points. The above-described system embodiments are merely illustrative, wherein the modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
In addition, it is to be understood that: the terms "first", "second", "third" and "fourth" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or as implying a number of indicated technical features. Thus, features defined as "first", "second", "third", "fourth" may explicitly or implicitly include one or more of the features.
It should be understood that the above-mentioned embodiments are only illustrative of the technical concepts and features of the present invention, and are intended to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the scope of the present invention. All modifications made in accordance with the spirit of the main technical scheme of the invention are intended to be covered by the scope of the invention.

Claims (10)

1. A graphics processor device management method, characterized in that the graphics processor device comprises at least a graphics processor chip and a remote management chip, wherein the graphics processor chip is further connected with PCIe; the method comprises the following steps:
acquiring firmware information of the graphics processor chip based on the PCIe;
acquiring state information of the graphics processor chip based on the remote management chip;
and matching and executing a target management strategy based on the firmware information of the graphics processor chip, the state information of the graphics processor chip, and a preset management rule.
2. The graphics processor device management method according to claim 1, wherein the PCIe establishes a connection with a host server via a golden finger, and a system management bus is provided in the golden finger;
the graphics processor chip is internally provided with at least a field replaceable unit, an electrically erasable programmable read-only memory, and a temperature chip;
the obtaining firmware information of the graphics processor chip based on the PCIe includes:
reading the field replaceable unit information, the electrically erasable programmable read-only memory information, and the temperature chip information based on the PCIe and the system management bus.
3. The graphics processor device management method of claim 2, the method further comprising:
transmitting a PCIe x16 signal sent by the main server to the graphics processor chip through the golden finger based on the PCIe;
and acquiring data calculated by the graphics processor chip based on the PCIe x16 signal and transmitting the data to the main server.
4. The graphics processor device management method according to claim 2, wherein the graphics processor chip is connected with a power connector to supply power based on the golden finger; the method further comprises the following steps:
reading the remote management chip state information and the graphics processor chip voltage state information based on the PCIe and the system management bus.
5. The graphics processor device management method of claim 1, wherein the obtaining state information of the graphics processor chip based on the remote management chip comprises:
and acquiring power supply information and out-of-band information of the graphics processor chip based on the remote management chip.
6. The graphics processor device management method according to claim 5, wherein a micro control unit is provided in the remote management chip; the obtaining power supply information and out-of-band information of the graphics processor chip based on the remote management chip comprises:
monitoring and acquiring power supply power data and power state signal data based on the remote management chip, wherein the power supply power data at least comprises peak power supply power data and power supply power data of each component in the graphics processor device;
and acquiring internal temperature data of the graphics processor chip, whole-card power data, video memory temperature data, clock frequency, memory error checking and correcting data, power supply power data, a power state signal, a graphics processor state signal, and a firmware version number of the remote management chip based on the micro control unit and the bidirectional two-wire synchronous serial bus (I2C).
7. The graphics processor device management method of claim 6, wherein the obtaining power supply information and out-of-band information for the graphics processor chip based on the remote management chip further comprises:
and detecting and acquiring the fault information and PCIe interface information of the graphics processor chip based on the GPIO resource of the remote management chip.
8. A graphics processor device management system, characterized in that the graphics processor device comprises at least a graphics processor chip and a remote management chip, wherein the graphics processor chip is further connected with PCIe; the system comprises:
a first obtaining module, configured to obtain the firmware information of the graphics processor chip based on the PCIe;
a second obtaining module, configured to obtain the state information of the graphics processor chip based on the remote management chip;
and a matching execution module, configured to match and execute a target management strategy based on the firmware information of the graphics processor chip, the state information of the graphics processor chip, and a preset management rule.
9. A computer device, characterized in that the device comprises:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the graphics processor device management method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1-7.
Application CN202211234208.0A: Method, system, computer device and medium for managing graphic processor device
Priority date: 2022-10-10
Filing date: 2022-10-10
Publication: CN115576774A (en), published 2023-01-06
Status: Pending
Family ID: 84585039
Country: CN


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination