CN112052143A - Graphic processor GPU management method and system - Google Patents

Publication number
CN112052143A
Authority
CN
China
Prior art keywords
gpu
temperature
voltage
current
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010952527.XA
Other languages
Chinese (zh)
Inventor
仇金斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Embedway Technologies Shanghai Corp
Original Assignee
Embedway Technologies Shanghai Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Embedway Technologies Shanghai Corp filed Critical Embedway Technologies Shanghai Corp
Priority to CN202010952527.XA priority Critical patent/CN112052143A/en
Publication of CN112052143A publication Critical patent/CN112052143A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G06F11/3093 Configuration details thereof, e.g. installation, enabling, spatial arrangement of the probes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a graphics processing unit (GPU) management method and system. The GPU is connected to a server through a PCIe data line, and the GPU management system is connected to the server and to the GPU through respective buses. The GPU management system comprises a management control unit, an information acquisition unit and an alarm unit. The output end of the information acquisition unit is connected to the input end of the management control unit, and the information acquisition unit is used for acquiring the temperature, voltage and current of the GPU. The control end of the management control unit is connected to the alarm unit, and the management control unit is used for judging, according to the received temperature, voltage and current, whether the GPU is abnormal; if so, it controls the connection between the GPU and the server to be disconnected and sends an alarm signal to the alarm unit, so that the alarm unit gives an alarm based on the alarm signal. The invention can monitor the state of the GPU in time and give a corresponding alarm in time when the GPU is abnormal, thereby avoiding the risk of the GPU burning out.

Description

Graphic processor GPU management method and system
Technical Field
The present invention relates to the field of electronic technologies, and in particular to a method and a system for managing a graphics processing unit (GPU).
Background
With the development of 5G technology, demand for domestically produced servers is increasing. Accordingly, the GPUs of such servers are being applied more widely, and managing them is becoming increasingly important.
At present, GPU management is implemented on top of an operating system. However, the operating system starts slowly, so the state of the GPU cannot be monitored in time; when the GPU becomes abnormal, a corresponding alarm cannot be given in time, and the GPU is at risk of burning out.
Disclosure of Invention
In view of this, the present invention provides a GPU management method and system that monitor the state of the GPU in time, so that a corresponding alarm is given in time when the GPU is abnormal, thereby avoiding the risk of the GPU burning out.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
A first aspect of the present invention discloses a GPU management system. The GPU is connected to a server through a PCIe data line, and the GPU management system is connected to the server and to the GPU through respective buses. The GPU management system comprises: a management control unit (MCU), an information acquisition unit and an alarm unit;
the output end of the information acquisition unit is connected to the input end of the MCU, and the information acquisition unit is used for acquiring information from the GPU to obtain the temperature, voltage and current of the GPU;
the control end of the MCU is connected to the alarm unit, and the MCU is used for receiving the temperature, voltage and current of the GPU sent by the information acquisition unit; judging whether the GPU is abnormal according to the temperature, voltage and current of the GPU; and, if the GPU is abnormal, controlling the connection between the GPU and the server to be disconnected and sending an alarm signal to the alarm unit;
and the alarm unit is used for receiving the alarm signal sent by the MCU and giving an alarm based on the alarm signal.
Optionally, the input end of the MCU includes a first input end and a second input end, and the information acquisition unit includes a temperature sensor and a power management chip. The output end of the temperature sensor is connected to the first input end of the MCU, and the output end of the power management chip is connected to the second input end of the MCU. The information acquisition unit, in acquiring information from the GPU, is specifically configured to:
acquire the temperature of the GPU using the temperature sensor, convert the temperature into an electrical signal, and send the electrical signal to the MCU;
and acquire the voltage and current of the GPU using the power management chip, and send the acquired voltage and current to the MCU.
Optionally, the MCU, in receiving the temperature, voltage and current of the GPU sent by the information acquisition unit and judging whether the GPU is abnormal according to them, is specifically configured to:
receive the electrical signal sent by the temperature sensor and analyze it to obtain the temperature of the GPU; receive the voltage and current of the GPU sent by the power management chip; judge whether the temperature of the GPU exceeds a preset temperature threshold, whether the voltage of the GPU exceeds a preset voltage threshold, and whether the current of the GPU exceeds a preset current threshold; and, if any one of the temperature, voltage and current of the GPU exceeds its corresponding preset threshold, determine that the GPU is abnormal.
Optionally, the GPU management system further includes a GPU fan, and the GPU fan is connected to the MCU through a fan interface of the GPU;
the MCU is further used for controlling the wind speed of the GPU fan according to the temperature of the GPU; judging whether the wind speed of the GPU fan exceeds a preset wind speed threshold; and, if the wind speed exceeds the preset wind speed threshold, determining that the GPU is abnormal, controlling the connection between the GPU and the server to be disconnected, and sending an alarm signal to the alarm unit.
Optionally, the MCU is further configured to:
if the GPU is abnormal, generate abnormality information related to the GPU and upload the abnormality information to the server through an I2C bus.
A second aspect of the present invention discloses a GPU management method, which is applied to a GPU management system and comprises the following steps:
acquiring information from the GPU to obtain the temperature, voltage and current of the GPU;
judging whether the GPU is abnormal according to the temperature, voltage and current of the GPU;
and, if the GPU is abnormal, controlling the connection between the GPU and the server to be disconnected and triggering an alarm.
Optionally, acquiring information from the GPU includes:
acquiring the temperature of the GPU, converting the temperature into an electrical signal, and acquiring the voltage and current of the GPU.
Optionally, judging whether the GPU is abnormal according to the temperature, voltage and current of the GPU includes:
analyzing the electrical signal to obtain the temperature of the GPU;
judging whether the temperature of the GPU exceeds a preset temperature threshold, whether the voltage of the GPU exceeds a preset voltage threshold, and whether the current of the GPU exceeds a preset current threshold;
and, if any one of the temperature, voltage and current of the GPU exceeds its corresponding preset threshold, determining that the GPU is abnormal.
Optionally, the method further includes:
controlling the wind speed of a GPU fan according to the temperature of the GPU;
judging whether the wind speed of the GPU fan exceeds a preset wind speed threshold;
and, if the wind speed exceeds the preset wind speed threshold, determining that the GPU is abnormal, controlling the connection between the GPU and the server to be disconnected, and triggering an alarm.
Optionally, the method further includes:
if the GPU is abnormal, generating abnormality information related to the GPU and sending the abnormality information to the server.
The invention provides a GPU management method and system. The GPU is connected to a server through a PCIe data line, and the GPU management system is connected to the server and to the GPU through respective buses. The GPU management system comprises a management control unit (MCU), an information acquisition unit and an alarm unit. The information acquisition unit acquires the temperature, voltage and current of the GPU and sends them to the MCU, so that the MCU judges whether the GPU is abnormal according to the received temperature, voltage and current. When the GPU is determined to be abnormal, the MCU controls the connection between the GPU and the server to be disconnected and sends an alarm signal to the alarm unit, so that the alarm unit gives an alarm based on the received alarm signal. The GPU management system provided by the invention starts quickly, so the state of the GPU can be monitored in time; the connection between the GPU and the server is disconnected when the GPU is found to be abnormal, which avoids the risk of the GPU burning out.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic structural diagram of a GPU management system according to an embodiment of the present invention;
FIG. 2 is an example structural diagram of a GPU management system according to an embodiment of the present invention;
FIG. 3 is a flowchart of a GPU management method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As the background above explains, GPU management is currently implemented on top of an operating system. The operating system starts slowly, so the state of the GPU cannot be monitored in time; when the GPU becomes abnormal, a corresponding alarm cannot be given in time, and the GPU is at risk of burning out. Moreover, because GPUs come from many manufacturers and in many specifications, the operating system needs a different customized driver for each, so GPUs of different kinds cannot be managed in a unified way.
Therefore, the embodiments of the present invention provide a GPU management method and system that monitor the state of the GPU in time, so that a corresponding alarm is given in time when the GPU is abnormal and the risk of the GPU burning out is avoided. In addition, because the GPU is managed by the GPU management system, no customized drivers are needed, and GPUs of different kinds can be managed in a unified way.
Referring to FIG. 1, which shows a schematic structural diagram of a GPU management system according to an embodiment of the present invention, a GPU 101 is connected to a server 102 through a PCIe data line, and a GPU management system 103 is connected to the server 102 and to the GPU 101 through respective buses. The GPU management system 103 includes a management control unit (MCU), an information acquisition unit and an alarm unit. The bus connecting the GPU management system 103 to the server 102 may be an I2C upload bus.
The output end of the information acquisition unit is connected to the input end of the MCU, and the information acquisition unit is used for acquiring information from the GPU 101 to obtain the temperature, voltage and current of the GPU 101.
In this embodiment of the present application, the information acquisition unit includes a temperature sensor and a power management chip. The output end of the temperature sensor is connected to a first input end of the MCU, and the output end of the power management chip is connected to a second input end of the MCU. That is, the input end of the MCU comprises the first input end and the second input end, and the output end of the information acquisition unit comprises the output end of the temperature sensor and the output end of the power management chip.
The information acquisition unit may acquire information from the GPU 101 as follows: the temperature sensor acquires the temperature of the GPU 101 and converts it into an electrical signal, which is sent to the MCU; the power management chip acquires the voltage and current of the GPU 101 and sends them to the MCU, so that the MCU can judge whether the GPU 101 is abnormal according to the received temperature, voltage and current.
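The description does not say how the MCU analyzes the electrical signal from the temperature sensor. As a minimal sketch only, the following assumes an analog sensor with a linear output read through an ADC; the ADC resolution, reference voltage and sensor slope are all hypothetical values chosen for illustration and do not come from the application.

```python
# Hypothetical parameters: none of these values appear in the application.
ADC_BITS = 12                    # assumed ADC resolution
ADC_REF_VOLTS = 3.3              # assumed ADC reference voltage
SENSOR_VOLTS_PER_DEGREE = 0.010  # assumed linear sensor slope (10 mV per degree C)


def adc_counts_to_celsius(counts: int) -> float:
    """Convert a raw ADC reading of the sensor signal into degrees Celsius."""
    max_counts = (1 << ADC_BITS) - 1
    if not 0 <= counts <= max_counts:
        raise ValueError(f"ADC reading out of range: {counts}")
    volts = counts / max_counts * ADC_REF_VOLTS
    return volts / SENSOR_VOLTS_PER_DEGREE
```

A digital sensor that reports temperature directly over a bus would make this conversion step unnecessary; the application leaves the sensor type open.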
The control end of the MCU is connected to the alarm unit. The MCU is used for receiving the temperature, voltage and current of the GPU 101 sent by the information acquisition unit and judging whether the GPU 101 is abnormal according to them; when the GPU 101 is determined to be abnormal, the MCU controls the connection between the GPU 101 and the server 102 to be disconnected and sends an alarm signal to the alarm unit.
The alarm unit is used for receiving the alarm signal sent by the MCU and giving an alarm based on the received alarm signal.
In this embodiment of the present application, a temperature threshold, a voltage threshold and a current threshold are preset. The MCU may judge whether the GPU 101 is abnormal according to the received temperature, voltage and current as follows: after receiving the electrical signal sent by the temperature sensor, the MCU analyzes it to obtain the temperature of the GPU 101; the MCU then judges whether the temperature of the GPU 101 exceeds the preset temperature threshold, whether the voltage of the GPU 101 exceeds the preset voltage threshold, and whether the current of the GPU 101 exceeds the preset current threshold. If any one of the temperature, voltage and current of the GPU 101 exceeds its corresponding preset threshold, the GPU 101 is determined to be abnormal. If none of them exceeds its threshold, the temperature, voltage and current of the GPU 101 are acquired again and the judgment is repeated on the newly acquired values.
In this embodiment of the present application, the preset temperature threshold may be 90 degrees Celsius, the preset voltage threshold may be 15 V, and the preset current threshold may be 2.5 A. The specific values of the preset temperature threshold, voltage threshold and current threshold can be set as required, and the embodiments of the present application do not limit them.
For example, suppose the preset temperature threshold is 90 degrees Celsius, the preset voltage threshold is 15 V, and the preset current threshold is 2.5 A. If the temperature of the GPU 101 obtained by the MCU from the electrical signal of the temperature sensor is 80 degrees Celsius, the voltage of the GPU 101 received from the power management chip is 17 V, and the current of the GPU 101 received from the power management chip is 2.1 A, the GPU 101 is determined to be abnormal because the received voltage (17 V) is greater than the preset voltage threshold (15 V).
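The threshold judgment in this example can be sketched in a few lines of Python. This is an illustration rather than the MCU firmware: the class and function names are hypothetical, and only the threshold values and the sample readings are taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Preset limits; defaults are the example values from the description."""
    temperature_c: float = 90.0
    voltage_v: float = 15.0
    current_a: float = 2.5


@dataclass(frozen=True)
class Reading:
    """One sample of the three monitored quantities."""
    temperature_c: float
    voltage_v: float
    current_a: float


def exceeded_limits(reading: Reading, limits: Thresholds = Thresholds()) -> list:
    """Return the names of the quantities that are strictly above their limit."""
    checks = {
        "temperature": reading.temperature_c > limits.temperature_c,
        "voltage": reading.voltage_v > limits.voltage_v,
        "current": reading.current_a > limits.current_a,
    }
    return [name for name, over in checks.items() if over]


def is_abnormal(reading: Reading, limits: Thresholds = Thresholds()) -> bool:
    """The GPU is abnormal if any one quantity exceeds its threshold."""
    return bool(exceeded_limits(reading, limits))


# The example from the description: 80 degrees C, 17 V, 2.1 A.
print(exceeded_limits(Reading(80.0, 17.0, 2.1)))  # ['voltage']
```

Returning the list of exceeded quantities, rather than only a boolean, makes it easy to include the cause in the abnormality information sent to the server.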
In this embodiment of the present application, after analyzing the electrical signal sent by the temperature sensor to obtain the temperature of the GPU 101, the MCU generates abnormality information for the GPU 101 from that temperature and from the voltage and current of the GPU 101 received from the power management chip, and uploads the abnormality information to the server 102 through the I2C upload bus, so that the server 102 stores the received abnormality information in its built-in memory.
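The application does not define the format of the abnormality information. Purely as an illustration, the sketch below packs the three readings and a cause bitmask into a fixed 7-byte payload that an MCU could write over I2C; the field layout, scaling factors and flag values are all hypothetical, and the layout assumes non-negative readings.

```python
import struct

# Hypothetical layout: 1 flag byte, then temperature (0.1 degree C units),
# voltage (mV) and current (mA) as little-endian unsigned 16-bit integers.
_RECORD = struct.Struct("<BHHH")

FLAG_TEMPERATURE = 0x01  # hypothetical cause flags
FLAG_VOLTAGE = 0x02
FLAG_CURRENT = 0x04


def encode_abnormality(temp_c: float, voltage_v: float, current_a: float, flags: int) -> bytes:
    """Pack one abnormality record into bytes for upload."""
    return _RECORD.pack(
        flags,
        round(temp_c * 10),
        round(voltage_v * 1000),
        round(current_a * 1000),
    )


def decode_abnormality(payload: bytes) -> dict:
    """Unpack a record on the server side."""
    flags, temp, millivolts, milliamps = _RECORD.unpack(payload)
    return {
        "flags": flags,
        "temperature_c": temp / 10,
        "voltage_v": millivolts / 1000,
        "current_a": milliamps / 1000,
    }
```

For the earlier example (80 degrees Celsius, 17 V, 2.1 A, voltage over threshold), `encode_abnormality(80.0, 17.0, 2.1, FLAG_VOLTAGE)` yields a 7-byte payload that `decode_abnormality` turns back into the same values.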
Further, in this embodiment, the GPU management system 103 also includes a GPU fan, which is connected to the MCU through a fan interface of the GPU. After the MCU analyzes the electrical signal sent by the temperature sensor to obtain the temperature of the GPU 101, it can control the wind speed of the GPU fan according to that temperature and then further judge, from the wind speed of the GPU fan, whether the GPU 101 is abnormal. If the GPU 101 is abnormal, the MCU controls the connection between the GPU 101 and the server 102 to be disconnected and sends an alarm signal to the alarm unit, so that the alarm unit gives an alarm based on the received alarm signal.
In this embodiment of the present application, a wind speed threshold is preset. The MCU may further judge whether the GPU 101 is abnormal from the wind speed of the GPU fan as follows: the MCU judges whether the wind speed of the GPU fan exceeds the preset wind speed threshold, and if it does, the GPU 101 is determined to be abnormal.
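The application does not state how the wind speed is derived from the temperature. The sketch below assumes a simple clamped linear mapping, which is a hypothetical choice; the gain and the maximum speed are invented for illustration, while the 180 m/s threshold is the value used in the worked example later in the description.

```python
WIND_SPEED_THRESHOLD = 180.0    # m/s, from the worked example in the description
HYPOTHETICAL_GAIN = 2.0         # assumed m/s per degree C; not from the application
HYPOTHETICAL_MAX_SPEED = 250.0  # assumed upper limit of the fan; not from the application


def target_wind_speed(temperature_c: float) -> float:
    """Map GPU temperature to a fan wind speed with an assumed linear ramp."""
    return min(HYPOTHETICAL_MAX_SPEED, max(0.0, HYPOTHETICAL_GAIN * temperature_c))


def fan_indicates_abnormal(wind_speed: float, threshold: float = WIND_SPEED_THRESHOLD) -> bool:
    """The GPU is treated as abnormal when the wind speed exceeds the threshold."""
    return wind_speed > threshold
```

With this assumed gain, a temperature of 100 degrees Celsius maps to 200 m/s, which exceeds the 180 m/s threshold and therefore marks the GPU as abnormal, matching the outcome of the worked example.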
In this embodiment of the present application, the MCU may upload the wind speed of the GPU fan to the server 102 through the I2C upload bus, so that the server 102 stores the received wind speed in its built-in memory.
In this embodiment of the present application, after receiving the alarm signal sent by the MCU, the alarm unit gives an alarm based on the received alarm signal.
As a preferred mode of this embodiment, if the alarm unit includes an alarm LED, the MCU sends an alarm signal to the alarm LED when the GPU 101 is determined to be abnormal, and the alarm LED flashes based on the received alarm signal to notify operation and maintenance personnel that the GPU 101 is abnormal, so that they can service the GPU in time.
As another preferred mode of this embodiment, if the alarm unit includes an alarm buzzer, the MCU sends an alarm signal to the alarm buzzer when the GPU 101 is determined to be abnormal, and the alarm buzzer sounds based on the received alarm signal to notify operation and maintenance personnel that the GPU 101 is abnormal, so that they can service the GPU in time.
As yet another preferred mode of this embodiment, if the alarm unit includes both an alarm buzzer and an alarm LED, the MCU sends an alarm signal to each of them when the GPU 101 is determined to be abnormal; the alarm buzzer sounds and the alarm LED flashes based on the received alarm signal to notify operation and maintenance personnel that the GPU 101 is abnormal, so that they can service the GPU in time.
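The three alarm configurations above (LED only, buzzer only, or both) can be modelled as one alarm unit that forwards the alarm signal to whichever indicators it contains. The sketch below is a software model with hypothetical class names; on real hardware each `on_alarm` method would drive a GPIO pin instead of setting a flag.

```python
class AlarmLed:
    """Stand-in for the alarm LED; records whether it is flashing."""

    def __init__(self) -> None:
        self.flashing = False

    def on_alarm(self) -> None:
        self.flashing = True


class AlarmBuzzer:
    """Stand-in for the alarm buzzer; records whether it is sounding."""

    def __init__(self) -> None:
        self.sounding = False

    def on_alarm(self) -> None:
        self.sounding = True


class AlarmUnit:
    """Forwards the MCU's alarm signal to every indicator it was built with."""

    def __init__(self, *indicators) -> None:
        self.indicators = list(indicators)

    def receive_alarm_signal(self) -> None:
        for indicator in self.indicators:
            indicator.on_alarm()


# Third configuration from the description: buzzer and LED together.
led, buzzer = AlarmLed(), AlarmBuzzer()
AlarmUnit(led, buzzer).receive_alarm_signal()
```

Passing only `AlarmLed()` or only `AlarmBuzzer()` to `AlarmUnit` gives the first two configurations without any change to the MCU-side logic.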
The invention provides a GPU management system. The GPU is connected to a server through a PCIe data line, and the GPU management system is connected to the server and to the GPU through respective buses. The GPU management system comprises a management control unit (MCU), an information acquisition unit and an alarm unit. The information acquisition unit acquires the temperature, voltage and current of the GPU and sends them to the MCU, so that the MCU judges whether the GPU is abnormal according to the received temperature, voltage and current. When the GPU is determined to be abnormal, the MCU controls the connection between the GPU and the server to be disconnected and sends an alarm signal to the alarm unit, so that the alarm unit gives an alarm based on the received alarm signal. The GPU management system provided by the invention starts quickly, so the state of the GPU can be monitored in time; the connection between the GPU and the server is disconnected when the GPU is found to be abnormal, which avoids the risk of the GPU burning out.
To better understand the above, the following is illustrated in the form of one embodiment.
Referring to FIG. 2, which shows an example structural diagram of a GPU management system according to an embodiment of the present invention, a GPU 201 is connected to a server 202 through a PCIe data line, and a GPU management system 203 is connected to the server 202 and to the GPU 201 through respective buses. The GPU management system 203 includes a management control unit (MCU), a temperature sensor, a power management chip, an alarm LED and an alarm buzzer. The bus connecting the GPU management system 203 to the server 202 may be an I2C upload bus. The output end of the temperature sensor is connected to the first input end of the MCU, the output end of the power management chip is connected to the second input end of the MCU, and the control end of the MCU is connected to the alarm LED and to the alarm buzzer.
The GPU management system is started at the same time as the GPU 201 and the server 202. The temperature sensor of the GPU management system acquires the temperature (100 °C) of the GPU 201 and converts it into an electrical signal, which is sent to the MCU; the power management chip acquires the voltage (12 V) and current (2.1 A) of the GPU 201 and sends them to the MCU. The MCU analyzes the electrical signal sent by the temperature sensor to obtain a GPU temperature of 100 °C, and controls the wind speed of the GPU fan to be 200 m/s according to that temperature.
The preset temperature threshold is 90 °C, the preset voltage threshold is 15 V, the preset current threshold is 2.5 A, and the preset wind speed threshold is 180 m/s. The GPU 201 is determined to be abnormal because its temperature (100 °C) is greater than the preset temperature threshold (90 °C) and the wind speed of the GPU fan (200 m/s) is greater than the preset wind speed threshold (180 m/s). The MCU sends an alarm signal to the alarm buzzer and to the alarm LED; the alarm buzzer sounds and the alarm LED flashes based on the received alarm signal to notify operation and maintenance personnel that the GPU 201 is abnormal, so that they can service the GPU in time.
Based on the GPU management system shown in FIG. 1, the invention correspondingly discloses a GPU management method, which is applicable to the GPU management system. The GPU is connected to the server through a PCIe data line, and the GPU management system is connected to the server and to the GPU through respective buses. Referring to FIG. 3, which shows a flowchart of a GPU management method according to an embodiment of the present invention, the method specifically includes the following steps:
S301: acquire information from the GPU to obtain the temperature, voltage and current of the GPU.
In step S301, the GPU management system acquires the temperature of the GPU, converts the acquired temperature into an electrical signal, and acquires the voltage and current of the GPU.
S302: and judging whether the GPU is abnormal or not according to the temperature, the voltage and the current of the GPU, executing the step S303 if the GPU is abnormal, and returning to continue executing the step S301 if the GPU is not abnormal.
In the embodiment of the application, a temperature threshold, a voltage threshold and a current threshold are preset, and the electrical signals are analyzed through a graphic processor GPU management system to obtain the temperature of the graphic processor GPU; judging whether the temperature of the GPU exceeds a preset temperature threshold value, whether the voltage of the GPU exceeds a preset voltage threshold value and whether the current of the GPU exceeds a preset current threshold value; and if any one of the temperature, the voltage and the current of the GPU exceeds a corresponding preset threshold value, determining that the GPU is abnormal. If the temperature, the voltage and the current of the GPU do not exceed the corresponding preset threshold values, the temperature, the voltage and the current of the GPU are collected again, so that whether the GPU is abnormal or not is judged according to the temperature, the voltage and the current of the GPU collected again.
In this embodiment of the application, after the GPU management system analyzes the electrical signal to obtain the temperature of the GPU, it may generate exception information for the GPU from the obtained temperature, voltage and current, and upload the exception information to the server through the I2C bus, so that the server stores the received exception information in its built-in memory.
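One possible shape for the exception information is sketched below. The JSON encoding, the field names and the `bus_write` callable are assumptions introduced for illustration; the description states only that the information is generated from the temperature, voltage and current and sent to the server over the bus. The bus transfer itself is stubbed out with a plain Python callable.

```python
import json
from typing import Callable


def build_exception_info(temp_c: float, voltage_v: float, current_a: float) -> bytes:
    """Pack the three measured values into a byte payload (hypothetical format)."""
    record = {
        "temperature_c": temp_c,
        "voltage_v": voltage_v,
        "current_a": current_a,
    }
    return json.dumps(record, sort_keys=True).encode("utf-8")


def upload_exception_info(payload: bytes, bus_write: Callable[[bytes], None]) -> None:
    """Hand the payload to a bus-write routine supplied by the caller."""
    bus_write(payload)


# Stand-in for the bus: collect whatever would have been written.
sent = []
upload_exception_info(build_exception_info(95.0, 12.1, 18.4), sent.append)
```

Passing the bus routine in as a parameter keeps the sketch independent of any particular I2C driver.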
Further, in this embodiment of the application, after the GPU management system analyzes the electrical signal to obtain the temperature of the GPU, it may control the speed of the GPU fan according to that temperature and then further judge whether the GPU is abnormal according to the fan speed. If the GPU is abnormal, the GPU management system disconnects the GPU from the server and triggers an alarm.
In this embodiment of the application, a fan speed threshold is preset. The GPU management system may further judge whether the GPU is abnormal from the fan speed as follows: it judges whether the speed of the GPU fan exceeds the preset fan speed threshold, and if it does, determines that the GPU is abnormal.
In this embodiment of the application, the GPU management system may upload the speed of the GPU fan to the server through the I2C bus, so that the server stores the received fan speed in its built-in memory.
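The fan handling described above has two parts: choosing a fan speed from the GPU temperature, and comparing the fan speed with a preset threshold. The sketch below assumes a linear temperature-to-speed curve expressed in RPM; the curve, its end points and the function names are hypothetical, since the description does not say how the speed is derived from the temperature.

```python
def target_fan_speed_rpm(
    temp_c: float,
    min_temp_c: float = 40.0,
    max_temp_c: float = 85.0,
    min_rpm: float = 1000.0,
    max_rpm: float = 5000.0,
) -> float:
    """Map temperature to fan speed with a clamped linear curve (assumed)."""
    if temp_c <= min_temp_c:
        return min_rpm
    if temp_c >= max_temp_c:
        return max_rpm
    fraction = (temp_c - min_temp_c) / (max_temp_c - min_temp_c)
    return min_rpm + fraction * (max_rpm - min_rpm)


def fan_speed_abnormal(fan_speed_rpm: float, threshold_rpm: float) -> bool:
    """Return True if the fan speed exceeds the preset threshold."""
    return fan_speed_rpm > threshold_rpm


midpoint_speed = target_fan_speed_rpm(62.5)            # halfway along the curve
too_fast = fan_speed_abnormal(5200.0, threshold_rpm=5000.0)
```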
S303: Disconnect the GPU from the server and trigger an alarm.
In this embodiment of the application, when the GPU is determined to be abnormal, the connection between the GPU and the server is disconnected and an alarm is triggered.
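Taken together, steps S301 to S303 form a polling loop: collect a reading, judge it, and either continue or disconnect and raise the alarm. The sketch below shows that control flow under stated assumptions. The readings come from an in-memory list instead of real sensors, the `disconnect` and `alarm` callables are placeholders for hardware actions, and the numeric limits are illustrative.

```python
from typing import Callable, Iterable, Tuple

# (temperature, voltage, current); units are assumed for illustration.
Reading = Tuple[float, float, float]


def monitor(
    readings: Iterable[Reading],
    limits: Reading,
    disconnect: Callable[[], None],
    alarm: Callable[[], None],
) -> int:
    """Run S301-S303 over a stream of readings; return how many were examined."""
    examined = 0
    for reading in readings:  # S301: acquire temperature, voltage, current
        examined += 1
        # S302: abnormal if any value exceeds its corresponding limit
        if any(value > limit for value, limit in zip(reading, limits)):
            disconnect()      # S303: cut the GPU off from the server
            alarm()           # S303: trigger the alarm
            break
    return examined


events = []
count = monitor(
    readings=[(60.0, 12.0, 15.0), (70.0, 12.1, 16.0), (96.0, 12.0, 15.5), (50.0, 12.0, 10.0)],
    limits=(90.0, 12.6, 25.0),
    disconnect=lambda: events.append("disconnect"),
    alarm=lambda: events.append("alarm"),
)
```

In this run the third reading exceeds the temperature limit, so the loop stops there and the fourth reading is never examined.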
The invention provides a GPU management method applied to a GPU management system, in which the GPU is connected to a server through a PCIe data line and the GPU management system is connected to the server and to the GPU through respective buses. The GPU management system collects the temperature, voltage and current of the GPU, judges from the collected values whether the GPU is abnormal, and, when the GPU is determined to be abnormal, disconnects the GPU from the server and triggers an alarm. The GPU management system provided by the invention can start quickly, so that the state of the GPU is monitored in a timely manner and the GPU is disconnected from the server when an abnormality is detected, which avoids the risk of the GPU burning out.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, refer to the corresponding descriptions of the method embodiments. The system embodiments described above are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements shall also fall within the protection scope of the present invention.

Claims (10)

1. A graphics processor (GPU) management system, the GPU being connected with a server through a PCIe data line, characterized in that the GPU management system is connected with the server and the GPU respectively through buses, and the GPU management system comprises: a management control unit MCU, an information acquisition unit and an alarm unit;
the output end of the information acquisition unit is connected with the input end of the management control unit MCU and is used for acquiring information of the GPU to obtain the temperature, voltage and current of the GPU;
the control end of the management control unit MCU is connected with the alarm unit and is used for receiving the temperature, the voltage and the current of the GPU sent by the information acquisition unit; judging whether the GPU is abnormal or not according to the temperature, the voltage and the current of the GPU; if the GPU is abnormal, controlling the connection between the GPU and the server to be disconnected, and sending an alarm signal to the alarm unit;
and the alarm unit is used for receiving the alarm signal sent by the management control unit MCU and giving an alarm based on the alarm signal.
2. The system according to claim 1, wherein the input terminal of the MCU comprises a first input terminal and a second input terminal, the information acquisition unit comprises a temperature sensor and a power management chip, the output terminal of the temperature sensor is connected to the first input terminal of the MCU, the output terminal of the power management chip is connected to the second input terminal of the MCU, and the information acquisition unit for acquiring information from the GPU is specifically configured to:
acquiring the temperature of the GPU based on the temperature sensor, converting the temperature of the GPU into an electric signal, and sending the electric signal to the MCU;
and acquiring the voltage and the current of the GPU based on the power management chip, and sending the acquired voltage and the acquired current of the GPU to the management control unit MCU.
3. The system according to claim 2, wherein the management control unit MCU, in receiving the temperature, the voltage and the current of the GPU sent by the information acquisition unit and judging whether the GPU is abnormal according to the temperature, the voltage and the current of the GPU, is specifically configured for:
receiving an electric signal sent by the temperature sensor, and analyzing the electric signal to obtain the temperature of the GPU; receiving the voltage and the current of the graphics processor sent by the power management chip; judging whether the temperature of the GPU exceeds a preset temperature threshold value, whether the voltage of the GPU exceeds a preset voltage threshold value and whether the current of the GPU exceeds a preset current threshold value; and if any one of the temperature, the voltage and the current of the GPU exceeds a corresponding preset threshold value, determining that the GPU is abnormal.
4. The system of claim 1, wherein the GPU management system further comprises a GPU fan connected to the MCU through a fan interface of the GPU;
the management control unit MCU is also used for controlling the fan speed of the GPU fan according to the temperature of the GPU; judging whether the fan speed of the GPU fan exceeds a preset fan speed threshold value; and if the fan speed exceeds the preset fan speed threshold value, determining that the GPU is abnormal, controlling the connection between the GPU and the server to be disconnected, and sending an alarm signal to the alarm unit.
5. The system of claim 1, wherein the management control unit MCU is further configured to:
and if the GPU is abnormal, generating abnormal information related to the GPU, and uploading the abnormal information to the server through an I2C bus.
6. A GPU management method, applied to a GPU management system, the method comprising the following steps:
acquiring information of a GPU to obtain the temperature, voltage and current of the GPU;
judging whether the GPU is abnormal or not according to the temperature, the voltage and the current of the GPU;
and if the GPU is abnormal, controlling the connection between the GPU and the server to be disconnected, and triggering an alarm.
7. The method of claim 6, wherein the collecting information for the GPU comprises:
collecting the temperature of the GPU, converting the temperature of the GPU into an electric signal, and collecting the voltage and the current of the GPU.
8. The method of claim 7, wherein the determining whether the GPU is abnormal according to the temperature, the voltage and the current of the GPU comprises:
analyzing the electric signal to obtain the temperature of the GPU;
judging whether the temperature of the GPU exceeds a preset temperature threshold value, whether the voltage of the GPU exceeds a preset voltage threshold value and whether the current of the GPU exceeds a preset current threshold value;
and if any one of the temperature, the voltage and the current of the GPU exceeds a corresponding preset threshold value, determining that the GPU is abnormal.
9. The method of claim 6, further comprising:
controlling the fan speed of a GPU fan according to the temperature of the GPU;
judging whether the fan speed of the GPU fan exceeds a preset fan speed threshold value;
and if the fan speed exceeds the preset fan speed threshold value, determining that the GPU is abnormal, controlling the connection between the GPU and the server to be disconnected, and triggering an alarm.
10. The method of claim 6, further comprising:
and if the GPU is abnormal, generating abnormal information related to the GPU, and sending the abnormal information to the server.
CN202010952527.XA 2020-09-11 2020-09-11 Graphic processor GPU management method and system Pending CN112052143A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010952527.XA CN112052143A (en) 2020-09-11 2020-09-11 Graphic processor GPU management method and system

Publications (1)

Publication Number Publication Date
CN112052143A (en) 2020-12-08

Family

ID=73610547

Country Status (1)

Country Link
CN (1) CN112052143A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1624662A (en) * 2004-12-17 2005-06-08 李谦 Method of monitoring device on AGP plate and its device
CN1704865A (en) * 2004-06-02 2005-12-07 联想(北京)有限公司 Server heat dissipation administrative system and method thereof
CN104182328A (en) * 2014-08-18 2014-12-03 深圳市杰和科技发展有限公司 System and method for recording and managing working states of display cards
CN105762870A (en) * 2016-03-30 2016-07-13 合肥联宝信息技术有限公司 Battery having protection and early warning functions and electronic equipment having the battery
CN108062270A (en) * 2017-12-14 2018-05-22 郑州云海信息技术有限公司 Fan failure management method, system, device and readable storage medium storing program for executing
CN108829565A (en) * 2018-06-30 2018-11-16 常州大学 A kind of computer operation condition monitoring system
CN108874628A (en) * 2018-06-19 2018-11-23 山东超越数控电子股份有限公司 A kind of computer motherboard health and fitness information intelligent management apapratus
CN210052117U (en) * 2019-08-12 2020-02-11 铜仁学院 Big data-based computer performance control system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
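Liu Wei, Li Guoqin (刘暐, 李国芹): "Sensor Principles and Application Technology" (《传感器原理及应用技术》), 31 August 2019, Beijing Institute of Technology Press (北京理工大学出版社), pages: 61 *
Bertil Schmidt et al. (贝蒂尔·施密特等): "Parallel Programming" (《并行程序设计》), 31 May 2020, China Machine Press (机械工业出版社), pages: 206 *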

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination