CN118819927A - A server detection link error correction method, device, equipment and medium - Google Patents
- Publication number
- CN118819927A (application CN202410840617.8A)
- Authority
- CN
- China
- Prior art keywords
- management controller
- link
- interface
- management
- working state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/16—Constructional details or arrangements
- G06F1/20—Cooling means
- G06F1/206—Cooling means comprising thermal management
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3058—Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Human Computer Interaction (AREA)
- Computing Systems (AREA)
- Debugging And Monitoring (AREA)
Abstract
An embodiment of the invention provides an error correction method, apparatus, device and medium for a detection link of a server. The server comprises a graphics processor module, and the detection link is used to detect a temperature value of a graphics processor. The detection link comprises a management interface of a baseboard management controller, a first link between the baseboard management controller and a main management controller, and a management interface of the main management controller. The method comprises: acquiring the temperature value of a graphics processor in the graphics processor module as detected by the detection link; judging, according to that temperature value, whether a temperature abnormality condition is met; and, when the temperature abnormality condition is met, increasing the rotating speed of the fan to dissipate heat from the graphics processor module and performing error correction on the components of the detection link in sequence. In this way, an abnormality in the detection link of the server can be repaired in time, prolonged GPU overheating caused by loss of GPU temperature detection is avoided, and the stability and computing efficiency of the server are improved.
Description
Technical Field
The present invention relates to the field of circuit detection technologies, and in particular, to an error correction method, apparatus, device, and medium for a detection link of a server.
Background
As graphics processing units (GPUs) and GPU modules are upgraded from generation to generation, they deliver higher floating-point throughput and video memory bandwidth, and the heat generated by GPU components rises accordingly. A GPU server acquires, in real time through a detection link, the temperatures of components with high heat dissipation requirements, such as the GPU, and executes a corresponding heat dissipation policy according to those temperatures to ensure that the server dissipates heat normally. When the detection link of the GPU server is abnormal, effective monitoring of the real-time temperature state of the GPU is lost.
However, existing GPU management methods lack an error correction mechanism for detection link abnormalities. If an abnormality in the detection link is not repaired in time, the server cannot execute the corresponding heat dissipation policy according to the real-time temperature state, and the GPU may remain overheated for a long time. In a high-temperature environment, the stability and service life of the GPU are greatly reduced. When the GPU temperature exceeds the design specification, the GPU slows down, and faults such as over-temperature card drop and errors reported by computing applications may occur, causing unnecessary damage to the GPU module and reducing the stability and computing efficiency of the server.
Disclosure of Invention
In order to solve the problems, the embodiment of the invention discloses an error correction method, device, equipment and medium for a detection link of a server.
In a first aspect, an embodiment of the present invention provides an error correction method for a detection link of a server, where the server includes a graphics processor module, the detection link is used to detect a temperature value of a graphics processor in the graphics processor module, the detection link includes a management interface of a baseboard management controller, a first link between the baseboard management controller and a main management controller, and a management interface of the main management controller, and the method includes:
Acquiring a temperature value of a graphics processor in the graphics processor module detected by the detection link;
Judging whether a temperature abnormality condition is met according to the temperature value of the graphics processor in the graphics processor module;
When the temperature abnormality condition is met, increasing the rotating speed of the fan to dissipate heat from the graphics processor module, and performing error correction in sequence on the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller, and the management interface of the main management controller.
Optionally, the correcting the error of the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller, and the management interface of the main management controller sequentially includes:
Detecting the working state of the management interface of the baseboard management controller;
If the working state of the management interface of the baseboard management controller is abnormal, correcting the error of the management interface of the baseboard management controller;
If the working state of the management interface of the baseboard management controller is normal, detecting the working state of the first link between the baseboard management controller and the main management controller;
If the working state of the first link is abnormal, correcting the error of the first link;
If the working state of the first link is normal, detecting the working state of the management interface of the main management controller;
If the working state of the management interface of the main management controller is abnormal, correcting the error of the management interface of the main management controller;
Acquiring a temperature value of the graphics processor after the working state of the management interface of the baseboard management controller and the working state of the first link are normal;
And if the temperature values of all the graphics processors are normal, restoring the rotating speed of the fan.
Optionally, if the working state of the management interface of the baseboard management controller is abnormal, performing error correction on the management interface of the baseboard management controller includes:
if the management interface of the baseboard management controller is not activated or occupied, determining that the working state of the management interface of the baseboard management controller is abnormal;
And re-enabling the authority of the management interface of the baseboard management controller so as to correct the error of the management interface of the baseboard management controller.
Optionally, if the working state of the first link is abnormal, performing error correction on the first link includes:
If the network connection state of the first link is abnormal, determining that the working state of the first link is abnormal;
And reestablishing the network connection of the first link to correct the error of the first link.
Optionally, the management interface of the master management controller includes a first master management controller interface, and if the working state of the management interface of the master management controller is abnormal, performing error correction on the management interface of the master management controller includes:
If the first main management controller interface is not activated or occupied, determining that the working state of the first main management controller interface is abnormal;
resetting the management function corresponding to the master management controller to correct the management interface of the master management controller.
Optionally, the management interface of the master management controller further includes a second master management controller interface, and the error correction is performed on the management interface of the master management controller, and further includes:
If the working state of the management interface of the master management controller is abnormal after the management function corresponding to the master management controller is reset, resetting the master management controller through an I2C command to correct the error of the management interface of the master management controller.
Optionally, the method further comprises:
And recording events which sequentially correct errors of the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller and the management interface of the main management controller into a log.
In a second aspect, an embodiment of the present invention provides an error correction apparatus for a detection link of a server, where the server includes a graphics processor module, the detection link is configured to detect a temperature value of a graphics processor in the graphics processor module, the detection link includes a management interface of a baseboard management controller, a first link between the baseboard management controller and a main management controller, and a management interface of the main management controller, and the apparatus includes:
A module detection temperature acquisition module, used for acquiring a temperature value of a graphics processor in the graphics processor module detected by the detection link;
A temperature abnormality condition judging module, used for judging whether the temperature abnormality condition is met according to the temperature value of the graphics processor in the graphics processor module;
And an abnormality detection link error correction module, used for increasing the rotating speed of the fan to dissipate heat from the graphics processor module when the temperature abnormality condition is met, and performing error correction in sequence on the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller, and the management interface of the main management controller.
Optionally, the anomaly detection link error correction module includes:
A first detection sub-module, used for detecting the working state of the management interface of the baseboard management controller;
A first error correction sub-module, used for correcting the error of the management interface of the baseboard management controller if the working state of the management interface of the baseboard management controller is abnormal;
A second detection sub-module, used for detecting the working state of the first link between the baseboard management controller and the main management controller if the working state of the management interface of the baseboard management controller is normal;
A second error correction sub-module, used for correcting the error of the first link if the working state of the first link is abnormal;
A third detection sub-module, used for detecting the working state of the management interface of the main management controller if the working state of the first link is normal;
A third error correction sub-module, used for correcting the error of the management interface of the main management controller if the working state of the management interface of the main management controller is abnormal;
A temperature acquisition sub-module, used for acquiring the temperature value of the graphics processor after the working state of the management interface of the baseboard management controller and the working state of the first link are normal;
And a fan adjusting sub-module, used for restoring the rotating speed of the fan if the temperature values of all the graphics processors are normal.
Optionally, the first error correction sub-module includes:
The first abnormality determination unit is used for determining that the working state of the management interface of the baseboard management controller is abnormal if the management interface of the baseboard management controller is not activated or occupied;
And the first error correction unit is used for re-enabling the authority of the management interface of the baseboard management controller so as to correct the error of the management interface of the baseboard management controller.
Optionally, the second error correction sub-module includes:
A second abnormality determining unit, configured to determine that, if the network connection state of the first link is abnormal, the working state of the first link is abnormal;
and the second error correction unit is used for reestablishing the network connection of the first link so as to correct the error of the first link.
Optionally, the management interface of the master management controller includes a first master management controller interface, and the third error correction sub-module includes:
the third abnormality determining unit is used for determining that the working state of the first main management controller interface is abnormal if the first main management controller interface is not activated or occupied;
and the third error correction unit is used for resetting the management function corresponding to the main management controller so as to correct the management interface of the main management controller.
Optionally, the management interface of the master management controller further includes a second master management controller interface, and the third error correction sub-module further includes:
And the fourth error correction unit is used for resetting the master management controller through an I2C command to correct the error of the management interface of the master management controller if the working state of the management interface of the master management controller is abnormal after the management function corresponding to the master management controller is reset.
Optionally, the apparatus further comprises:
and the error correction event recording log module is used for recording events which are sequentially subjected to error correction on the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller and the management interface of the main management controller into a log.
In a third aspect, the present invention provides an electronic device comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the error correction method for a detection link of a server described above.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the error correction method for a detection link of a server described above.
The embodiment of the invention has the following advantages:
By acquiring the temperature value of a graphics processor in the graphics processor module as detected by the detection link, the embodiment of the invention can monitor the temperature value of the graphics processor in real time. By judging whether a temperature abnormality condition is met according to that temperature value, it can judge the working state of the detection link in real time. When the temperature abnormality condition is met, the rotating speed of the fan is increased to dissipate heat from the graphics processor module, and error correction is performed in sequence on the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller, and the management interface of the main management controller. As a result, an abnormality in the detection link of the server can be repaired in time, prolonged GPU overheating caused by loss of detection of the real-time GPU temperature state is avoided, and the stability and computing efficiency of the GPU module server in large-scale computation are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a server according to an embodiment of the present invention;
FIG. 2 is a logic diagram of an out-of-band management method for GPU temperature acquisition anomalies according to an embodiment of the present invention;
FIG. 3 is a flow chart of steps of a method for error correction of a detection link of a server according to an embodiment of the present invention;
FIG. 4 is a flow chart of steps of another error correction method for a detection link of a server according to an embodiment of the present invention;
FIG. 5 is a logic diagram of a method for error correction of a detection link of a server according to an embodiment of the present invention;
FIG. 6 is a block diagram of an error correction apparatus for a detection link of a server according to an embodiment of the present invention;
FIG. 7 is a block diagram of an electronic device according to an embodiment of the invention;
FIG. 8 is a block diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
At present, artificial intelligence has become a hot industry and has penetrated many fields across different industries. With the rapid development of artificial intelligence algorithms, the demand for computing power has grown rapidly as well. As the core carrier of intelligent computing, the GPU server has significant advantages in processing parallel, compute-intensive tasks, and its excellent graphics processing and high-performance computing capabilities deliver extreme computing performance. By offloading the compute-intensive part of an application's workload to the GPU while the CPU (Central Processing Unit) runs the remaining program code, a GPU server can greatly increase the running speed of the application, effectively relieve computing pressure, and improve the computing efficiency and competitiveness of a product.
Artificial intelligence, complex simulation, and massive data sets require multiple GPUs with extremely fast interconnects and a fully accelerated software stack. A GPU module integrates multiple GPUs into a single computing unit through high-speed interconnection, achieving high bandwidth, low latency, strong performance and high scalability. The GPU module server has become a core resource and computing power engine for hyperscale data centers and intelligent computing centers, and its stability is closely tied to the computing power guarantee of the intelligent computing center.
Referring to FIG. 1, a block diagram of a server according to an embodiment of the present invention is shown. The server comprises a mainboard on which a BMC (Baseboard Management Controller) module is arranged. An HMC (Host Management Controller, referred to herein as the main management controller) module is arranged on the GPU module of the server. The BMC module and the HMC module are communicatively connected through a physical link, and the HMC module is also communicatively connected to the GPUs on the GPU module through a physical link. The communication protocol may be the Redfish protocol or the I2C protocol; Redfish is an open management interface and protocol, and I2C is a serial communication protocol. The BMC module may use the Redfish out-of-band (OOB) protocol to perform out-of-band management and monitoring of components such as the GPUs and switches in the GPU module.
Referring to FIG. 2, a logic diagram of an out-of-band management method for GPU temperature acquisition abnormalities according to an embodiment of the present invention is shown. In this method, the GPU module server controls the GPU temperature through out-of-band management mainly as follows: the BMC module is connected through a physical link to the main management controller (HMC) module on the GPU module, and uses an out-of-band management protocol to manage and monitor each GPU component in the GPU module. When the Redfish interface of the BMC or of the HMC, or the link established between the HMC and the BMC, does not respond, the out-of-band temperature values that the BMC module obtains for the GPUs are abnormal. When the temperature return values of all GPUs are still abnormal after the BMC has polled 3 times, the BMC automatically triggers the abnormal-condition fan regulation strategy to ensure heat dissipation of the GPUs.
The out-of-band management method in FIG. 2 lacks an error correction mechanism for abnormalities of the BMC interface, the HMC interface, and the link between the BMC and the HMC, and the BMC module does not actively intervene to repair such abnormalities. Moreover, if the out-of-band temperature acquisition is abnormal for only some GPUs rather than all of them, the BMC module does not trigger abnormal-condition fan speed regulation. In that case, the real-time fan speed, or the rate at which it increases, may not meet the heat dissipation demand caused by the continuously rising GPU temperature, so a GPU may drop off due to over-temperature or the server may go down. Once the GPU module has dropped a card due to over-temperature, the states of all GPUs can be recovered only by AC power cycling the whole server system.
To address the above problems, the invention provides an error correction method for a detection link of a server, which aims to repair an abnormality in the detection link of the server in time and to avoid prolonged GPU overheating caused by loss of detection of the real-time GPU temperature state. To this end, the embodiment of the invention detects the temperature value of the GPU in the GPU module in real time through the detection link. When the returned temperature value is abnormal and the temperature abnormality condition is met, the rotating speed of the fan is increased to dissipate heat from the GPU module, and error correction is performed in sequence on the management interface of the BMC, the first link between the BMC and the HMC, and the management interface of the HMC, so that the stability and computing efficiency of the GPU module server can be improved.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Referring to FIG. 3, a flowchart of the steps of an error correction method for a detection link of a server according to an embodiment of the present invention is shown. The server includes a graphics processor module, the detection link is used to detect a temperature value of a graphics processor in the graphics processor module, and the detection link includes a management interface of a baseboard management controller, a first link between the baseboard management controller and a main management controller, and a management interface of the main management controller. The method may specifically include the following steps:
Step 101, obtaining a temperature value of a graphics processor in the graphics processor module detected by the detection link;
In the embodiment of the invention, the server comprises a GPU module, and the GPU module comprises a plurality of GPUs. The server can detect the temperature value of each GPU in the GPU module through the detection link and return the detection results to the BMC, which avoids prolonged GPU overheating caused by loss of detection of the real-time GPU temperature state. The detection link includes a management interface of the BMC, a first link between the BMC and the HMC, and a management interface of the HMC. The management interface of the BMC may be a Redfish interface, the management interface of the HMC may be a Redfish interface, and the first link between the BMC and the HMC may be a Redfish link or an I2C link.
Step 102, judging whether a temperature abnormality condition is met according to the temperature value of the graphics processor in the graphics processor module;
In the embodiment of the invention, the BMC can judge whether the GPU module meets the temperature abnormality condition according to the temperature value of each GPU in the GPU module, and thereby judge in real time whether the detection link of the server is abnormal. Specifically, the BMC can monitor the temperature value of each GPU in the GPU module in real time and record an abnormality count whenever the temperature value of at least one GPU in the GPU module is abnormal. When the abnormality count reaches a preset count threshold, the GPU module is determined to meet the temperature abnormality condition; the preset count threshold may be 2. Those skilled in the art may set the preset count threshold to another appropriate value in line with the idea of the invention, and the invention is not limited in this respect.
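The counting logic described above can be sketched as follows. This is only an illustration under stated assumptions: the function and variable names are hypothetical, a reading is treated as abnormal when it is not a plain non-negative integer, and the counter is never reset, none of which is specified by the source.

```shell
# Sketch only: names and the "abnormal" test are hypothetical, not from the source.
THRESHOLD=2        # preset count threshold (the value suggested in the text)
abnormal_count=0   # number of polls with at least one abnormal GPU reading

# Assumption: a reading is normal only if it is a plain non-negative integer.
is_abnormal_reading() {
    case "$1" in
        ''|*[!0-9]*) return 0 ;;  # empty or contains a non-digit: abnormal
        *) return 1 ;;            # digits only: normal
    esac
}

# Takes one poll's readings (one argument per GPU) and updates the counter.
# Returns 0 (true) once the temperature abnormality condition is met.
record_poll() {
    for reading in "$@"; do
        if is_abnormal_reading "$reading"; then
            abnormal_count=$((abnormal_count + 1))
            break   # one abnormal GPU is enough to count this poll
        fi
    done
    [ "$abnormal_count" -ge "$THRESHOLD" ]
}
```

For instance, `record_poll 45 NA 47` counts one abnormal poll because a single GPU failed to return a value, which is the case the earlier all-GPUs-abnormal trigger would have missed.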
And step 103, when the temperature abnormal condition is met, increasing the rotating speed of a fan to radiate heat of the graphic processor module, and correcting errors of the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller and the management interface of the main management controller in sequence.
In the embodiment of the invention, when the temperature abnormality condition is met, the detection link of the server can be determined to be abnormal. The BMC can increase the rotating speed of the fan to 100% to dissipate heat from the GPU module, and perform error correction in sequence on the management interface of the BMC, the first link between the BMC and the HMC, and the management interface of the HMC. In this way, an abnormality in the detection link of the server can be repaired in time, and prolonged GPU overheating caused by loss of detection of the real-time GPU temperature state is avoided.
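As a rough sketch of this ordering, the three stages can be walked in sequence, moving on to the next stage only when the current one is healthy. All function names below are hypothetical placeholders for the real checks and repairs, and re-checking a stage right after repairing it is an assumption of this sketch rather than something the source states.

```shell
# Hypothetical hooks: on a real BMC these would query and repair the hardware.
# Each check_* returns 0 when the stage is healthy.
check_bmc_interface()  { return 0; }
check_first_link()     { return 0; }
check_hmc_interface()  { return 0; }
repair_bmc_interface() { :; }
repair_first_link()    { :; }
repair_hmc_interface() { :; }
set_fan_percent()      { :; }

actions=""   # ordered record of what the sketch did, e.g. for logging

correct_detection_link() {
    set_fan_percent 100          # cool the GPU module first
    actions="fan=100"
    # Order follows the text: BMC interface, first link, HMC interface.
    for stage in bmc_interface first_link hmc_interface; do
        if ! "check_$stage"; then
            "repair_$stage"
            actions="$actions repair:$stage"
            # Assumption: re-check after repair and stop if still abnormal.
            "check_$stage" || return 1
        fi
    done
    return 0
}
```

Keeping the checks and repairs behind small functions makes the order easy to read and lets each stage be replaced by the concrete commands given for it elsewhere in the description.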
In one embodiment, the step of sequentially correcting the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller, and the management interface of the main management controller may include the following sub-steps:
step S11, detecting the working state of a management interface of the substrate management controller;
Step S12, if the working state of the management interface of the baseboard management controller is abnormal, correcting the error of the management interface of the baseboard management controller;
In one embodiment, if the working state of the management interface of the baseboard management controller is abnormal, performing error correction on the management interface of the baseboard management controller includes: if the management interface of the baseboard management controller is not activated or occupied, determining that the working state of the management interface of the baseboard management controller is abnormal; and re-enabling the authority of the management interface of the baseboard management controller so as to correct the error of the management interface of the baseboard management controller.
In the embodiment of the invention, when the detection link of the server is abnormal, the BMC may first diagnose its own Redfish interface, checking whether the Redfish interface state of the BMC module is Enabled and whether the interface is occupied by the SMBPBI privilege over I2C. If the Redfish interface of the BMC is not activated, or is occupied, the working state of the Redfish interface of the BMC is determined to be abnormal. The privilege of the Redfish interface of the BMC is then repaired, that is, re-enabled, to correct the Redfish interface of the BMC, thereby ensuring the stability and reliability of the Redfish interface and improving the availability of the BMC's management functions.
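The abnormality decision in this paragraph reduces to a two-input test, sketched below. The function name and the string values passed in ("Enabled", "yes"/"no") are assumptions made for illustration; the source does not say how the BMC reads the interface state or the occupancy flag.

```shell
# Returns 0 (abnormal) when the interface is not Enabled or is held over I2C.
# $1: reported interface state, e.g. "Enabled" (string values are assumed)
# $2: "yes" if the SMBPBI privilege is currently held over I2C, else "no"
bmc_redfish_is_abnormal() {
    state=$1
    occupied=$2
    [ "$state" != "Enabled" ] || [ "$occupied" = "yes" ]
}
```

A caller would run the privilege repair only when this test succeeds, for example `bmc_redfish_is_abnormal "$state" "$occupied" && echo "repair needed"`.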
Specifically, the process of repairing the privilege of the Redfish interface of the BMC is as follows: the out-of-band management privilege is switched from the Redfish protocol to the I2C protocol, the I2C privilege is then released, and the privilege is switched back to the Redfish protocol. The code for repairing the privilege of the Redfish interface of the BMC is as follows:
***
# Switch out-of-band privilege to I2C
echo "-----Get SMBPBI fencing privilege command for BMC--Opcode a3H,arg1 01H-----"
i2cset -y 11 0x60 0x5c 0x04 0xa3 0x01 0x00 0x80 i
i2ctransfer -y 11 w1@0x60 0x5c r5
i2ctransfer -y 11 w1@0x60 0x5d r5
# Release out-of-band privilege from I2C and switch back to Redfish
echo "-----release SMBPBI fencing privilege command for BMC--Opcode a3H,arg1 00H-----"
i2cset -y 11 0x60 0x5c 0x04 0xa3 0x00 0x00 0x80 i
i2ctransfer -y 11 w1@0x60 0x5c r5
i2ctransfer -y 11 w1@0x60 0x5d r5
***
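The fence-and-release sequence above could be wrapped in a reusable function. The following is a minimal sketch, assuming the bus number, device address, registers and opcode bytes shown in the listing; the names `run`, `smbpbi_fence`, `repair_bmc_redfish_privilege` and the `DRY_RUN` switch are hypothetical and are not part of the patent.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative, not from the patent.
I2C_BUS=11        # bus number taken from the listing above
SMBPBI_ADDR=0x60  # device address taken from the listing above

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Send an SMBPBI fencing command (opcode 0xa3) and read back both registers.
# $1 is arg1: 0x01 takes the privilege for I2C, 0x00 releases it.
smbpbi_fence() {
    run i2cset -y "$I2C_BUS" "$SMBPBI_ADDR" 0x5c 0x04 0xa3 "$1" 0x00 0x80 i &&
    run i2ctransfer -y "$I2C_BUS" "w1@$SMBPBI_ADDR" 0x5c r5 &&
    run i2ctransfer -y "$I2C_BUS" "w1@$SMBPBI_ADDR" 0x5d r5
}

# Take the out-of-band privilege over I2C, then release it back to Redfish.
repair_bmc_redfish_privilege() {
    smbpbi_fence 0x01 && smbpbi_fence 0x00
}
```

Setting `DRY_RUN=1` before calling `repair_bmc_redfish_privilege` prints the six commands instead of touching the bus, which may be useful for reviewing the sequence on a machine without the hardware.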
Step S13, if the working state of the management interface of the baseboard management controller is normal, detecting the working state of a first link between the baseboard management controller and the main management controller;
Step S14, if the working state of the first link is abnormal, correcting the error of the first link;
In one embodiment, the performing error correction on the first link if the working state of the first link is abnormal includes: if the network connection state of the first link is abnormal, determining that the working state of the first link is abnormal; and reestablishing the network connection of the first link to correct the error of the first link.
In the embodiment of the invention, if the working state of the Redfish interface of the BMC is normal, the working state of the first link between the BMC and the HMC is detected next. Specifically, the BMC module may ping the default IP address defined by the HMC module (default IP: 192.168.31.1) and check the boot progress status of the HMC. If the IP address cannot be pinged, the network connection state of the first link is abnormal, that is, a problem has occurred in establishing the USB enumeration link of the first link. The network connection of the first link can then be re-established to correct the error, ensuring that the BMC module can interact with the HMC module. The code for re-establishing the network connection of the first link is as follows:
***
ping -c 4 192.168.31.1
i2cdump -y 11 0x54
# No size specified (using byte-data access)
***
## USB re-enumeration to re-establish the link
cd /sys/bus/platform/drivers/ehci-platform/
ls    # listing includes: 1e6a3000.usb bind
echo 1e6a3000.usb > unbind
echo 1e6a3000.usb > bind
sleep 30
ping -c 4 192.168.31.1
***
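The check-repair-recheck pattern of the listing above can be sketched as a small function. This is an illustration only: the IP address, device name, driver directory and 30-second wait come from the listing, while the function names and the `PING_CMD`, `DRIVER_DIR` and `SETTLE_SECONDS` overrides are hypothetical additions that make the sketch testable off the target.

```shell
#!/bin/sh
# Sketch only: function names and the override variables are illustrative.
HMC_IP=${HMC_IP:-192.168.31.1}
USB_DEV=${USB_DEV:-1e6a3000.usb}
DRIVER_DIR=${DRIVER_DIR:-/sys/bus/platform/drivers/ehci-platform}
SETTLE_SECONDS=${SETTLE_SECONDS:-30}

# Return 0 when the HMC answers a ping.
hmc_reachable() {
    ${PING_CMD:-ping} -c 3 "$HMC_IP" >/dev/null 2>&1
}

# Unbind and rebind the USB host controller to force re-enumeration.
reenumerate_usb() {
    echo "$USB_DEV" > "$DRIVER_DIR/unbind" &&
    echo "$USB_DEV" > "$DRIVER_DIR/bind"
}

# Check the first link; repair it once if it is down, then check again.
repair_first_link() {
    hmc_reachable && return 0
    reenumerate_usb || return 1
    sleep "$SETTLE_SECONDS"
    hmc_reachable
}
```

The function returns success either when the link was already up or when it came back after re-enumeration, so a caller can decide whether to continue to the HMC interface check.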
Step S15, if the working state of the first link is normal, detecting the working state of a management interface of the main management controller;
Step S16, if the working state of the management interface of the main management controller is abnormal, correcting the error of the management interface of the main management controller;
In one embodiment, the management interface of the main management controller includes a first main management controller interface and a second main management controller interface, and performing error correction on the management interface of the main management controller if its working state is abnormal includes:
If the first main management controller interface is not activated or is occupied, determining that the working state of the first main management controller interface is abnormal;
Resetting the management function corresponding to the main management controller to correct the error of the management interface of the main management controller;
If the working state of the management interface of the main management controller is still abnormal after the management function corresponding to the main management controller is reset, resetting the main management controller through an I2C command to correct the error of the management interface of the main management controller.
In the embodiment of the invention, after the USB link between the HMC module and the BMC module has been re-established, the BMC can diagnose the Redfish interface of the HMC. The Redfish interface of the HMC includes a first HMC interface and a second HMC interface; when both interfaces return a state value of Enabled, the working state of the Redfish interface of the HMC is normal.
If the first HMC interface is not enabled or is occupied, the BMC may reset and reinitialize the Redfish-related software stack of the HMC to correct the error of the first HMC interface, which resolves problems caused by an incorrect or damaged HMC configuration and clears the error state of the software stack. If the working state of the Redfish interface of the HMC is still abnormal after this reset, the HMC module may be reset through an I2C command, that is, every Redfish interface of the HMC module is restored to factory settings through the I2C command. After the factory settings are restored, the Redfish interfaces may be reconfigured according to current requirements and best practices, so that all Redfish interfaces of the HMC are configured consistently and are easy to manage and maintain. The code for repairing the first HMC interface and the second HMC interface is as follows:
***
### reset HMC redfish interface
curl --insecure -u root:0penBmc -X POST http://192.168.31.1/redfish/v1/Managers/HGX_BMC_0/Actions/Manager.ResetToDefaults -d '{"ResetToDefaultsType":"ResetAll"}'
***
### HMC factory reset
i2cset -f -y 13 0x54 0x00 0x0f
***
It should be noted that the HMC module likewise integrates a dedicated on-board miniature Linux system, and that the management interface of the HMC module is repaired at two levels. The first level, reset HMC Redfish interface, performs a Redfish hardware reset of the HMC through a Redfish command to reset and reinitialize the Redfish-related software stack in the Linux system integrated in the HMC. The second level, HMC factory reset, restores the HMC to factory settings through an I2C command to reset and reinitialize the Linux system integrated in the HMC. The second level takes longer than the first.
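The two-level escalation can be sketched as follows. The reset commands are those of the listing above; the health check is an assumption, namely that the Manager resource reports `"State": "Enabled"` under its `Status` property as in the standard Redfish schema. The function names and the `HMC_SETTLE_SECONDS` wait are hypothetical, since the patent does not state how long to wait after a reset.

```shell
#!/bin/sh
# Sketch only: function names, the State check and the wait time are assumptions.
HMC_IP=${HMC_IP:-192.168.31.1}
HMC_CRED=${HMC_CRED:-root:0penBmc}
HMC_SETTLE_SECONDS=${HMC_SETTLE_SECONDS:-60}   # assumed wait after a reset

# Assumed health check: the Manager resource reports State "Enabled".
hmc_redfish_ok() {
    curl --insecure -s -u "$HMC_CRED" "http://$HMC_IP/redfish/v1/Managers/HGX_BMC_0" |
        grep -q '"State" *: *"Enabled"'
}

# Level 1: reset the Redfish software stack through a Redfish command.
hmc_reset_level1() {
    curl --insecure -s -u "$HMC_CRED" -X POST \
        "http://$HMC_IP/redfish/v1/Managers/HGX_BMC_0/Actions/Manager.ResetToDefaults" \
        -d '{"ResetToDefaultsType":"ResetAll"}' >/dev/null
}

# Level 2: restore factory settings through an I2C command (slower).
hmc_reset_level2() {
    i2cset -f -y 13 0x54 0x00 0x0f
}

# Try the cheaper repair first and escalate only if the check still fails.
repair_hmc_interface() {
    hmc_redfish_ok && { echo "normal"; return 0; }
    hmc_reset_level1
    sleep "$HMC_SETTLE_SECONDS"
    hmc_redfish_ok && { echo "repaired-level1"; return 0; }
    hmc_reset_level2
    sleep "$HMC_SETTLE_SECONDS"
    hmc_redfish_ok && { echo "repaired-level2"; return 0; }
    echo "failed"
    return 1
}
```

Trying the first level before the second keeps the slower factory reset as a last resort, which matches the observation that the second level takes longer.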
Step S17, after the working state of the management interface of the baseboard management controller and the working state of the first link are normal, the temperature value of the graphic processor is obtained;
Step S18, if the temperature values of all the graphics processors are normal, restoring the rotating speed of the fan.
In the embodiment of the invention, after the working state of the Redfish interface of the BMC, the working state of the first link and the working state of the Redfish interface of the HMC are all normal, the BMC can acquire the temperature values of the GPUs again to determine whether the temperature values of all the GPUs are normal. The BMC module can check the Redfish interface of the HMC module again and determine again whether the GPU temperature polling is abnormal. The re-detection code is as follows:
***
curl --insecure -u root:0penBmc -X GET http://192.168.31.1/redfish/v1/Managers/HGX_BMC_0
curl --insecure -u root:0penBmc -X GET http://192.168.31.1/redfish/v1/TelemetryService/MetricReports/HGX_PlatformEnvironmentMetrics_0
***
After it is determined that the temperature values of all GPUs are normal, that is, when the BMC module can obtain the out-of-band temperature of every GPU and confirm that it is normal, and a further check confirms that the overall state of the GPU module server is satisfactory, the rotating speed of the fan can be restored to its value before the link abnormality was detected, thereby reducing server noise.
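The "all GPU temperatures are normal" decision can be illustrated with a small helper. This is a sketch under stated assumptions: the patent text here does not give a numeric limit or the report format, so the `GPU_TEMP_LIMIT` value, the rule that a missing or non-numeric reading counts as abnormal, and the function name are all hypothetical.

```shell
#!/bin/sh
# Sketch only: the limit and the "missing or non-numeric means abnormal" rule
# are assumptions for illustration, not values from the patent.
GPU_TEMP_LIMIT=${GPU_TEMP_LIMIT:-90}

# Succeed only if every argument is a whole-number reading at or below the limit.
all_gpu_temps_normal() {
    [ "$#" -gt 0 ] || return 1          # no readings at all counts as abnormal
    for t in "$@"; do
        case "$t" in
            ''|*[!0-9]*) return 1 ;;    # empty or non-numeric reading
        esac
        [ "$t" -le "$GPU_TEMP_LIMIT" ] || return 1
    done
    return 0
}
```

A caller would pass one reading per GPU and restore the fan speed only when the function succeeds.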
In the embodiment of the invention, the baseboard management controller can monitor the temperature value of the graphic processor in the graphic processor module in real time by acquiring the temperature value of the graphic processor detected by the detection link; judging whether a temperature abnormality condition is met according to the temperature value of the graphic processor in the graphic processor module so as to judge the working state of the detection link in real time according to the temperature value of the graphic processor; when the temperature abnormality condition is met, the rotating speed of the fan is increased to radiate heat of the graphic processor module, and error correction is sequentially carried out on the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller and the management interface of the main management controller, so that the abnormality can be repaired in time when the detection link of the server is abnormal, the problem that the temperature of the GPU is overheated for a long time due to the loss of the detection of the real-time temperature state of the GPU is avoided, and the stability and the calculation efficiency of large-scale calculation of the GPU module server are improved.
Referring to fig. 4, there is shown a step flow chart of an error correction method of a detection link of another server according to an embodiment of the present invention, where the server includes a graphics processor module, the detection link is used to detect a temperature value of a graphics processor in the graphics processor module, the detection link includes a management interface of a baseboard management controller, a first link between the baseboard management controller and a master management controller, and a management interface of the master management controller, and the method specifically may include the following steps:
Step 201, obtaining a temperature value of a graphics processor in the graphics processor module detected by the detection link;
For step 201, since it is the same as step 101, the description of step 101 is referred to for the relevant points.
Step 202, judging whether a temperature abnormality condition is met according to a temperature value of a graphics processor in the graphics processor module;
For step 202, since it is the same as step 102, the description of step 102 is referred to for the relevant points.
Step 203, when the temperature abnormal condition is met, increasing the rotation speed of a fan to radiate heat of the graphics processor module, and correcting errors of the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller, and the management interface of the main management controller in sequence;
For step 203, since it is the same as step 103, the description of step 103 is referred to for the relevant points.
Step 204, recording in a log the events of sequentially correcting errors of the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller, and the management interface of the main management controller.
In the embodiment of the invention, while errors of the Redfish interface of the BMC, the first link between the BMC and the HMC, and the Redfish interface of the HMC are corrected in sequence, the event corresponding to each correction can be recorded in the log, so that the detection link of the server can be monitored over the long term and future errors and faults can be reduced by analyzing the log. In addition, the server can use the log of error correction events to calculate the abnormality probability of the Redfish interface of the BMC, the Redfish interface of the HMC and the first link respectively, and set an error correction priority for these components according to the probabilities: the higher the abnormality probability of a component, the higher its error correction priority. When the detection link fails, it can then be corrected in the order of the error correction priorities of its components, which improves the error correction efficiency of the detection link.
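The priority ordering described above can be sketched by counting logged events per component; ordering by count gives the same order as ordering by abnormality probability over the same log. The log format used here (one event per line with the component name in the last field) and the function name are assumptions for illustration, since the patent does not define the format.

```shell
#!/bin/sh
# Sketch only: the log format (one event per line, component name in the last
# field) is an assumption for illustration.

# Print components ordered from most to least frequently corrected.
error_correction_priority() {
    awk 'NF { count[$NF]++ }
         END { for (c in count) print count[c], c }' "$1" |
        sort -k1,1nr -k2,2 |
        awk '{ print $2 }'
}
```

For a log in which the first link was corrected three times, the HMC Redfish interface twice and the BMC Redfish interface once, the function lists the first link first. Ties are broken alphabetically so that the output is stable.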
The code for logging is as follows:
***
### selftest dump log
curl --insecure -u root:0penBmc -X POST "http://$HMC_IP/redfish/v1/Systems/HGX_Baseboard_0/LogServices/Dump/Actions/LogService.CollectDiagnosticData/" -d '{"DiagnosticDataType":"OEM","OEMDiagnosticDataType":"DiagnosticType=SelfTest"}' >> "$OOB_Redfish_log" 2>&1
### FPGA dump log
curl --insecure -u root:0penBmc -X POST "http://$HMC_IP/redfish/v1/Systems/HGX_Baseboard_0/LogServices/Dump/Actions/LogService.CollectDiagnosticData/" -d '{"DiagnosticDataType":"OEM","OEMDiagnosticDataType":"DiagnosticType=FPGA"}' >> "$OOB_Redfish_log" 2>&1
### EROT dump log
curl --insecure -u root:0penBmc -X POST "http://$HMC_IP/redfish/v1/Systems/HGX_Baseboard_0/LogServices/Dump/Actions/LogService.CollectDiagnosticData/" -d '{"DiagnosticDataType":"OEM","OEMDiagnosticDataType":"DiagnosticType=EROT"}' >> "$OOB_Redfish_log" 2>&1
### HMC dump log
curl --insecure -u root:0penBmc -X POST "http://$HMC_IP/redfish/v1/Managers/HGX_BMC_0/LogServices/Dump/Actions/LogService.CollectDiagnosticData" -d '{"DiagnosticDataType":"Manager"}' >> "$OOB_Redfish_log" 2>&1
***
According to the embodiment of the invention, the out-of-band management functions of the BMC module and the GPU module in the GPU module server are optimized, so that the out-of-band management interface and the out-of-band Redfish management link of the GPU module are monitored in real time and automatically diagnosed and corrected. This ensures stable out-of-band monitoring of components with high heat dissipation requirements, such as the GPUs, and timely and accurate fan regulation for the whole GPU module server system. It thereby optimizes out-of-band management monitoring and heat dissipation strategy regulation for GPU module servers in intelligent computing centers, reduces unnecessary wear and service life degradation caused by overheating, and improves the sustainability of computing power and the computing energy efficiency of the GPU module server.
Referring to fig. 5, a logic diagram of an error correction method for a detection link of a server according to an embodiment of the present invention is shown, and in order to enable those skilled in the art to better understand the embodiment of the present invention, the embodiment of the present invention is described below by using fig. 5:
1) The BMC monitors the GPU module through the HMC;
2) Judging whether a GPU temperature abnormality occurs in 2 temperature polls performed by the BMC;
3) If no GPU temperature abnormality occurs in the 2 temperature polls, the server fan operates normally;
4) If a GPU temperature abnormality occurs in the 2 temperature polls, the BMC triggers abnormality diagnosis and error correction;
5) The rotating speed of the server fan is adjusted to 100%;
6) Judging whether the working state of the Redfish interface of the BMC is abnormal;
7) If the working state of the Redfish interface of the BMC is abnormal, re-enabling the Redfish interface privilege of the BMC and recording a log;
8) If the working state of the Redfish interface of the BMC is normal, judging whether the BMC module can ping the IP of the HMC module;
9) If the BMC module cannot ping the IP of the HMC module, re-establishing the network connection of the first link and recording a log;
10) If the BMC module can ping the IP of the HMC module, judging whether the working state of the Redfish interface of the HMC is normal;
11) If the working state of the Redfish interface of the HMC is abnormal, resetting the HMC module through a Redfish command and recording a log;
12) After resetting the HMC module, judging again whether the working state of the Redfish interface of the HMC is normal;
13) If the working state of the Redfish interface of the HMC is still abnormal after the HMC module is reset, restoring the HMC module to factory settings through an I2C command and recording a log;
14) If the working state of the Redfish interface of the HMC is normal after the HMC module is reset, judging whether a GPU temperature abnormality occurs in 2 temperature polls performed by the BMC;
15) If the working state of the Redfish interface of the HMC is normal, judging whether a GPU temperature abnormality occurs in 2 temperature polls performed by the BMC;
16) If a GPU temperature abnormality occurs in the 2 repeated temperature polls, judging again whether the working state of the Redfish interface of the BMC is abnormal;
17) If no GPU temperature abnormality occurs in the 2 repeated temperature polls, the rotating speed of the server fan is restored.
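The flow of fig. 5 can be sketched as one control function. This is only an illustration: every step function it calls (`gpu_temp_abnormal`, `get_fan_percent`, `set_fan_percent`, `bmc_redfish_ok`, `repair_bmc_redfish`, `hmc_ping_ok`, `repair_first_link`, `hmc_redfish_ok`, `reset_hmc_redfish`, `factory_reset_hmc`) is a hypothetical hook that an integrator would supply, and the `MAX_ROUNDS` bound and default log path are assumptions, since the flow as described loops back to step 6 without stating a limit.

```shell
#!/bin/sh
# Sketch only: all step functions called below are hypothetical hooks that the
# integrator must define; MAX_ROUNDS and the log path are assumed values.
MAX_ROUNDS=${MAX_ROUNDS:-3}

# Append a timestamped error correction event to the log.
log_event() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "${EVENT_LOG:-/tmp/link_repair.log}"
}

run_error_correction() {
    gpu_temp_abnormal || return 0            # steps 2-3: nothing to repair
    saved_speed=$(get_fan_percent)           # remember the speed to restore later
    set_fan_percent 100                      # step 5
    round=0
    while [ "$round" -lt "$MAX_ROUNDS" ]; do
        round=$((round + 1))
        if ! bmc_redfish_ok; then            # steps 6-7
            repair_bmc_redfish
            log_event "bmc_redfish"
        fi
        if ! hmc_ping_ok; then               # steps 8-9
            repair_first_link
            log_event "first_link"
        fi
        if ! hmc_redfish_ok; then            # steps 10-13
            reset_hmc_redfish
            log_event "hmc_redfish level1"
            if ! hmc_redfish_ok; then
                factory_reset_hmc
                log_event "hmc_redfish level2"
            fi
        fi
        if ! gpu_temp_abnormal; then         # steps 14-17
            set_fan_percent "$saved_speed"
            return 0
        fi
    done
    return 1                                 # still abnormal; fan stays at 100%
}
```

When the function returns non-zero, the fan is deliberately left at full speed so that the GPUs remain cooled while the link is still unreliable.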
The embodiment of the invention monitors the temperature value of the graphic processor in real time by acquiring the temperature value of the graphic processor in the graphic processor module detected by the detection link; judging whether a temperature abnormality condition is met according to the temperature value of the graphic processor in the graphic processor module so as to judge the working state of the detection link in real time according to the temperature value of the graphic processor; when the temperature abnormality condition is met, the rotating speed of the fan is increased to radiate heat of the graphic processor module, and error correction is sequentially carried out on the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller and the management interface of the main management controller, so that the abnormality can be repaired in time when the detection link of the server is abnormal, the problem that the temperature of the GPU is overheated for a long time due to the loss of the detection of the real-time temperature state of the GPU is avoided, and the stability and the calculation efficiency of large-scale calculation of the GPU module server are improved.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 6, a block diagram of a structure of an error correction device of a detection link of a server according to an embodiment of the present invention is shown, where the server includes a graphics processor module, the detection link is used to detect a temperature value of a graphics processor in the graphics processor module, and the detection link includes a management interface of a baseboard management controller, a first link between the baseboard management controller and a main management controller, and a management interface of the main management controller, and specifically may include the following modules:
A module detection temperature obtaining module 301, configured to obtain a temperature value of a graphics processor in the graphics processor module detected by the detection link;
A temperature abnormality condition judging module 302, configured to judge whether a temperature abnormality condition is satisfied according to the temperature value of the graphics processor in the graphics processor module;
An abnormality detection link error correction module 303, configured to increase a rotation speed of a fan to dissipate heat of the graphics processor module when the temperature abnormality condition is satisfied, and to correct errors of the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller, and the management interface of the main management controller in sequence.
In the embodiment of the present invention, the abnormality detection link error correction module 303 includes:
The first detection sub-module is used for detecting the working state of the management interface of the baseboard management controller;
The first error correction sub-module is used for correcting errors of the management interface of the baseboard management controller if the working state of the management interface of the baseboard management controller is abnormal;
The second detection sub-module is used for detecting the working state of a first link between the baseboard management controller and the main management controller if the working state of a management interface of the baseboard management controller is normal;
The second error correction sub-module is used for correcting errors of the first link if the working state of the first link is abnormal;
The third detection sub-module is used for detecting the working state of the management interface of the main management controller if the working state of the first link is normal;
The third error correction sub-module is used for correcting errors of the management interface of the main management controller if the working state of the management interface of the main management controller is abnormal;
The temperature acquisition sub-module is used for acquiring the temperature value of the graphics processor after the working state of the management interface of the baseboard management controller and the working state of the first link are normal;
The fan adjusting sub-module is used for restoring the rotating speed of the fan if the temperature values of all the graphics processors are normal.
In an embodiment of the present invention, the first error correction sub-module includes:
The first abnormality determination unit is used for determining that the working state of the management interface of the baseboard management controller is abnormal if the management interface of the baseboard management controller is not activated or occupied;
And the first error correction unit is used for re-enabling the authority of the management interface of the baseboard management controller so as to correct the error of the management interface of the baseboard management controller.
In an embodiment of the present invention, the second error correction sub-module includes:
A second abnormality determining unit, configured to determine that, if the network connection state of the first link is abnormal, the working state of the first link is abnormal;
A second error correction unit, configured to re-establish the network connection of the first link so as to correct the error of the first link.
In an embodiment of the present invention, the management interface of the master management controller includes a first master management controller interface, and the third error correction sub-module includes:
A third abnormality determining unit, configured to determine that the working state of the first master management controller interface is abnormal if the first master management controller interface is not activated or is occupied;
A third error correction unit, configured to reset the management function corresponding to the master management controller so as to correct the error of the management interface of the master management controller.
In an embodiment of the present invention, the management interface of the master management controller further includes a second master management controller interface, and the third error correction sub-module further includes:
And the fourth error correction unit is used for resetting the master management controller through an I2C command to correct the error of the management interface of the master management controller if the working state of the management interface of the master management controller is abnormal after the management function corresponding to the master management controller is reset.
In an embodiment of the present invention, the apparatus further includes:
An error correction event log recording module, configured to record in a log the events of sequentially correcting errors of the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller, and the management interface of the main management controller.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
Referring to fig. 7, a block diagram of an electronic device according to an embodiment of the present invention is shown. The embodiment of the invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the error correction method of the detection link of the server when executing the computer program.
Referring to FIG. 8, a block diagram of a computer-readable storage medium according to an embodiment of the present invention is shown. The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the steps of the error correction method of the detection link of the server when being executed by a processor.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, embodiments of the invention may take the form of a computer program product embodied on one or more machine-readable media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element.
The error correction method and the error correction device for the detection link of a server provided by the invention have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the invention, and the above description of the embodiments is only intended to help understand the method and core ideas of the invention. Meanwhile, those skilled in the art may, in accordance with the ideas of the invention, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the invention.
Claims (10)
1. An error correction method for a detection link of a server, wherein the server includes a graphics processor module, the detection link is used for detecting a temperature value of a graphics processor in the graphics processor module, the detection link includes a management interface of a baseboard management controller, a first link between the baseboard management controller and a master management controller, and a management interface of the master management controller, the method includes:
Acquiring a temperature value of a graphics processor in the graphics processor module detected by the detection link;
judging whether a temperature abnormality condition is met according to the temperature value of the graphic processor in the graphic processor module;
When the temperature abnormal condition is met, the rotating speed of the fan is increased to radiate heat of the graphic processor module, and error correction is sequentially carried out on the management interface of the baseboard management controller, the first link between the baseboard management controller and the main management controller and the management interface of the main management controller.
2. The method of claim 1, wherein sequentially performing error correction on the management interface of the baseboard management controller, the first link between the baseboard management controller and the master management controller, and the management interface of the master management controller comprises:
detecting the working state of the management interface of the baseboard management controller;
if the working state of the management interface of the baseboard management controller is abnormal, performing error correction on the management interface of the baseboard management controller;
if the working state of the management interface of the baseboard management controller is normal, detecting the working state of the first link between the baseboard management controller and the master management controller;
if the working state of the first link is abnormal, performing error correction on the first link;
if the working state of the first link is normal, detecting the working state of the management interface of the master management controller;
if the working state of the management interface of the master management controller is abnormal, performing error correction on the management interface of the master management controller;
acquiring a temperature value of the graphics processor after the working state of the management interface of the baseboard management controller and the working state of the first link are normal; and
if the temperature values of all the graphics processors are normal, restoring the rotation speed of the fan.
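The decision chain of claim 2 can be sketched as below. This is one possible reading, stated as an assumption: each stage is checked in order, the first abnormal stage is corrected, and later stages are only examined when the earlier ones are normal. `Stage`, `correct_first_abnormal`, and `maybe_restore_fan` are hypothetical names, and the callbacks stand in for platform-specific checks that the claim does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Stage:
    name: str
    is_normal: Callable[[], bool]   # working-state check for this stage
    correct: Callable[[], None]     # error correction for this stage


def correct_first_abnormal(stages: List[Stage]) -> Optional[str]:
    """Walk the stages in order and correct the first abnormal one.

    A later stage is only checked when every earlier stage is normal.
    Returns the name of the corrected stage, or None if all are normal.
    """
    for stage in stages:
        if not stage.is_normal():
            stage.correct()
            return stage.name
    return None


def maybe_restore_fan(bmc_interface: Stage,
                      first_link: Stage,
                      read_temps: Callable[[], Sequence[float]],
                      temps_normal: Callable[[Sequence[float]], bool],
                      restore_fan: Callable[[], None]) -> bool:
    """Re-read temperatures once the BMC interface and first link are normal,
    and restore the fan speed only if all readings are normal."""
    if not (bmc_interface.is_normal() and first_link.is_normal()):
        return False
    if temps_normal(read_temps()):
        restore_fan()
        return True
    return False
```

Running `correct_first_abnormal` repeatedly until it returns `None` would work through several faults one stage at a time; whether the claimed method loops in this way is not stated in the claim.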
3. The method according to claim 2, wherein performing error correction on the management interface of the baseboard management controller if the working state of the management interface of the baseboard management controller is abnormal comprises:
if the management interface of the baseboard management controller is not activated or is occupied, determining that the working state of the management interface of the baseboard management controller is abnormal; and
re-enabling the permission of the management interface of the baseboard management controller to correct the error of the management interface of the baseboard management controller.
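The state test in claim 3 amounts to a simple predicate, sketched here under assumptions: `ManagementInterface` is a hypothetical in-memory model, and `re_enable` models "re-enabling the permission" as clearing the occupied flag and activating the interface, which the claim does not spell out.

```python
from dataclasses import dataclass


@dataclass
class ManagementInterface:
    activated: bool
    occupied: bool

    def is_abnormal(self) -> bool:
        # Abnormal when the interface is not activated or is occupied.
        return (not self.activated) or self.occupied

    def re_enable(self) -> None:
        # Assumed effect of re-enabling: release any holder and activate.
        self.occupied = False
        self.activated = True
```

For example, an interface created as `ManagementInterface(activated=True, occupied=True)` is reported abnormal until `re_enable()` is called.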
4. The method according to claim 2, wherein performing error correction on the first link if the working state of the first link is abnormal comprises:
if the network connection state of the first link is abnormal, determining that the working state of the first link is abnormal; and
re-establishing the network connection of the first link to correct the error of the first link.
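Re-establishing the first link, as in claim 4, might look like the following sketch. The bounded retry count `max_attempts` is an assumption added for illustration (the claim only says the connection is re-established), and `is_connected` and `reconnect` are hypothetical callbacks for whatever transport the platform uses.

```python
from typing import Callable


def reestablish_link(is_connected: Callable[[], bool],
                     reconnect: Callable[[], None],
                     max_attempts: int = 3) -> bool:
    """Try to bring the link back up; return True once it is connected."""
    if is_connected():
        return True  # nothing to correct
    for _ in range(max_attempts):
        reconnect()
        if is_connected():
            return True
    return False  # still down after the assumed retry budget
```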
5. The method according to claim 2, wherein the management interface of the master management controller includes a first master management controller interface, and performing error correction on the management interface of the master management controller if the working state of the management interface of the master management controller is abnormal comprises:
if the first master management controller interface is not activated or is occupied, determining that the working state of the first master management controller interface is abnormal; and
resetting the management function corresponding to the master management controller to correct the error of the management interface of the master management controller.
6. The method of claim 5, wherein the management interface of the master management controller further includes a second master management controller interface, and performing error correction on the management interface of the master management controller further comprises:
if the working state of the management interface of the master management controller is still abnormal after the management function corresponding to the master management controller has been reset, resetting the master management controller through an I2C command to correct the error of the management interface of the master management controller.
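Claims 5 and 6 together describe a two-step escalation, which can be sketched as follows. All three callbacks are hypothetical placeholders: the claims name the actions (resetting the management function, then resetting the controller through an I2C command) but not any concrete command or API, so none is shown here.

```python
from typing import Callable


def correct_master_interface(is_abnormal: Callable[[], bool],
                             reset_management_function: Callable[[], None],
                             reset_controller_via_i2c: Callable[[], None]) -> str:
    """Escalate from a management-function reset to a full controller reset.

    Returns which action, if any, was the last one taken.
    """
    if not is_abnormal():
        return "normal"
    reset_management_function()       # first, the lighter reset (claim 5)
    if not is_abnormal():
        return "function_reset"
    reset_controller_via_i2c()        # still abnormal: heavier reset (claim 6)
    return "controller_reset"
```

Trying the lighter reset first means the heavier controller reset is only issued when the first step did not clear the fault.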
7. The method according to claim 1, further comprising:
recording, in a log, the events of sequentially performing error correction on the management interface of the baseboard management controller, the first link between the baseboard management controller and the master management controller, and the management interface of the master management controller.
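The logging step of claim 7 could be sketched with Python's standard `logging` module. The logger name `link_correction`, the helper `correct_and_log`, and the message wording are assumptions; the claim does not describe a log format.

```python
import logging
from typing import Callable

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("link_correction")


def correct_and_log(component: str, correct: Callable[[], None]) -> None:
    """Run one correction step, then record that it happened."""
    correct()
    logger.info("error correction performed on %s", component)
```

Calling `correct_and_log` once per component, in the claimed order, yields one log entry per correction event.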
8. An error correction apparatus for a detection link of a server, wherein the server includes a graphics processor module, the detection link is used to detect a temperature value of a graphics processor in the graphics processor module, and the detection link includes a management interface of a baseboard management controller, a first link between the baseboard management controller and a master management controller, and a management interface of the master management controller, the apparatus comprising:
a module detection temperature acquisition module, configured to acquire a temperature value of a graphics processor in the graphics processor module, as detected by the detection link;
a temperature abnormality condition judging module, configured to judge, according to the temperature value of the graphics processor in the graphics processor module, whether a temperature abnormality condition is met; and
an abnormality detection link error correction module, configured to, when the temperature abnormality condition is met, increase the rotation speed of a fan to dissipate heat from the graphics processor module, and sequentially perform error correction on the management interface of the baseboard management controller, the first link between the baseboard management controller and the master management controller, and the management interface of the master management controller.
9. An electronic device, comprising: a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the error correction method for the detection link of a server according to any one of claims 1-7.
10. A computer-readable storage medium, wherein the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the error correction method for the detection link of a server according to any one of claims 1-7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410840617.8A CN118819927A (en) | 2024-06-26 | 2024-06-26 | A server detection link error correction method, device, equipment and medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118819927A (en) | 2024-10-22 |
Family
ID=93077797
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410840617.8A Withdrawn CN118819927A (en) | 2024-06-26 | 2024-06-26 | A server detection link error correction method, device, equipment and medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118819927A (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109611367A (en) * | 2018-12-11 | 2019-04-12 | 英业达科技有限公司 | Fan control system and server based on CPLD |
| CN113606176A (en) * | 2021-07-23 | 2021-11-05 | 苏州浪潮智能科技有限公司 | Fan running state detection method and device |
| CN116401115A (en) * | 2022-12-16 | 2023-07-07 | 苏州浪潮智能科技有限公司 | A DeltaNext platform GPU monitoring method, system, terminal and storage medium |
| CN116643631A (en) * | 2023-05-09 | 2023-08-25 | 苏州浪潮智能科技有限公司 | A server cooling method, device, electronic equipment and storage medium |
| CN220022314U (en) * | 2023-06-13 | 2023-11-14 | 合肥市卓怡恒通信息安全有限公司 | Overheat protection circuit of graphic processor |
| CN117519334A (en) * | 2023-10-30 | 2024-02-06 | 苏州元脑智能科技有限公司 | Temperature control method and device for server, electronic equipment and storage medium |
Filed 2024-06-26 as CN202410840617.8A (publication CN118819927A); status: withdrawn.
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119512349A (en) * | 2025-01-20 | 2025-02-25 | 杭州天舰信息技术股份有限公司 | Data center energy saving control method, system and storage medium |
| CN119512349B (en) * | 2025-01-20 | 2025-07-22 | 杭州天舰信息技术股份有限公司 | Data center energy-saving control method, system and storage medium |
| CN120723528A (en) * | 2025-09-02 | 2025-09-30 | 安擎计算机信息股份有限公司 | A method and device for quickly locating faults in an MGX system |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN118819927A (en) | A server detection link error correction method, device, equipment and medium | |
| TWI595760B (en) | Server resource management system and management method thereof | |
| CN116820820A (en) | Server fault monitoring method and system | |
| US20120136502A1 (en) | Fan speed control system and fan speed reading method thereof | |
| CN106844162A (en) | Storage server cabinet management system and method based on BMC | |
| CN108319525A (en) | Switch device and method for detecting integrated circuit bus | |
| US10129114B1 (en) | Protocol exposure as network health detection | |
| CN105335214A (en) | A Method for Virtual Machine Fault Detection and Recovery | |
| US8055725B2 (en) | Method, apparatus and program product for remotely restoring a non-responsive computing system | |
| CN114218004A (en) | Method and system for fault handling of physical nodes of Kubernetes cluster based on BMC | |
| CN118708418B (en) | Server software and hardware information diagnosis system and method | |
| JP2024156644A (en) | How to manage servers with an IT resource management system | |
| CN107528705B (en) | Troubleshooting method and device | |
| CN117931581A (en) | Graphic processor monitoring method, device, medium and server monitoring system | |
| CN117500227A (en) | Heat dissipation control method and device, electronic equipment and storage medium | |
| US20230024444A1 (en) | Systems and methods for mitigating power failover | |
| CN113867815B (en) | Method for monitoring server suspension and automatically restarting and server applying same | |
| US11294761B1 (en) | Apparatus, system, and method for correcting slow field-replaceable units in network devices | |
| CN118631545B (en) | Isolation method and device for distributed nodes | |
| CN110377450A (en) | A kind of hardware anomalies processing method, system and associated component | |
| US10365934B1 (en) | Determining and reporting impaired conditions in a multi-tenant web services environment | |
| CN109491867A (en) | A kind of communication automatic recovery method and device | |
| Kitamura | Configuration of a Power-saving High-availability Server System Incorporating a Hybrid Operation Method | |
| CN107590053A (en) | A kind of hardware monitoring system and method | |
| CN118555147B (en) | A protection method, firewall system and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20241022 |