CN115586983A - Server fault recovery method, device, equipment and storage medium - Google Patents

Server fault recovery method, device, equipment and storage medium

Info

Publication number
CN115586983A
Authority
CN
China
Prior art keywords
server
platform management
management controller
controlling
restart
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211190686.6A
Other languages
Chinese (zh)
Inventor
曹卫国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202211190686.6A
Publication of CN115586983A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Hardware Redundancy (AREA)

Abstract

The application discloses a server fault recovery method, apparatus, device, and storage medium in the technical field of servers. The method comprises: when a binding relationship is detected between a first server and a second server, monitoring the sending of heartbeat messages between a first baseboard management controller of the first server and a second baseboard management controller of the second server; if the first server is found not to have received a response to the heartbeat message, determining that the second server has failed; and detecting whether the intelligent platform management interface of the second server is normal and, if not, controlling a complex programmable logic device to restart the second server to complete fault recovery. In other words, the method first checks whether two nearby servers are bound to each other. If they are, the two servers, which communicate through a hardware interface, can monitor and control each other, and when one of them fails, a restart is initiated through the complex programmable logic device to complete fault recovery.

Description

Server fault recovery method, device, equipment and storage medium
Technical Field
The present invention relates to the field of server technologies, and in particular, to a method, an apparatus, a device, and a storage medium for server failure recovery.
Background
A server is the core of a network system and computing platform, and much important data is stored on it. The BMC (Baseboard Management Controller) acts as the server's steward: it uses sensors to monitor the status of a computer, web server, or other hardware-driven device, and it holds control over the server. The BMC can also control the flashing of the BIOS (Basic Input Output System), which is the first stage of server startup and has the highest authority over the server. Keeping the BMC working stably is therefore equivalent to keeping the server working stably. If server faults are discovered and resolved only through manual monitoring and the administrator's own experience, the degree of automation is low, and manual handling of a server fault can easily cause a secondary fault.
In summary, how to detect server faults automatically and actively complete fault recovery is a technical problem to be solved in this field.
Disclosure of Invention
In view of the above, the present invention provides a server fault recovery method, apparatus, device, and storage medium that can automatically detect a server fault and actively complete fault recovery. The specific scheme is as follows:
In a first aspect, the present application discloses a server fault recovery method, including:
when a binding relationship is detected between a first server and a second server, monitoring the sending of heartbeat messages between a first baseboard management controller of the first server and a second baseboard management controller of the second server;
if the first server is found not to have received a response to the heartbeat message, determining that the second server has failed;
detecting whether the intelligent platform management interface of the second server is normal and, if not, controlling a complex programmable logic device to restart the second server to complete fault recovery.
Optionally, monitoring the sending of heartbeat messages between the first baseboard management controller of the first server and the second baseboard management controller of the second server when a binding relationship is detected between the first server and the second server includes:
when a binding relationship between the first server and the second server is detected through a preset communication interface, controlling the first baseboard management controller and the second baseboard management controller to send heartbeat messages and/or receive responses to the heartbeat messages at a preset time interval;
monitoring the heartbeat message sending operation and/or the response receiving operation between the first baseboard management controller of the first server and the second baseboard management controller of the second server.
Optionally, determining that the second server has failed if the first server is found not to have received a response to the heartbeat message includes:
if the first server is found not to have received a response to the heartbeat message, determining that the second baseboard management controller of the second server has hung.
Optionally, after monitoring the sending of heartbeat messages between the first baseboard management controller of the first server and the second baseboard management controller of the second server, the method further includes:
if the second server is found not to have received a response to the heartbeat message, determining that the first server has failed;
correspondingly, detecting whether the intelligent platform management interface of the second server is normal and, if not, controlling the complex programmable logic device to restart the second server to complete fault recovery includes:
detecting whether the intelligent platform management interface of the first server is normal and, if not, controlling the complex programmable logic device to restart the first server to complete fault recovery.
Optionally, the process of detecting whether the intelligent platform management interface of the second server is normal and, if not, controlling the complex programmable logic device to restart the second server to complete fault recovery further includes:
if the intelligent platform management interface of the second server is detected to be abnormal, recording a field log for use in analyzing the cause of the fault.
Optionally, after detecting whether the intelligent platform management interface of the second server is normal, the method further includes:
using the first server to access the second server through a preset communication interface, and controlling the second server to restart to complete fault recovery.
Optionally, controlling the complex programmable logic device to restart the second server to complete fault recovery includes:
controlling the complex programmable logic device through GPIO to restart the second server to complete fault recovery.
In a second aspect, the present application discloses a server failure recovery apparatus, including:
the message sending module is used for monitoring the sending of heartbeat messages between a first baseboard management controller of the first server and a second baseboard management controller of the second server when a binding relationship is detected between the first server and the second server;
the failure judgment module is used for judging that the second server has a failure problem if the first server is monitored not to receive the response information of the heartbeat message;
and the fault recovery module is used for detecting whether the intelligent platform management interface of the second server is normal or not, and controlling the complex programmable logic device to restart the second server to complete fault recovery if the intelligent platform management interface of the second server is abnormal.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the server failure recovery method disclosed in the foregoing.
In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program; wherein the computer program realizes the steps of the server failure recovery method disclosed in the foregoing when executed by a processor.
Thus, the application discloses a server fault recovery method comprising: when a binding relationship is detected between a first server and a second server, monitoring the sending of heartbeat messages between a first baseboard management controller of the first server and a second baseboard management controller of the second server; if the first server is found not to have received a response to the heartbeat message, determining that the second server has failed; and detecting whether the intelligent platform management interface of the second server is normal and, if not, controlling the complex programmable logic device to restart the second server to complete fault recovery. The method first detects whether a binding relationship exists between two nearby servers. If it does, each server can monitor the other for faults and control the other to perform fault recovery. The servers check each other by exchanging heartbeat messages: if one side receives no response to a heartbeat message, the other server is judged to have failed, and the complex programmable logic device is immediately controlled to restart the failed server. The two servers thus monitor and control each other over a hardware interface, and when a fault occurs, a restart is initiated through the complex programmable logic device to complete fault recovery.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below are only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a server failure recovery method disclosed in the present application;
FIG. 2 is a flow chart of a specific server failure recovery method disclosed in the present application;
FIG. 3 is a flow chart of a method for controlling interaction between two servers disclosed in the present application;
FIG. 4 is a schematic structural diagram of a server failure recovery apparatus disclosed in the present application;
FIG. 5 is a block diagram of an electronic device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
A server is the core of a network system and computing platform, and much important data is stored on it. The BMC, in turn, acts as the server's steward: it uses sensors to monitor the status of a computer, web server, or other hardware-driven device, and it holds control over the server. The BMC can also control the flashing of the BIOS, which is the first stage of server startup and has the highest authority over the server. Keeping the BMC working stably is therefore equivalent to keeping the server working stably. If server faults are discovered and resolved only through manual monitoring and the administrator's own experience, the degree of automation is low, and manual handling of a server fault can easily cause a secondary fault.
The present application therefore provides a server fault recovery scheme that can detect server faults automatically and actively complete fault recovery.
Referring to FIG. 1, an embodiment of the present invention discloses a server failure recovery method, including:
Step S11: when a binding relationship is detected between a first server and a second server, the sending of heartbeat messages between a first baseboard management controller of the first server and a second baseboard management controller of the second server is monitored.
In this embodiment, when a binding relationship is detected between two nearby servers, the heartbeat messages exchanged by the two bound servers are monitored. The heartbeat messages may be sent by the first server to the second server, by the second server to the first server, or by both servers; this is not specifically limited here.
Specifically, when a binding relationship between the first server and the second server is detected through a preset communication interface, the first baseboard management controller and the second baseboard management controller are controlled to send heartbeat messages and/or receive responses to them at a preset time interval, and the heartbeat sending and/or response receiving operations between the two controllers are monitored. For example, the two servers may exchange a heartbeat message every 5 minutes. The binding relationship is established as follows: when the first server and the second server are detected to be communicating through a specific hardware interface provided by their motherboards, which serves as the preset communication interface, the two servers are considered bound through that interface. Once the binding between the two nearby servers is confirmed, the BMC of the first server and the BMC of the second server are controlled to start sending heartbeat messages. In one embodiment, the first server is configured as the sender and the second server as the receiver: when the first server sends a heartbeat message to the second server through the preset communication interface, for example a serial port, a second server in a normal state must return a response to that heartbeat message.
In another embodiment, the second server is configured as the sender and the first server as the receiver: when the second server sends a heartbeat message to the first server through the preset communication interface, for example a serial port, a first server in a normal state must return a response to that heartbeat message.
In yet another embodiment, both the first server and the second server may be configured as heartbeat senders; correspondingly, each server also acts as the sender of responses to the other's heartbeat messages.
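The sender and receiver roles described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the frame layout (a two-byte tag plus a sequence number), the function names, and the in-memory `Link` that stands in for the serial port are all hypothetical.

```python
import struct
from collections import deque
from typing import Deque, Tuple

HEARTBEAT = b"HB"   # hypothetical tag for a heartbeat frame
RESPONSE = b"AK"    # hypothetical tag for a response frame


def encode(tag: bytes, seq: int) -> bytes:
    """Pack a 2-byte tag and a 32-bit sequence number into one frame."""
    return struct.pack(">2sI", tag, seq)


def decode(frame: bytes) -> Tuple[bytes, int]:
    """Unpack a frame produced by encode()."""
    tag, seq = struct.unpack(">2sI", frame)
    return tag, seq


class Link:
    """In-memory stand-in for the preset communication interface (e.g. a serial port)."""

    def __init__(self) -> None:
        self.to_receiver: Deque[bytes] = deque()
        self.to_sender: Deque[bytes] = deque()


def send_heartbeat(link: Link, seq: int) -> None:
    """Sender side: queue one heartbeat frame for the peer BMC."""
    link.to_receiver.append(encode(HEARTBEAT, seq))


def respond(link: Link, healthy: bool = True) -> None:
    """Receiver side: a healthy BMC answers each pending heartbeat; a hung one stays silent."""
    while link.to_receiver:
        tag, seq = decode(link.to_receiver.popleft())
        if healthy and tag == HEARTBEAT:
            link.to_sender.append(encode(RESPONSE, seq))


def response_received(link: Link, seq: int) -> bool:
    """Sender side: check whether a response matching seq has arrived."""
    while link.to_sender:
        tag, got = decode(link.to_sender.popleft())
        if tag == RESPONSE and got == seq:
            return True
    return False
```

For the bidirectional embodiment, each server would run both the sender and the receiver side, for instance over two `Link` objects, one per direction.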
Step S12: if the first server is monitored not to receive the response information of the heartbeat message, the second server is judged to have a fault problem.
In this embodiment, if the first server is found not to have received a response to the heartbeat message, the second baseboard management controller of the second server is judged to have hung. After the first server sends a heartbeat message to the second server over the serial link, it waits for the second server's response. If the response arrives within a preset time period, the second server has no fault; if it does not, the second server is judged to have failed.
Conversely, if the second server is found not to have received a response to the heartbeat message, the first server is judged to have failed. That is, when the first server acts as the heartbeat responder and the second server as the heartbeat sender, the second server sends a heartbeat message and waits for a preset time; if no response from the first server arrives, the first server is judged to have failed.
In this embodiment, if the server acting as the heartbeat responder has not received a heartbeat message from its bound peer for a long time, it may likewise judge that the peer may currently have failed.
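The timeout rule above applies in both directions: a sender waiting for a response and a responder waiting for the next heartbeat both ask "how long since I last heard from the peer?". A minimal sketch, assuming a hypothetical `PeerMonitor` class and caller-supplied timestamps in seconds (the source names neither):

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Tracks when the bound peer was last heard from and flags a suspected fault."""

    timeout_s: float           # hypothetical value for the "preset time period"
    last_seen_s: float = 0.0   # time of the last response or heartbeat from the peer

    def mark_seen(self, now_s: float) -> None:
        """Record that a response (or heartbeat) from the peer just arrived."""
        self.last_seen_s = now_s

    def peer_faulty(self, now_s: float) -> bool:
        """True once the peer has been silent for longer than the timeout."""
        return (now_s - self.last_seen_s) > self.timeout_s
```

Passing the current time in explicitly, rather than reading a clock inside the class, keeps the rule easy to test; in firmware the caller would supply a monotonic clock reading.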
Step S13: and detecting whether the intelligent platform management interface of the second server is normal, and if not, controlling the complex programmable logic device to restart the second server to complete fault recovery.
In this embodiment, the complex programmable logic device is controlled through GPIO to restart the second server, completing fault recovery. Specifically, whether the IPMI (Intelligent Platform Management Interface) of the second server is normal is checked over the network; if it is not, the CPLD (Complex Programmable Logic Device) of the second server is driven through GPIO (General-Purpose Input/Output) to restart the second server and recover from the hang.
Symmetrically, when the first server is the one judged to have failed, whether the intelligent platform management interface of the first server is normal is detected, and if it is not, the complex programmable logic device is controlled to restart the first server to complete fault recovery. That is, whether the IPMI of the first server is normal is checked over the network; if it is not, the CPLD of the first server is driven through GPIO to restart the first server and recover from the hang.
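The recovery step above can be sketched as a reset pulse on a GPIO line that feeds the peer's CPLD. Everything hardware-specific here is an assumption for illustration: the `ResetLine` class is a software stand-in for a real pin, the active-low polarity and the hold time are hypothetical, and the source does not say what happens when the IPMI check passes, so this sketch simply takes no action in that case.

```python
import time
from typing import Callable, List


class ResetLine:
    """Hypothetical stand-in for the GPIO wired to the peer's CPLD reset input."""

    def __init__(self) -> None:
        self.levels: List[int] = [1]   # idle high; active-low reset is an assumption

    def write(self, level: int) -> None:
        self.levels.append(level)


def pulse_reset(line: ResetLine, hold_s: float = 0.1,
                sleep: Callable[[float], None] = time.sleep) -> None:
    """Assert the reset line, hold it, then release it."""
    line.write(0)
    sleep(hold_s)
    line.write(1)


def recover_if_ipmi_abnormal(ipmi_normal: bool, line: ResetLine, hold_s: float = 0.1,
                             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Pulse the CPLD reset only when the IPMI check failed; return True if a reset was issued."""
    if ipmi_normal:
        return False
    pulse_reset(line, hold_s, sleep)
    return True
```

The `sleep` parameter is injectable so the pulse can be exercised without real delays.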
Thus, the application discloses a server fault recovery method comprising: when a binding relationship is detected between a first server and a second server, monitoring the sending of heartbeat messages between a first baseboard management controller of the first server and a second baseboard management controller of the second server; if the first server is found not to have received a response to the heartbeat message, determining that the second server has failed; and detecting whether the intelligent platform management interface of the second server is normal and, if not, controlling the complex programmable logic device to restart the second server to complete fault recovery. The method first detects whether a binding relationship exists between two nearby servers. If it does, each server can monitor the other for faults and control the other to perform fault recovery. The servers check each other by exchanging heartbeat messages: if one side receives no response to a heartbeat message, the other server is judged to have failed, and the complex programmable logic device is immediately controlled to restart the failed server. The two servers thus monitor and control each other over a hardware interface, and when a fault occurs, a restart is initiated through the complex programmable logic device to complete fault recovery.
Referring to FIG. 2, an embodiment of the present invention discloses a specific server failure recovery method; compared with the previous embodiment, this embodiment further describes and refines the technical solution. Specifically, the method comprises the following steps:
Step S21: when a binding relationship is detected between a first server and a second server, the sending of heartbeat messages between a first baseboard management controller of the first server and a second baseboard management controller of the second server is monitored.
Step S22: and if the first server is monitored not to receive the response information of the heartbeat message, judging that the second server has a fault problem.
For a more detailed processing procedure in steps S21 and S22, please refer to the content of the foregoing disclosed embodiments, which is not described herein again.
Step S23: and detecting whether the intelligent platform management interface of the second server is normal, and recording a field log if the intelligent platform management interface of the second server is detected to be abnormal so as to be used for analyzing the fault reason.
In this embodiment, if the intelligent platform management interface of the second server is detected to be abnormal, the abnormality is recorded and reported to the web control end, and the failing server records its "last words" (a final log written as it goes down) so that the cause of the failure can be analyzed.
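A field log entry of this kind might look like the following sketch. The record fields, the function name, and the choice of one JSON object per line are assumptions for illustration; the source does not specify a log format or how the record reaches the web control end.

```python
import json


def build_fault_record(server: str, ipmi_normal: bool,
                       timestamp_s: float, detail: str = "") -> str:
    """Serialize one fault observation as a single JSON line (hypothetical format)."""
    record = {
        "server": server,                                   # which server was checked
        "event": "ipmi_normal" if ipmi_normal else "ipmi_abnormal",
        "timestamp_s": timestamp_s,                         # supplied by the caller
        "detail": detail,                                   # free-form context
    }
    return json.dumps(record, sort_keys=True)
```

Each returned line could be appended to a local log file and also forwarded to the management console, so the record survives even if the faulty server is restarted immediately afterwards.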
Step S24: and accessing the second server by using the first server through a preset communication interface, and controlling the second server to restart to complete fault recovery.
In this embodiment, the first server accesses the second server through the preset communication interface and controls it to restart. When the IPMI main process or the network of one server is abnormal, that server can be accessed and controlled indirectly through the other server. In this way, relying only on the hardware communication interface established between the first server and the second server, the server in the normal state controls the abnormal server to perform fault recovery operations such as a restart.
The invention is described using an Intel-platform server architecture, but the method is not limited to Intel-platform servers; it applies generally to servers of other platforms and to other computer platforms, and no specific limitation is made here.
Referring to FIG. 3, the design relies on the following hardware. The two server motherboards each provide a specific hardware interface, such as a serial port, through which the two servers communicate. Two GPIOs are used to restart a BMC, by controlling the CPLD, after the BMC hangs. Two Reset keys are used to restart a server without powering it off when the server hangs. Two UARTs (Universal Asynchronous Receiver/Transmitter) provide the serial link; a UART is a chip with parallel input and serial output, is usually integrated on the motherboard, and is in most cases a 16550AFN chip.
The overall flow is therefore as follows. First, a server A is selected, and it is checked whether a nearby server is bound to it. If a server B is detected to be bound to server A, the heartbeat exchange between server A and server B is monitored: server A sends a heartbeat message, server B returns a response to it, and both the sending and the returning operations are monitored. Monitoring may be performed once per preset time interval. If, in some monitoring round, server A is found not to have received server B's response, server B is judged to have failed. At that point, whether the GPIO of server B is normal is checked; if it is not, the CPLD can be controlled to reset the BMC of server B. Alternatively, if the IPMI main process of server B is abnormal, server A can access and control server B through the hardware interface so that the BMC of server B returns to normal.
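Steps S21 to S24 of FIG. 2 can be lined up in one function that reports, for a single monitoring round, which actions server A would take. This is a sketch under a stated assumption: it reads S24 (restart over the preset interface) as running after the IPMI check regardless of the check's result, which the source does not spell out. The function name and action strings are hypothetical.

```python
from typing import List


def fig2_round(bound: bool, response_received: bool, ipmi_normal: bool) -> List[str]:
    """Return the ordered actions server A takes in one monitoring round."""
    actions: List[str] = []
    if not bound:
        return actions                       # no bound peer: nothing to monitor
    actions.append("S21: monitor heartbeat")
    if response_received:
        return actions                       # peer answered: no fault this round
    actions.append("S22: judge peer faulty")
    if not ipmi_normal:
        actions.append("S23: record field log")
    # Assumption: S24 runs after the IPMI check regardless of its result.
    actions.append("S24: restart peer over preset interface")
    return actions
```

Calling `fig2_round(True, False, False)` walks all four steps, while a healthy peer stops the round after S21.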
This fault recovery and protection mechanism, in which the BMCs of two nearby servers monitor and control each other, makes the servers more robust and their service more stable, which in turn helps guarantee the stability of big data services. When a fault occurs, the current fault log is collected immediately so that the cause of the fault can be analyzed.
Referring to FIG. 4, an embodiment of the present invention discloses a specific server failure recovery apparatus, including:
a message sending module 11, configured to monitor the sending of heartbeat messages between a first baseboard management controller of a first server and a second baseboard management controller of a second server when a binding relationship is detected between the first server and the second server;
a failure determining module 12, configured to determine that a failure occurs in the second server if it is monitored that the first server does not receive the response information of the heartbeat packet;
and the fault recovery module 13 is configured to detect whether an intelligent platform management interface of the second server is normal, and if the intelligent platform management interface of the second server is not normal, control the complex programmable logic device to restart the second server to complete fault recovery.
The message sending module 11 is specifically configured to monitor the heartbeat messages exchanged by two nearby servers when a binding relationship is detected between them. The heartbeat messages may be sent by the first server to the second server, by the second server to the first server, or by both servers; this is not specifically limited here. When a binding relationship between the first server and the second server is detected through a preset communication interface, the first baseboard management controller and the second baseboard management controller are controlled to send heartbeat messages and/or receive responses to them at a preset time interval, and the heartbeat sending and/or response receiving operations between the two controllers are monitored. For example, the two servers may exchange a heartbeat message every 5 minutes. When the first server and the second server are detected to be communicating through a specific hardware interface provided by their motherboards, which serves as the preset communication interface, the two servers are considered bound through that interface.
Once the binding between the two nearby servers is confirmed, the BMC of the first server and the BMC of the second server are controlled to start sending heartbeat messages. In one embodiment, the first server is configured as the sender and the second server as the receiver: when the first server sends a heartbeat message to the second server through the preset communication interface, for example a serial port, a second server in a normal state must return a response to that heartbeat message. In another embodiment, the second server is configured as the sender and the first server as the receiver: when the second server sends a heartbeat message to the first server through the preset communication interface, for example a serial port, a first server in a normal state must return a response to that heartbeat message. In yet another embodiment, both servers may be configured as heartbeat senders; correspondingly, each server also acts as the sender of responses to the other's heartbeat messages.
The failure recovery module 13 is specifically configured so that, in the process of the first server accessing the second server through the preset communication interface and controlling the second server to restart, when the IPMI main process or the network of one server is abnormal, that server is accessed and controlled indirectly through the other server. In this way, relying only on the hardware communication interface established between the first server and the second server, the server in the normal state controls the abnormal server to perform fault recovery operations such as a restart. The invention is described using the server architecture of an Intel platform, but the method is not limited to Intel platform servers; it is generally applicable to servers of other platforms and to other computer platforms, which is not specifically limited herein. First, the two server motherboards provide specific hardware interfaces, such as serial ports, through which the two servers communicate. The hardware involved is as follows: two GPIOs, used to restart the BMC by controlling the CPLD after the BMC hangs; two Reset buttons, used to restart a server without powering it off when the server hangs; and two UARTs. A UART is a chip that takes parallel input and produces serial output; it is usually integrated on the motherboard and is in most cases a 16550AFN chip.
Accordingly, a server A is first determined, and it is then monitored whether a nearby server is bound to server A. If a binding relationship between server B and server A is detected, the heartbeat message sending operation between server A and server B is monitored: when server A sends a heartbeat message, server B returns response information for that heartbeat message, and both the sending operation and the returning operation between the two servers are monitored. Monitoring may be performed once every preset time interval. If, during one monitoring pass, it is found that server A has not received the response information from server B, server B is judged to have a fault. At this point, it is detected whether the GPIO in server B is normal; if not, the CPLD can be controlled to reset the BMC of server B. Alternatively, if the IPMI main process of server B is abnormal, server A can access and control server B through the hardware interface, so that the BMC of server B is restored to normal.
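The decision sequence in the server A / server B example can be summarised as an ordered list of steps. This is a sketch of one reading of the flow, not the patented implementation: the function name `plan_recovery` and the step labels are hypothetical, and because the source does not say what happens when the interface check passes, the sketch simply stops after the check in that case.

```python
from typing import List


def plan_recovery(reply_received: bool, ipmi_normal: bool) -> List[str]:
    """Return the ordered steps server A takes toward server B (illustrative)."""
    if reply_received:
        return []  # server B answered the heartbeat; nothing to do
    # No response: server B is judged faulty, then its management interface is checked.
    steps = ["judge_peer_faulty", "check_peer_ipmi"]
    if not ipmi_normal:
        # Record a field log for later analysis, then restart through the CPLD.
        steps += ["record_field_log", "restart_peer_via_cpld"]
    return steps
```

For example, `plan_recovery(False, False)` yields the full sequence ending in a CPLD restart, while `plan_recovery(True, False)` yields an empty list because the heartbeat was answered.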
Accordingly, the present application discloses a server failure recovery method, comprising: when a binding relationship between a first server and a second server is detected, monitoring the heartbeat message sending operation between a first platform management controller of the first server and a second platform management controller of the second server; if it is monitored that the first server does not receive response information for the heartbeat message, judging that the second server has a fault; and detecting whether the intelligent platform management interface of the second server is normal, and if not, controlling the complex programmable logic device to restart the second server to complete fault recovery. In other words, it is first detected whether a binding relationship exists between two nearby servers. If so, the two servers can monitor and control each other: each can detect whether the other has a fault and can control the other to perform fault recovery. The servers check each other by sending heartbeat messages; if one server receives no response to its heartbeat message, the other server is judged to have a fault, and the complex programmable logic device is immediately controlled to restart the faulty server. Thus, once connected through the hardware interface, the servers detect and control each other, and when a fault occurs a restart is initiated through the complex programmable logic device to complete fault recovery.
In some specific embodiments, the message sending module 11 may specifically include:
a message sending unit, configured to, when a binding relationship between the first server and the second server is detected through a preset communication interface, control the first platform management controller and the second platform management controller to send heartbeat messages and/or receive response information for the heartbeat messages at preset time intervals;
and to monitor the heartbeat message sending operation and/or the response information receiving operation between the first platform management controller of the first server and the second platform management controller of the second server.
In some specific embodiments, the fault determining module 12 may specifically include:
a fault judging unit, configured to judge that a hang fault has occurred in the second platform management controller of the second server if it is monitored that the first server does not receive the response information for the heartbeat message.
In some specific embodiments, the server failure recovery apparatus may specifically include:
a failure determining unit, configured to determine that a failure occurs in the first server if it is monitored that the second server does not receive response information of the heartbeat message;
correspondingly, the step of detecting whether the intelligent platform management interface of the second server is normal and, if not, controlling the complex programmable logic device to restart the second server to complete fault recovery comprises: detecting whether the intelligent platform management interface of the first server is normal, and if not, controlling the complex programmable logic device to restart the first server to complete fault recovery.
In some specific embodiments, the server failure recovery apparatus may specifically include:
a log recording unit, configured to record a field log for analyzing the cause of the fault if the intelligent platform management interface of the second server is detected to be abnormal.
In some specific embodiments, the server failure recovery apparatus may specifically include:
an access restarting unit, configured to access the second server from the first server through a preset communication interface and control the second server to restart, thereby completing fault recovery.
In some specific embodiments, the failure recovery module 13 may specifically include:
a fault recovery unit, configured to control the complex programmable logic device through the GPIO to restart the second server, thereby completing fault recovery.
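The GPIO-driven restart performed by the fault recovery unit might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the source does not specify the GPIO polarity, the pulse width, or any driver API, so `RecordingGpio`, `pulse_reset`, the active-low level, and the 0.1-second hold time are all hypothetical, and the GPIO object only records what is written instead of touching hardware.

```python
import time
from typing import Callable, List


class RecordingGpio:
    """Stand-in for a real GPIO line; it only records the levels written."""

    def __init__(self) -> None:
        self.levels: List[int] = []

    def write(self, level: int) -> None:
        self.levels.append(level)


def pulse_reset(gpio: RecordingGpio,
                hold_s: float = 0.1,
                sleep: Callable[[float], None] = time.sleep) -> None:
    """Drive the reset request line active, hold it, then release it."""
    gpio.write(0)   # assumed active-low request to the CPLD
    sleep(hold_s)   # assumed hold time; not given in the source
    gpio.write(1)   # release the line so the CPLD can complete the restart
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; on real hardware the GPIO object would be replaced by whatever interface the BMC firmware exposes.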
Further, an electronic device is disclosed in the embodiments of the present application, and fig. 5 is a block diagram of the electronic device 20 according to an exemplary embodiment, which should not be construed as limiting the scope of the application.
Fig. 5 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein, the memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps in the server failure recovery method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.
The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 21 may be implemented in at least one hardware form of DSP (Digital Signal Processor), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
In addition, the memory 22 serves as a carrier for resource storage and may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like. The resources stored on it may include an operating system 221, a computer program 222, and so on, and the storage may be transient or permanent.
The operating system 221 manages and controls each hardware device and the computer program 222 on the electronic device 20, so that the processor 21 can operate on and process the mass data 223 in the memory 22; it may be Windows Server, NetWare, Unix, Linux, or the like. In addition to the computer program that performs the server failure recovery method disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for performing other specific tasks. The data 223 may include data received by the electronic device from an external device, as well as data collected by its own input/output interface 25.
Further, the present application also discloses a computer-readable storage medium for storing a computer program; wherein the computer program when executed by a processor implements the server failure recovery method disclosed above. For the specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, which are not described herein again.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, it is described only briefly, and the relevant details can be found in the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The server failure recovery method, device, apparatus, and storage medium provided by the present invention are described in detail above, and a specific example is applied in the present disclosure to explain the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. A method for server failure recovery, comprising:
when the binding relationship between a first server and a second server is detected, monitoring heartbeat message sending operation between a first platform management controller of the first server and a second platform management controller of the second server;
if the first server is monitored not to receive the response information of the heartbeat message, judging that the second server has a fault;
and detecting whether the intelligent platform management interface of the second server is normal, and if not, controlling the complex programmable logic device to restart the second server to complete fault recovery.
2. The method for recovering server failure according to claim 1, wherein the monitoring a heartbeat messaging operation between a first platform management controller of a first server and a second platform management controller of a second server when detecting that a binding relationship exists between the first server and the second server includes:
when detecting that a binding relationship exists between a first server and a second server through a preset communication interface, controlling the first platform management controller and the second platform management controller to send heartbeat messages and/or receive response information for the heartbeat messages at preset time intervals;
and monitoring a heartbeat message sending operation and/or a response information receiving operation between the first platform management controller of the first server and the second platform management controller of the second server.
3. The method for recovering server failure according to claim 1, wherein if it is monitored that the first server does not receive the response information of the heartbeat message, judging that the second server has a fault comprises:
and if the first server is monitored not to receive the response information of the heartbeat message, judging that a hang-up fault occurs in a second platform management controller of the second server.
4. The method of claim 1, wherein after monitoring the heartbeat messaging operation between the first platform management controller of the first server and the second platform management controller of the second server, the method further comprises:
if the second server is monitored not to receive the response information of the heartbeat message, judging that the first server has a fault problem;
correspondingly, the detecting whether the intelligent platform management interface of the second server is normal or not, if not, controlling the complex programmable logic device to restart the second server to complete fault recovery includes:
and detecting whether the intelligent platform management interface of the first server is normal, and if not, controlling the complex programmable logic device to restart the first server to complete fault recovery.
5. The method for recovering server failure according to claim 1, wherein the process of detecting whether the intelligent platform management interface of the second server is normal and, if not, controlling the complex programmable logic device to restart the second server to complete fault recovery further comprises:
and if the intelligent platform management interface of the second server is detected to be abnormal, recording a field log so as to be used for analyzing the fault reason.
6. The method for recovering server failure according to any one of claims 1 to 5, wherein after detecting whether the intelligent platform management interface of the second server is normal, the method further includes:
and accessing the second server by using the first server through a preset communication interface, and controlling the second server to restart to complete fault recovery.
7. The server failure recovery method of claim 1, wherein controlling the complex programmable logic device to restart the second server to complete failure recovery comprises:
and controlling the complex programmable logic device to restart the second server through the GPIO to complete fault recovery.
8. A server failure recovery apparatus, comprising:
the message sending module is used for monitoring heartbeat message sending operation between a first platform management controller of the first server and a second platform management controller of the second server when the binding relationship exists between the first server and the second server;
the failure judgment module is used for judging that the second server has a failure problem if the first server is monitored not to receive the response information of the heartbeat message;
and the fault recovery module is used for detecting whether the intelligent platform management interface of the second server is normal or not, and controlling the complex programmable logic device to restart the second server to complete fault recovery if the intelligent platform management interface of the second server is abnormal.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the server failure recovery method according to any of claims 1 to 7.
10. A computer-readable storage medium for storing a computer program; wherein the computer program when executed by a processor implements the steps of the server failure recovery method of any one of claims 1 to 7.
CN202211190686.6A 2022-09-28 2022-09-28 Server fault recovery method, device, equipment and storage medium Pending CN115586983A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211190686.6A CN115586983A (en) 2022-09-28 2022-09-28 Server fault recovery method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211190686.6A CN115586983A (en) 2022-09-28 2022-09-28 Server fault recovery method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115586983A true CN115586983A (en) 2023-01-10

Family

ID=84773107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211190686.6A Pending CN115586983A (en) 2022-09-28 2022-09-28 Server fault recovery method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115586983A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination