CN115586983A

CN115586983A - Server fault recovery method, device, equipment and storage medium

Info

Publication number: CN115586983A
Application number: CN202211190686.6A
Authority: CN
Inventors: 曹卫国
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2022-09-28
Filing date: 2022-09-28
Publication date: 2023-01-10

Abstract

The application discloses a server fault recovery method, apparatus, device, and storage medium, relating to the technical field of servers. The method comprises: when a binding relationship is detected between a first server and a second server, monitoring the heartbeat message sending operation between a first baseboard management controller (BMC) of the first server and a second baseboard management controller of the second server; if it is monitored that the first server has not received response information for the heartbeat message, determining that the second server has a fault; and detecting whether the Intelligent Platform Management Interface (IPMI) of the second server is normal and, if not, controlling a complex programmable logic device (CPLD) to restart the second server to complete fault recovery. In other words, the method first detects whether a binding relationship exists between two nearby servers; if so, the two servers, once connected through a hardware interface, can detect and control each other, and when a fault occurs a restart is initiated through the complex programmable logic device to complete fault recovery.

Description

Server fault recovery method, device, equipment and storage medium

Technical Field

The present invention relates to the field of server technologies, and in particular, to a method, an apparatus, a device, and a storage medium for server failure recovery.

Background

Currently, the server is the core of the entire network system and computing platform, and much important data is stored on servers. The BMC (Baseboard Management Controller) acts as the server's housekeeper: it uses sensors to monitor the status of a computer, network server, or other hardware-driven device, and it holds control over the server. The BMC can also control the refresh (flashing) of the BIOS (Basic Input Output System), which is the first stage of server startup and has the highest authority over the server. Ensuring that the BMC works stably is therefore equivalent to ensuring that the server works stably. If server faults are discovered and resolved only through manual monitoring and the administrator's own experience, the degree of automation is low, and manual handling of a server fault can easily cause a secondary fault.

In summary, how to automatically detect server faults and actively complete fault recovery is a technical problem to be solved in this field.

Disclosure of Invention

In view of the above, the present invention provides a method, an apparatus, a device, and a storage medium for server failure recovery, which can automatically check a failure of a server and actively complete failure recovery. The specific scheme is as follows:

In a first aspect, the present application discloses a server failure recovery method, including:

when a binding relationship between a first server and a second server is detected, monitoring a heartbeat message sending operation between a first baseboard management controller of the first server and a second baseboard management controller of the second server;

if it is monitored that the first server has not received response information for the heartbeat message, determining that the second server has a fault;

and detecting whether the intelligent platform management interface of the second server is normal, and if not, controlling a complex programmable logic device to restart the second server to complete fault recovery.

Optionally, the monitoring, when a binding relationship between the first server and the second server is detected, of a heartbeat message sending operation between the first baseboard management controller of the first server and the second baseboard management controller of the second server includes:

when a binding relationship between the first server and the second server is detected through a preset communication interface, controlling the first baseboard management controller and the second baseboard management controller to send heartbeat messages and/or receive response information for the heartbeat messages at preset time intervals;

and monitoring the heartbeat message sending operation and/or the response information receiving operation between the first baseboard management controller of the first server and the second baseboard management controller of the second server.

Optionally, the determining that the second server has a fault if it is monitored that the first server has not received response information for the heartbeat message includes:

if it is monitored that the first server has not received response information for the heartbeat message, determining that the second baseboard management controller of the second server has hung.

Optionally, after monitoring the heartbeat message sending operation between the first baseboard management controller of the first server and the second baseboard management controller of the second server, the method further includes:

if it is monitored that the second server has not received response information for the heartbeat message, determining that the first server has a fault;

correspondingly, the detecting whether the intelligent platform management interface of the second server is normal and, if not, controlling the complex programmable logic device to restart the second server to complete fault recovery includes:

detecting whether the intelligent platform management interface of the first server is normal, and if not, controlling the complex programmable logic device to restart the first server to complete fault recovery.

Optionally, the process of detecting whether the intelligent platform management interface of the second server is normal and, if not, controlling the complex programmable logic device to restart the second server to complete fault recovery further includes:

if the intelligent platform management interface of the second server is detected to be abnormal, recording a field log for use in analyzing the cause of the fault.

Optionally, after detecting whether the intelligent platform management interface of the second server is normal, the method further includes:

accessing the second server from the first server through a preset communication interface, and controlling the second server to restart to complete fault recovery.

Optionally, the controlling the complex programmable logic device to restart the second server to complete fault recovery includes:

controlling the complex programmable logic device through GPIO to restart the second server to complete fault recovery.

In a second aspect, the present application discloses a server failure recovery apparatus, including:

a message sending module, configured to monitor a heartbeat message sending operation between a first baseboard management controller of a first server and a second baseboard management controller of a second server when a binding relationship between the first server and the second server is detected;

a failure judgment module, configured to determine that the second server has a fault if it is monitored that the first server has not received response information for the heartbeat message;

and a fault recovery module, configured to detect whether the intelligent platform management interface of the second server is normal and, if it is abnormal, to control a complex programmable logic device to restart the second server to complete fault recovery.

In a third aspect, the present application discloses an electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the steps of the server failure recovery method disclosed in the foregoing.

In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program; wherein the computer program realizes the steps of the server failure recovery method disclosed in the foregoing when executed by a processor.

Therefore, the application discloses a server failure recovery method, comprising: when a binding relationship between a first server and a second server is detected, monitoring a heartbeat message sending operation between a first baseboard management controller of the first server and a second baseboard management controller of the second server; if it is monitored that the first server has not received response information for the heartbeat message, determining that the second server has a fault; and detecting whether the intelligent platform management interface of the second server is normal, and if not, controlling the complex programmable logic device to restart the second server to complete fault recovery. In other words, the method first detects whether a binding relationship exists between two nearby servers. If so, each server can detect the other, determine whether the other has a fault, and control the other to perform fault recovery. The two servers check each other by sending heartbeat messages; if one side receives no response to a heartbeat message, the other server has a fault, and the complex programmable logic device is immediately controlled to restart the faulty server. The two servers thus detect and control each other once connected through the hardware interface, and when a fault occurs a restart is initiated through the complex programmable logic device to complete fault recovery.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a flow chart of a server failure recovery method disclosed in the present application;

FIG. 2 is a flow chart of a particular server failure recovery method disclosed herein;

FIG. 3 is a flowchart of a method for controlling interaction between two servers according to the disclosure;

FIG. 4 is a schematic structural diagram of a server failure recovery apparatus disclosed in the present application;

FIG. 5 is a block diagram of an electronic device disclosed in the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

Currently, the server is the core of the entire network system and computing platform, and much important data is stored on servers. The BMC acts as the server's housekeeper: it uses sensors to monitor the status of a computer, network server, or other hardware-driven device, and it holds control over the server. The BMC can also control the refresh (flashing) of the BIOS, which is the first stage of server startup and has the highest authority over the server. Ensuring that the BMC works stably is therefore equivalent to ensuring that the server works stably. If server faults are discovered and resolved only through manual monitoring and the administrator's own experience, the degree of automation is low, and manual handling of a server fault can easily cause a secondary fault.

Therefore, the present application provides a server fault recovery scheme that can automatically detect server faults and actively complete fault recovery.

Referring to fig. 1, an embodiment of the present invention discloses a server failure recovery method, including:

Step S11: when a binding relationship between a first server and a second server is detected, monitoring the heartbeat message sending operation between a first baseboard management controller of the first server and a second baseboard management controller of the second server.

In this embodiment, when a binding relationship is detected between two nearby servers, heartbeat message monitoring is performed on the two bound servers. The monitored heartbeat messages may be sent by the first server to the second server, by the second server to the first server, or by both servers to each other; this is not specifically limited here.

In this embodiment, when a binding relationship between the first server and the second server is detected through a preset communication interface, the first baseboard management controller and the second baseboard management controller are controlled to send heartbeat messages and/or receive response information for the heartbeat messages at preset time intervals, and the heartbeat message sending operation and/or the response information receiving operation between the first baseboard management controller of the first server and the second baseboard management controller of the second server is monitored. For example, the two servers may communicate by sending heartbeat messages every 5 minutes. When it is detected that the specific hardware interfaces provided by the motherboards of the first server and the second server are being used as the preset communication interface, the first server and the second server have a binding relationship through that interface. After the binding relationship between the two nearby servers is determined, the BMC of the first server and the BMC of the second server are controlled to start the heartbeat message sending operation. In one embodiment, the first server is configured to send heartbeat messages to the second server, so the first server is the sender and the second server is the receiver: when the first server sends a heartbeat message to the second server through the preset communication interface, for example a serial port, the second server, if in a normal state, must return a response message for the current heartbeat message.

In another embodiment, the second server is configured to send heartbeat messages to the first server, so the second server is the sender and the first server is the receiver: when the second server sends a heartbeat message to the first server through the preset communication interface, for example a serial port, the first server, if in a normal state, must return a response message for the current heartbeat message.

In another embodiment, both the first server and the second server may be set as senders of heartbeat messages; correspondingly, each server also returns response messages for the heartbeat messages it receives.
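The three heartbeat configurations described above (the first server sends, the second server sends, or both send) can be sketched as follows. This is a minimal simulation under stated assumptions, not the implementation disclosed in the application: the `Bmc` class, the `heartbeat_round` function, the `mode` strings, and the message tuples are hypothetical names, and the serial link is replaced by a direct method call.

```python
class Bmc:
    """Hypothetical stand-in for one server's BMC."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy  # False models a hung BMC

    def on_heartbeat(self, message):
        """Return a response for a heartbeat, or None if this BMC is hung."""
        if not self.healthy:
            return None
        kind, sender = message
        if kind != "HEARTBEAT":
            return None
        return ("RESPONSE", self.name, sender)


def send_heartbeat(sender, receiver):
    """Send one heartbeat and report whether the sender got a response."""
    reply = receiver.on_heartbeat(("HEARTBEAT", sender.name))
    return reply is not None


def heartbeat_round(first, second, mode="both"):
    """Run one round; map each active sender's name to whether it was answered."""
    pairs = []
    if mode in ("first_sends", "both"):
        pairs.append((first, second))
    if mode in ("second_sends", "both"):
        pairs.append((second, first))
    results = {}
    for sender, receiver in pairs:
        if sender.healthy:  # a hung BMC cannot send at all
            results[sender.name] = send_heartbeat(sender, receiver)
    return results
```

In a real deployment such a round would be repeated at the preset interval (the text gives every 5 minutes as an example), and a `False` entry would feed the fault judgment of step S12.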

Step S12: if it is monitored that the first server has not received response information for the heartbeat message, determining that the second server has a fault.

In this embodiment, if it is monitored that the first server has not received response information for the heartbeat message, it is determined that the second baseboard management controller of the second server has hung. Specifically, after the first server sends a heartbeat message to the second server over the serial port, the first server waits for the second server's response. If the response is received within a preset time period, the second server has no fault; if the response is not received within the preset time period, the second server has a fault.

In this embodiment, if it is monitored that the second server has not received response information for the heartbeat message, it is determined that the first server has a fault. That is, when the first server is the heartbeat responder and the second server is the heartbeat sender, if the second server sends a heartbeat message, waits for the preset time, and still has not received the first server's response to that heartbeat message, the first server has a fault.

In this embodiment, if the server that responds to heartbeat messages has not received a heartbeat message from the bound server for a long time, it may also determine that the other server may currently have a fault.
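The two timeout rules above (a sender that gets no response within the preset period, and a responder that sees no heartbeat for a long time) can be sketched as a small monitor object. The class name `HeartbeatMonitor`, its method names, and both threshold parameters are assumptions made for illustration; the source does not give concrete timeout values.

```python
import time


class HeartbeatMonitor:
    """Tracks one outstanding heartbeat and the last heartbeat seen from the peer."""

    def __init__(self, response_timeout_s, silence_limit_s, clock=time.monotonic):
        self.response_timeout_s = response_timeout_s  # hypothetical preset period
        self.silence_limit_s = silence_limit_s        # hypothetical "long time"
        self.clock = clock
        self.pending_since = None            # when the unanswered heartbeat was sent
        self.last_peer_heartbeat = clock()   # when the peer last sent a heartbeat

    def heartbeat_sent(self):
        # Keep the oldest unanswered send time so repeated sends do not reset it.
        if self.pending_since is None:
            self.pending_since = self.clock()

    def response_received(self):
        self.pending_since = None

    def peer_heartbeat_received(self):
        self.last_peer_heartbeat = self.clock()

    def response_overdue(self):
        """Sender side: no response arrived within the preset period."""
        if self.pending_since is None:
            return False
        return self.clock() - self.pending_since >= self.response_timeout_s

    def peer_silent(self):
        """Responder side: the peer has sent no heartbeat for too long."""
        return self.clock() - self.last_peer_heartbeat >= self.silence_limit_s
```

Injecting the clock keeps the sketch testable without sleeping; on a real BMC the default `time.monotonic` (or a hardware timer) would be used instead.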

Step S13: detecting whether the intelligent platform management interface of the second server is normal, and if not, controlling the complex programmable logic device to restart the second server to complete fault recovery.

In this embodiment, the complex programmable logic device is controlled through GPIO to restart the second server, thereby completing fault recovery. Specifically, whether the IPMI (Intelligent Platform Management Interface) of the second server is normal is detected over the network; if not, the CPLD of the second server is controlled through GPIO (general-purpose input/output) to restart the second server and recover it from the hang.

In this embodiment, when the first server is the faulty one, whether the intelligent platform management interface of the first server is normal is detected, and if not, the complex programmable logic device is controlled to restart the first server to complete fault recovery. Specifically, whether the IPMI of the first server is normal is detected over the network; if not, the CPLD of the first server is controlled through GPIO to restart the first server and recover it from the hang.
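The decision in step S13 can be sketched as a function that takes the IPMI check and the GPIO action as injected callables. Everything named here is hypothetical: `recover_peer`, `pulse_reset`, and `FakeGpioLine` are illustrative, the active-low reset polarity is an assumption, and the source does not say what happens in this step when the IPMI check passes, so the sketch simply reports that case.

```python
class FakeGpioLine:
    """Records level changes instead of driving a real pin (simulation only)."""

    def __init__(self):
        self.levels = []

    def set(self, level):
        self.levels.append(level)


def pulse_reset(line, active_level=0):
    """Assert and release the reset request to the CPLD.

    The active-low default is an assumption; real boards define their own polarity
    and pulse width.
    """
    line.set(active_level)
    line.set(1 - active_level)


def recover_peer(ipmi_is_normal, request_cpld_restart):
    """Check the peer's IPMI; if it is abnormal, ask the CPLD to restart the peer."""
    if ipmi_is_normal():
        return "ipmi_normal"          # no CPLD restart is triggered in this sketch
    request_cpld_restart()
    return "restarted_via_cpld"
```

A caller would pass a network probe of the peer's IPMI as `ipmi_is_normal` and a closure over the actual GPIO line as `request_cpld_restart`.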

Therefore, the application discloses a server failure recovery method, comprising: when a binding relationship between a first server and a second server is detected, monitoring a heartbeat message sending operation between a first baseboard management controller of the first server and a second baseboard management controller of the second server; if it is monitored that the first server has not received response information for the heartbeat message, determining that the second server has a fault; and detecting whether the intelligent platform management interface of the second server is normal, and if not, controlling the complex programmable logic device to restart the second server to complete fault recovery. In other words, the method first detects whether a binding relationship exists between two nearby servers. If so, each server can detect the other, determine whether the other has a fault, and control the other to perform fault recovery. The two servers check each other by sending heartbeat messages; if one side receives no response to a heartbeat message, the other server has a fault, and the complex programmable logic device is immediately controlled to restart the faulty server. The two servers thus detect and control each other once connected through the hardware interface, and when a fault occurs a restart is initiated through the complex programmable logic device to complete fault recovery.

Referring to FIG. 2, an embodiment of the present invention discloses a specific server failure recovery method. Compared with the previous embodiment, this embodiment further describes and optimizes the technical solution. Specifically, the method comprises the following steps:

Step S21: when a binding relationship between a first server and a second server is detected, monitoring the heartbeat message sending operation between a first baseboard management controller of the first server and a second baseboard management controller of the second server.

Step S22: if it is monitored that the first server has not received response information for the heartbeat message, determining that the second server has a fault.

For a more detailed processing procedure in steps S21 and S22, please refer to the content of the foregoing disclosed embodiments, which is not described herein again.

Step S23: detecting whether the intelligent platform management interface of the second server is normal, and if it is detected to be abnormal, recording a field log for use in analyzing the cause of the fault.

In this embodiment, if the intelligent platform management interface of the second server is detected to be abnormal, the abnormality is recorded and reported to the web control end, and the crashed side records a field log for use in analyzing the cause of the crash.
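As an illustration of what such a field log entry might contain, the sketch below builds one record and serializes it as a single JSON line. The field names and the functions `build_field_log` and `format_field_log` are assumptions; the source does not specify a log format or how the record is reported to the web control end.

```python
import json
from datetime import datetime, timezone


def build_field_log(peer_name, heartbeat_answered, ipmi_normal, when=None):
    """Collect the facts known at the moment the fault is detected."""
    when = when or datetime.now(timezone.utc)
    return {
        "time": when.isoformat(),
        "peer": peer_name,                      # hypothetical server identifier
        "heartbeat_answered": heartbeat_answered,
        "ipmi_normal": ipmi_normal,
    }


def format_field_log(record):
    """Serialize one record as a JSON line suitable for appending to a log file."""
    return json.dumps(record, sort_keys=True) + "\n"
```

Writing one self-describing line per event keeps the log easy to append to from a constrained BMC environment and easy to parse later when the fault is analyzed.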

Step S24: accessing the second server from the first server through a preset communication interface, and controlling the second server to restart to complete fault recovery.

In this embodiment, the first server accesses the second server through the preset communication interface and controls the second server to restart. When the IPMI main process or the network of one of the servers is abnormal, that server can be accessed and controlled indirectly through the other server. In this way, relying only on the hardware communication interface established between the first server and the second server, the server in the normal state controls the abnormal server to perform a fault recovery operation such as a restart. The invention is described using the server architecture of an Intel platform, but the method is not limited to servers of the Intel platform; it is generally applicable to servers of other platforms and to other computer platforms, and is not specifically limited here.

Referring to FIG. 3, the two server motherboards first provide a specific hardware interface, such as a serial port, through which the two servers communicate. The hardware involved includes: two GPIOs, used to restart the BMC by controlling the CPLD after the BMC has hung; two Reset buttons, used to restart a server without powering it off when the server hangs; and two UARTs (Universal Asynchronous Receiver/Transmitter), a UART being a chip with parallel input and serial output that is usually integrated on the motherboard, most commonly a 16550AFN chip.

The flow is as follows. First, a server A is determined, and it is monitored whether a server bound to server A exists nearby. If a binding relationship between server B and server A is detected, the heartbeat message sending operation between server A and server B is monitored: when server A sends a heartbeat message, server B returns response information for that heartbeat message to server A, and both the sending operation and the returned response between the two servers are monitored. During monitoring, the check may be set to run at preset time intervals. If, during some monitoring pass, it is found that server A has not received the response information from server B, it is determined that server B has a fault. At this point, whether the GPIO of server B is normal is detected, and if not, the CPLD can be controlled to reset the BMC of server B; alternatively, if the IPMI main process of server B is abnormal, server A can access and control server B through the hardware interface, so that the BMC of server B returns to normal.
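The server A / server B flow can be sketched end to end as a single monitoring pass. This is one possible reading under stated assumptions, not the disclosed implementation: the source names two recovery paths (a CPLD reset driven over GPIO, and a restart requested over the hardware interface) without fully specifying how to choose between them, so the sketch assumes the CPLD path when the peer's IPMI is abnormal and the interface path otherwise. `Server`, `supervise_once`, and all field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical model of one server in the bound pair."""
    name: str
    bmc_responding: bool = True   # whether the BMC answers heartbeats
    ipmi_normal: bool = True
    events: list = field(default_factory=list)

    def reset_bmc(self, via):
        """Model a BMC reset; afterwards the BMC and IPMI respond again."""
        self.events.append(f"bmc_reset_via_{via}")
        self.bmc_responding = True
        self.ipmi_normal = True


def supervise_once(local, peer, bound, field_log):
    """One monitoring pass run by `local` (server A) against `peer` (server B)."""
    if not bound:
        return "not_bound"            # no binding detected, nothing to monitor
    if peer.bmc_responding:
        return "healthy"              # heartbeat answered
    # No response to the heartbeat: the peer is judged faulty.
    if not peer.ipmi_normal:
        # Record the scene before acting, then restart through the CPLD.
        field_log.append(
            {"reporter": local.name, "peer": peer.name, "ipmi_normal": False}
        )
        peer.reset_bmc(via="cpld_gpio")
        return "restarted_via_cpld"
    # Assumed fallback: request the restart over the hardware interface.
    peer.reset_bmc(via="hardware_interface")
    return "restarted_via_interface"
```

Calling `supervise_once` from both servers, each treating the other as `peer`, gives the mutual detection and control that the text describes.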

Therefore, this fault recovery and protection mechanism, in which the BMCs of two nearby servers control and detect each other, makes the servers more robust and their service more stable, further guarantees the stability of current big data services, and collects the current fault log immediately when a fault occurs so that the cause of the fault can be analyzed.

Referring to FIG. 4, an embodiment of the present invention discloses a specific server failure recovery apparatus, including:

a message sending module 11, configured to monitor a heartbeat message sending operation between a first baseboard management controller of a first server and a second baseboard management controller of a second server when a binding relationship between the first server and the second server is detected;

a failure determining module 12, configured to determine that the second server has a fault if it is monitored that the first server has not received response information for the heartbeat message;

and a fault recovery module 13, configured to detect whether the intelligent platform management interface of the second server is normal, and if it is not normal, to control the complex programmable logic device to restart the second server to complete fault recovery.

The message sending module 11 is specifically configured to perform heartbeat message monitoring on two nearby servers when a binding relationship is detected between them. The monitored heartbeat messages may be sent by the first server to the second server, by the second server to the first server, or by both servers to each other; this is not specifically limited here. When a binding relationship between the first server and the second server is detected through a preset communication interface, the first baseboard management controller and the second baseboard management controller are controlled to send heartbeat messages and/or receive response information for the heartbeat messages at preset time intervals, and the heartbeat message sending operation and/or the response information receiving operation between the first baseboard management controller of the first server and the second baseboard management controller of the second server is monitored. For example, the two servers may communicate by sending heartbeat messages every 5 minutes. When it is detected that the specific hardware interfaces provided by the motherboards of the first server and the second server are being used as the preset communication interface, the first server and the second server have a binding relationship through that interface.

After the binding relationship between the two nearby servers is determined, the BMC of the first server and the BMC of the second server are controlled to start the heartbeat message sending operation. In one embodiment, the first server is configured to send heartbeat messages to the second server, so the first server is the sender and the second server is the receiver: when the first server sends a heartbeat message to the second server through the preset communication interface, for example a serial port, the second server, if in a normal state, must return a response message for the current heartbeat message. In another embodiment, the second server is configured to send heartbeat messages to the first server, so the second server is the sender and the first server is the receiver: when the second server sends a heartbeat message to the first server through the preset communication interface, for example a serial port, the first server, if in a normal state, must return a response message for the current heartbeat message. In another embodiment, both the first server and the second server may be set as senders of heartbeat messages; correspondingly, each server also returns response messages for the heartbeat messages it receives.

The failure recovery module 13 is specifically configured so that, in the process of the first server accessing the second server through the preset communication interface and controlling the second server to restart, when the IPMI main process or the network of one of the servers is abnormal, that server can be accessed and controlled indirectly through the other server. In this way, relying only on the hardware communication interface established between the first server and the second server, the server in the normal state controls the abnormal server to perform a fault recovery operation such as a restart. The invention is described using the server architecture of an Intel platform, but the method is not limited to servers of the Intel platform; it is generally applicable to servers of other platforms and to other computer platforms, and is not specifically limited here. The two server motherboards first provide a specific hardware interface, such as a serial port, through which the two servers communicate. The hardware involved includes: two GPIOs, used to restart the BMC by controlling the CPLD after the BMC has hung; two Reset buttons, used to restart a server without powering it off when the server hangs; and two UARTs, a UART being a chip with parallel input and serial output that is usually integrated on the motherboard, most commonly a 16550AFN chip.

The flow is as follows. First, a server A is determined, and it is monitored whether a server bound to server A exists nearby. If a binding relationship between server B and server A is detected, the heartbeat message sending operation between server A and server B is monitored: when server A sends a heartbeat message, server B returns response information for that heartbeat message to server A, and both the sending operation and the returned response between the two servers are monitored. During monitoring, the check may be run once at each preset time interval. If, during some monitoring pass, it is found that server A has not received the response information from server B, it is determined that server B has a fault. At this point, whether the GPIO of server B is normal is detected, and if not, the CPLD can be controlled to reset the BMC of server B; alternatively, if the IPMI main process of server B is abnormal, server A can access and control server B through the hardware interface, so that the BMC of server B returns to normal.

In some specific embodiments, the message sending module 11 may specifically include:

the message sending unit is used for controlling the first platform management controller and the second platform management controller to send heartbeat messages and/or receive response information for the heartbeat messages at preset time intervals when a binding relationship between the first server and the second server is detected through a preset communication interface.
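The periodic exchange performed by the message sending unit might look like the following Python sketch. The names `monitor_heartbeat` and `exchange`, the round limit, and the injectable `sleep` function are all assumptions introduced for illustration; the description only states that heartbeat messages are sent and answered at preset time intervals.

```python
import time
from typing import Callable, Optional


def monitor_heartbeat(
    exchange: Callable[[], bool],
    interval_s: float,
    max_rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Run heartbeat rounds at a preset interval.

    `exchange` is a hypothetical callback that sends one heartbeat message
    over the preset communication interface and returns True if response
    information was received. The function returns the index of the first
    round with no response, or None if every round was answered.
    """
    for round_no in range(max_rounds):
        if not exchange():
            return round_no   # missing response: the peer is judged faulty
        sleep(interval_s)     # wait for the preset time interval
    return None


if __name__ == "__main__":
    # Simulated peer that answers twice and then stops responding.
    replies = iter([True, True, False])
    failed_round = monitor_heartbeat(lambda: next(replies), interval_s=0.01, max_rounds=10)
    print("first unanswered round:", failed_round)
```

Passing `sleep` as a parameter lets the loop be exercised in tests without waiting for real time to elapse.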

In some specific embodiments, the fault determining module 12 may specifically include:

and the fault judging unit is used for judging that a hang-up fault occurs in a second platform management controller of the second server if the first server is monitored not to receive the response information of the heartbeat message.

In some specific embodiments, the server failure recovery apparatus may specifically include:

a failure determining unit, configured to determine that a failure occurs in the first server if it is monitored that the second server does not receive response information of the heartbeat message;

correspondingly, the detecting whether the intelligent platform management interface of the second server is normal or not, if not, controlling the complex programmable logic device to restart the second server to complete fault recovery includes: and detecting whether the intelligent platform management interface of the first server is normal, and if not, controlling the complex programmable logic device to restart the first server to complete fault recovery.

and the log recording unit is used for recording a field log if the intelligent platform management interface of the second server is detected to be abnormal so as to analyze the fault reason.

and the access restarting unit is used for accessing the second server by utilizing the first server through a preset communication interface, controlling the second server to restart and completing fault recovery.
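The units listed above can be read together as a symmetric procedure: whichever server misses a response treats the other as faulty, records a field log if the faulty server's intelligent platform management interface is abnormal, and then restarts it. The Python sketch below illustrates that reading. The names `faulty_server` and `recover`, the string labels for the servers, and the log message format are hypothetical, and the list used as a field log is only a stand-in for whatever log store an implementation would use.

```python
from typing import Callable, List, Optional

# Each server's bound peer; the roles are interchangeable.
PEER = {"first": "second", "second": "first"}


def faulty_server(observer: str, got_reply: bool) -> Optional[str]:
    """Return the server judged faulty from the observer's point of view."""
    return None if got_reply else PEER[observer]


def recover(
    observer: str,
    got_reply: bool,
    ipmi_ok: bool,
    field_log: List[str],
    restart: Callable[[str], None],
) -> bool:
    """Log and restart the faulty server; return True if a restart was issued.

    `ipmi_ok` is the result of checking the faulty server's intelligent
    platform management interface, and `restart` is a hypothetical callback
    that restarts the named server.
    """
    target = faulty_server(observer, got_reply)
    if target is None or ipmi_ok:
        return False
    # Record the field log first so the fault reason can be analyzed later.
    field_log.append(f"{target} server: IPMI abnormal, restart requested by {observer} server")
    restart(target)
    return True


if __name__ == "__main__":
    log: List[str] = []
    recover("first", got_reply=False, ipmi_ok=False, field_log=log, restart=print)
    print(log)
```

Writing the log entry before issuing the restart is a design choice in this sketch, so that the record survives even if the restart itself fails.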

In some specific embodiments, the failure recovery module 13 may specifically include:

and the fault recovery unit is used for controlling the complex programmable logic device to restart the second server through the GPIO to complete fault recovery.
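As an illustration of driving a restart through a GPIO, the sketch below pulses a simulated line that a CPLD could interpret as a restart request. Everything here is an assumption for demonstration purposes: the description does not specify the signal polarity, the hold time, or any software interface, so `SimulatedGpioLine`, `pulse_reset`, the active-low convention, and the 0.1 second default are hypothetical.

```python
import time
from typing import Callable, List


class SimulatedGpioLine:
    """Stand-in for a GPIO line wired to the CPLD (no real hardware access)."""

    def __init__(self) -> None:
        self.level = 1                 # assumption: the line idles high
        self.history: List[int] = []   # every level written, in order

    def set(self, level: int) -> None:
        self.level = level
        self.history.append(level)


def pulse_reset(
    line: SimulatedGpioLine,
    hold_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Assert the line low for `hold_s` seconds, then release it.

    Assumption: the CPLD treats a low pulse as a restart request.
    """
    line.set(0)
    try:
        sleep(hold_s)
    finally:
        line.set(1)  # always release the line, even if the wait is interrupted


if __name__ == "__main__":
    gpio = SimulatedGpioLine()
    pulse_reset(gpio, hold_s=0.01)
    print(gpio.history)
```

Releasing the line in a `finally` clause avoids leaving the restart signal asserted if the wait is interrupted.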

Further, an embodiment of the present application also discloses an electronic device. Fig. 5 is a block diagram of the electronic device 20 according to an exemplary embodiment, and its content should not be construed as limiting the scope of the application.

Fig. 5 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input output interface 25, and a communication bus 26. Wherein, the memory 22 is used for storing a computer program, and the computer program is loaded and executed by the processor 21 to implement the relevant steps in the server failure recovery method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.

In this embodiment, the power supply 23 is configured to provide an operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and a communication protocol followed by the communication interface is any communication protocol applicable to the technical solution of the present application, and is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or output data to the outside, and a specific interface type thereof may be selected according to specific application requirements, which is not specifically limited herein.

The processor 21 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 21 may be implemented in at least one hardware form among a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor: the main processor, also called a Central Processing Unit (CPU), processes data in the awake state, while the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.

In addition, the memory 22 serves as a carrier for resource storage and may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like. The resources stored on it may include an operating system 221, a computer program 222, and so on, and the storage may be transient or permanent.

The operating system 221 manages and controls each hardware device and the computer program 222 on the electronic device 20, so that the processor 21 can operate on and process the large volume of data 223 in the memory 22; it may be Windows Server, NetWare, Unix, Linux, or the like. In addition to the computer program that performs the server failure recovery method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs that perform other specific tasks. The data 223 may include data received by the electronic device from an external device, and may also include data collected by the input/output interface 25 itself.

Further, the present application also discloses a computer-readable storage medium for storing a computer program; wherein the computer program when executed by a processor implements the server failure recovery method disclosed above. For the specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, which are not described herein again.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be understood by reference to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.

Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.

The server failure recovery method, apparatus, device, and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method and core idea of the present invention. Meanwhile, a person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A method for server failure recovery, comprising:

and detecting whether the intelligent platform management interface of the second server is normal, and if not, controlling the complex programmable logic device to restart the second server to complete fault recovery.

2. The method for recovering server failure according to claim 1, wherein the monitoring a heartbeat messaging operation between a first platform management controller of a first server and a second platform management controller of a second server when detecting that a binding relationship exists between the first server and the second server includes:

3. The method for recovering server failure according to claim 1, wherein if it is monitored that the first server does not receive the response information of the heartbeat message, determining that the second server has a failure problem comprises:

4. The method of claim 1, wherein after monitoring the heartbeat messaging operation between the first platform management controller of the first server and the second platform management controller of the second server, the method further comprises:

5. The method for recovering server failure according to claim 1, wherein the detecting whether the intelligent platform management interface of the second server is normal or not, and if not, controlling the complex programmable logic device to restart the second server, and completing the failure recovery process further comprises:

6. The method for recovering server failure according to any one of claims 1 to 5, wherein after detecting whether the intelligent platform management interface of the second server is normal, the method further includes:

7. The server failure recovery method of claim 1, wherein controlling the complex programmable logic device to restart the second server to complete failure recovery comprises:

and controlling the complex programmable logic device to restart the second server through the GPIO to complete fault recovery.

8. A server failure recovery apparatus, comprising:

the message sending module is used for monitoring heartbeat message sending operation between a first platform management controller of the first server and a second platform management controller of the second server when the binding relationship exists between the first server and the second server;

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for executing the computer program to implement the steps of the server failure recovery method according to any of claims 1 to 7.

10. A computer-readable storage medium for storing a computer program; wherein the computer program when executed by a processor implements the steps of the server failure recovery method of any one of claims 1 to 7.