CN114189429A - System, method, device and medium for monitoring server cluster faults

Info

Publication number: CN114189429A
Application number: CN202111415524.3A
Authority: CN
Inventors: 苏康; 郭芬; 满宏涛; 李拓
Original assignee: Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Current assignee: Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority date: 2021-11-25
Filing date: 2021-11-25
Publication date: 2022-03-15

Abstract

The application discloses a monitoring system, a method, a device and a medium for server cluster faults, wherein the monitoring system comprises an active server and a standby server; the active server BMC chip is in communication connection with the standby server BMC chip; the active server BMC chip comprises a first private memory, and the standby server BMC chip comprises a first shared memory; the active server BMC chip is used for writing the data information of the active server into the first private memory and sending the data information to the standby server BMC chip; the standby server BMC chip is used for writing the data information into the first shared memory so as to judge whether the active server fails according to the data information. By the interconnection between the active server BMC chip and the standby server BMC chip, the standby server can monitor the real-time fault of the active server, the fault transfer time is reduced, the fault-tolerant capability of a server cluster is enhanced, and the loss caused by the fault of the active server is reduced.

Description

System, method, device and medium for monitoring server cluster faults

Technical Field

The present application relates to the field of big data processing technologies, and in particular, to a method, a system, an apparatus, and a medium for monitoring a server cluster fault.

Background

With the growth of services and the continuous accumulation of data, a single high-performance server can no longer handle large data volumes and highly concurrent user access. Moreover, the fault tolerance of a single server is very limited: when the server fails, forced service interruption, data loss and other losses can occur. Server clusters emerged to improve the overall computing capacity and fault tolerance of servers. A server cluster can use multiple computers to perform parallel computation and thus obtain high computing speed, and can also use multiple computers as backups so that the whole system still operates normally when any one server fails. Currently, failover clusters are designed for applications with long-running in-memory state or large, frequently updated data state; typical applications include file servers, print servers and database servers. They are mainly used to build high-availability architectures. Multiple cluster servers (called nodes) are connected by physical cables and software, and if one node fails, another node can take over and start providing service through a failover process.

The first step in the failover process is to determine that the active server is no longer functioning properly. Typically, the system uses a heartbeat mechanism for this: either the active server sends a specified signal to the standby server at defined intervals, or the standby server sends a request to the active server and waits for a response. Detecting a failure with a heartbeat mechanism takes a certain time interval, and to be sure that the active server has indeed failed, the standby server may need to set a longer interval while waiting for the signal or response. Moreover, when some hardware parameters of the active server (such as fan speed or chassis temperature) exceed a threshold value, the system can still operate normally for a period of time and the CPU does not learn of the fault immediately; during this period the standby server still receives normal signals from the active server and cannot accurately grasp its operating condition in time.
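
The two weaknesses of a heartbeat mechanism can be illustrated with a minimal sketch. The class name `HeartbeatMonitor`, the 3-second timeout and the timestamps are illustrative assumptions, not values from the application.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Standby-side heartbeat detector (hypothetical, for illustration only)."""
    timeout_s: float          # assumed waiting time before declaring a failure
    last_beat_s: float = 0.0  # arrival time of the most recent heartbeat

    def on_heartbeat(self, now_s: float) -> None:
        self.last_beat_s = now_s

    def is_failed(self, now_s: float) -> bool:
        # A failure is declared only after the timeout has fully elapsed.
        return (now_s - self.last_beat_s) > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=3.0)

# Case 1: a fan has stalled, but the operating system keeps sending heartbeats,
# so the standby server sees nothing wrong.
monitor.on_heartbeat(now_s=10.0)
monitor.on_heartbeat(now_s=11.0)
degraded_but_beating = monitor.is_failed(now_s=12.0)

# Case 2: the active server crashes right after the beat at t=11.0.
# The standby server cannot react until the timeout has passed.
still_undetected = monitor.is_failed(now_s=13.9)
finally_detected = monitor.is_failed(now_s=14.5)
```

In this model the detection delay after a crash is bounded below by `timeout_s`, and a hardware parameter that drifts out of range is never detected as long as heartbeats keep arriving.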

Therefore, how to improve the timeliness of server cluster fault monitoring to effectively reduce the loss caused by the fault of the active server is an urgent problem to be solved by those skilled in the art.

Disclosure of Invention

The application aims to provide a system, a method, a device and a medium for monitoring server cluster faults, which are used for improving the timeliness of server cluster fault monitoring so as to effectively reduce the loss caused by the faults of active servers.

In order to solve the technical problem, the application provides a server cluster fault monitoring system, which comprises an active server and a standby server;

the active server BMC chip is in communication connection with the standby server BMC chip;

the active server BMC chip comprises a first private memory, and the standby server BMC chip comprises a first shared memory;

the active server BMC chip is used for writing the data information of the active server into the first private memory and simultaneously sending the data information to the standby server BMC chip;

the standby server BMC chip is used for writing the data information into the first shared memory so as to read the data information in the first shared memory in real time and judge whether the active server fails according to the data information.
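
The data path described above can be sketched as follows. This is a minimal in-process model, not firmware: the class names `ActiveBmc` and `StandbyBmc`, the dictionary-based "memories" and the sample readings are all assumptions made for illustration, and the communication link is replaced by a direct method call.

```python
from dataclasses import dataclass, field
from typing import Dict

Reading = Dict[str, float]  # assumed representation of the "data information"


@dataclass
class StandbyBmc:
    """Models the standby server BMC chip with its first shared memory."""
    first_shared_memory: Reading = field(default_factory=dict)

    def receive(self, data: Reading) -> None:
        # Write the received data information into the first shared memory.
        self.first_shared_memory = dict(data)


@dataclass
class ActiveBmc:
    """Models the active server BMC chip with its first private memory."""
    peer: StandbyBmc
    first_private_memory: Reading = field(default_factory=dict)

    def publish(self, data: Reading) -> None:
        self.first_private_memory = dict(data)  # local record
        self.peer.receive(data)                 # send to the standby BMC chip


standby = StandbyBmc()
active = ActiveBmc(peer=standby)
active.publish({"fan_rpm": 5200.0, "cpu_temp_c": 61.0})
```

Each side keeps its own copy, so the standby BMC chip can evaluate the data from its shared memory without querying the active server again.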

Preferably, the active server BMC chip further includes a second shared memory, and the standby server BMC chip further includes a second private memory.

The application also provides a method for monitoring the server cluster fault, which is applied to the active server BMC chip and comprises the following steps:

acquiring data information of an active server;

and writing the data information into a first private memory, and simultaneously sending the data information to a standby server BMC chip, so that the standby server BMC chip can write the data information into a first shared memory, and read the data information in the first shared memory in real time, so as to judge whether the active server fails according to the data information.

The application also provides a method for monitoring the server cluster fault, which is applied to the standby server BMC chip and comprises the following steps:

when an active server BMC chip acquires data information of an active server and writes the data information into a first private memory, receiving the data information sent by the active server BMC chip;

writing the data information into a first shared memory;

and reading the data information in the first shared memory in real time so as to judge whether the active server fails according to the data information.

Preferably, the determining whether the active server fails according to the data information includes:

judging whether the data information meets a preset requirement or not;

and if not, determining that the active server fails.

Preferably, after determining that the active server fails, the method further includes:

an alert prompt is sent to a CPU of a standby server so that the CPU initiates a state synchronization mechanism to take over tasks performed by the active server.
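
A minimal sketch of the judge-then-alert sequence is given below. The `StandbyCpu` class, its `on_alert` method and the 85 °C limit are hypothetical; the state synchronization mechanism itself is not specified in the application and is reduced here to a role change.

```python
from typing import Callable, Dict, List

Reading = Dict[str, float]


class StandbyCpu:
    """Stand-in for the standby server's CPU (hypothetical interface)."""

    def __init__(self) -> None:
        self.role = "standby"
        self.alerts: List[str] = []

    def on_alert(self, reason: str) -> None:
        # Placeholder for the state synchronization mechanism:
        # this sketch only records the alert and flips the role.
        self.alerts.append(reason)
        self.role = "active"


def judge_and_alert(data: Reading,
                    meets_requirement: Callable[[Reading], bool],
                    cpu: StandbyCpu) -> bool:
    """Return True if the active server is judged to have failed."""
    if meets_requirement(data):
        return False
    cpu.on_alert(f"active server data out of range: {data}")
    return True


def requirement(data: Reading) -> bool:
    # Assumed preset requirement: CPU temperature at or below 85 degrees C.
    return data["cpu_temp_c"] <= 85.0


cpu = StandbyCpu()
first_failed = judge_and_alert({"cpu_temp_c": 60.0}, requirement, cpu)
second_failed = judge_and_alert({"cpu_temp_c": 97.0}, requirement, cpu)
```

Passing the preset requirement in as a function keeps the judging step independent of which hardware parameters a particular deployment monitors.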

The present application further provides a monitoring device for server cluster faults, including:

the receiving module is used for receiving the data information sent by the BMC chip of the active server when the BMC chip of the active server acquires the data information of the active server and writes the data information into a first private memory;

the write-in module is used for writing the data information into a first shared memory;

and the judging module is used for reading the data information in the first shared memory in real time so as to judge whether the active server fails according to the data information.

Preferably, the device further includes:

and the alarm module is used for sending an alarm prompt to the CPU of the standby server so that the CPU starts a state synchronization mechanism to take over the task executed by the active server.

The application also provides a device for monitoring server cluster faults, which comprises a memory for storing a computer program;

and the processor is used for realizing the steps of the monitoring method for the server cluster faults when executing the computer program.

The present application further provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method for monitoring server cluster faults.

The server cluster fault monitoring system comprises an active server and a standby server; the active server BMC chip is in communication connection with the standby server BMC chip; the active server BMC chip comprises a first private memory, and the standby server BMC chip comprises a first shared memory; the active server BMC chip is used for writing the data information of the active server into the first private memory and sending the data information to the standby server BMC chip; the standby server BMC chip is used for writing the data information into the first shared memory so as to read the data information in the first shared memory in real time and judge whether the active server fails according to the data information. According to the method and the system, real-time fault monitoring of the active server by the standby server is realized through interconnection between the BMC chip of the active server and the BMC chip of the standby server, so that the fault transfer time is reduced, the fault-tolerant capability of a server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

The method, the device and the medium for monitoring the server cluster faults correspond to the system, and the effect is as above.

Drawings

In order to more clearly illustrate the embodiments of the present application, the drawings needed for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.

Fig. 1 is a structural diagram of a monitoring system for server cluster faults according to an embodiment of the present application;

Fig. 2 is a flowchart of a method for monitoring a server cluster fault according to an embodiment of the present application;

Fig. 3 is a structural diagram of a monitoring device for server cluster faults according to an embodiment of the present application;

Fig. 4 is a structural diagram of another monitoring device for server cluster faults according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.

The core of the application is to provide a system, a method, a device and a medium for monitoring server cluster faults.

In order that those skilled in the art will better understand the disclosure, the following detailed description will be given with reference to the accompanying drawings. In addition, the present application relates generally to a Baseboard Management Controller (BMC), which can monitor the temperature, voltage, fan, power supply, etc. of the system; the BMC is also responsible for recording information and log records of various hardware, and is used for prompting a user and positioning subsequent problems, and of course, the BMC also has other functions, which are not listed.

Fig. 1 is a structural diagram of a monitoring system for a server cluster fault according to an embodiment of the present disclosure. As shown in fig. 1, the monitoring system for server cluster faults includes an active server 1 and a standby server 2, the active server 1 is provided with an active server BMC chip 3, the standby server 2 is provided with a standby server BMC chip 4, the active server BMC chip 3 includes a first private memory 5 and a second shared memory 6, and the standby server BMC chip 4 includes a second private memory 7 and a first shared memory 8. The active server BMC chip 3 and the standby server BMC chip 4 are in communication connection; the active server BMC chip 3 is used for writing the data information of the active server 1 into the first private memory 5 and sending the data information to the standby server BMC chip 4; the standby server BMC chip 4 is configured to write data information into the first shared memory 8, so as to read the data information in the first shared memory 8 in real time, and determine whether the active server 1 fails according to the data information.

In this embodiment, the active server BMC chip 3 and the standby server BMC chip 4 are in communication connection, which may be wired or wireless; the connection mode is not specifically limited in this embodiment. The active server BMC chip 3 refers to the BMC chip provided in the active server 1, and the standby server BMC chip 4 refers to the BMC chip provided in the standby server 2. The data information of the active server 1 in this embodiment may be hardware information of the active server 1 monitored by the active server BMC chip 3, including the fan rotation speed, the Central Processing Unit (CPU) temperature, the power supply condition, and the like. The active server BMC chip 3 writes the data information into its own first private memory 5 for recording, and at the same time the data information is written into the first shared memory 8 of the standby server BMC chip 4. The standby server BMC chip 4 reads the data information in the first shared memory 8 in real time; once a hardware parameter in the data information is found to exceed a threshold value, it determines that the active server 1 has failed and sends an alert to the CPU of the standby server 2. The CPU immediately starts a state synchronization mechanism to take over the tasks executed by the active server 1, so that the failover process is completed at the earliest possible time.

Similarly, when the standby server 2 operates, the standby server BMC chip 4 writes the monitored data information of the standby server 2 into its own second private memory 7 and simultaneously sends that data information to the active server 1. In this case the active server 1 takes on the monitoring role: the active server BMC chip 3 writes the data information of the standby server 2 into its own second shared memory 6 and reads it in real time to determine whether the standby server 2 has failed. If the standby server 2 fails, the CPU of the active server 1 immediately starts a state synchronization mechanism to take over the tasks executed by the standby server 2, thereby completing the failover process at the earliest possible time.
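
The symmetric arrangement, in which each BMC chip holds one private memory for its own data and one shared memory for its peer's data, can be sketched as below. The class `BmcNode`, the single monitored parameter and the limit `MAX_CPU_TEMP_C` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

Reading = Dict[str, float]
MAX_CPU_TEMP_C = 85.0  # assumed threshold, not taken from the application


@dataclass
class BmcNode:
    """One BMC chip: private memory for own data, shared memory for peer data."""
    name: str
    private_memory: Reading = field(default_factory=dict)
    shared_memory: Reading = field(default_factory=dict)
    peer: Optional["BmcNode"] = field(default=None, repr=False, compare=False)

    def report(self, data: Reading) -> None:
        # Record locally and mirror the data into the peer's shared memory.
        self.private_memory = dict(data)
        if self.peer is not None:
            self.peer.shared_memory = dict(data)

    def peer_failed(self) -> bool:
        # Judge the peer only from what has arrived in the shared memory.
        temp = self.shared_memory.get("cpu_temp_c")
        return temp is not None and temp > MAX_CPU_TEMP_C


active = BmcNode("active")
standby = BmcNode("standby")
active.peer, standby.peer = standby, active

active.report({"cpu_temp_c": 62.0})   # lands in the standby's shared memory
standby.report({"cpu_temp_c": 93.0})  # lands in the active's shared memory
```

Here `standby.peer_failed()` evaluates the active server's data and `active.peer_failed()` evaluates the standby server's data, so either side can trigger a takeover of the other.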

Based on the foregoing embodiment of a monitoring system for server cluster failures, an embodiment of the present application provides a monitoring method for server cluster failures, where the method is applied to a BMC chip of an active server, and includes: acquiring data information of an active server; and writing the data information into the first private memory, and simultaneously sending the data information to the standby server BMC chip so that the standby server BMC chip can write the data information into the first shared memory and read the data information in the first shared memory in real time to judge whether the active server fails according to the data information.

Since the embodiment of the method portion corresponds to the embodiment of the system portion, please refer to the description of the embodiment of the system portion for the embodiment of the method portion, which is not repeated here.

In the embodiment of the application, the active server BMC chip writes data information of a related active server into a first private memory of the active server after acquiring the data information, and simultaneously sends the data information to the standby server BMC chip, and the standby server BMC chip reads the data information in the first shared memory in real time to realize real-time fault monitoring of the active server by the standby server, so that the fault transfer time is reduced, the fault tolerance of a server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

Based on the monitoring system for the server cluster fault in the embodiment, the embodiment of the application further provides a monitoring method for the server cluster fault, and the monitoring method is applied to a standby server BMC chip. Fig. 2 is a flowchart of a method for monitoring a server cluster fault according to an embodiment of the present application, where as shown in fig. 2, the method for monitoring a server cluster fault includes:

S10: When the active server BMC chip acquires the data information of the active server and writes it into the first private memory, receive the data information sent by the active server BMC chip.

S11: Write the data information into the first shared memory.

S12: Read the data information in the first shared memory in real time.

S13: Judge whether the data information meets the preset requirement; if not, go to step S14.

S14: Determine that the active server has failed.

S15: Send an alert prompt to the CPU of the standby server so that the CPU starts a state synchronization mechanism to take over the tasks performed by the active server.

In the embodiment of the application, whether the data information meets the preset requirement can be judged by checking whether certain hardware parameters, such as the fan speed or the CPU temperature, exceed a threshold value. If a hardware parameter exceeds its threshold value, the active server is determined to have failed, and an alert prompt is sent to the CPU of the standby server so that the CPU starts a state synchronization mechanism to take over the tasks executed by the active server.
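
One possible form of the preset requirement is a table of per-parameter limits. In the sketch below the parameter names, the numeric limits, the use of both a lower and an upper bound, and the decision to treat a missing reading as a violation are all assumptions; the application only states that parameters are compared against a threshold value.

```python
from typing import Dict, List, Tuple

# Assumed (minimum, maximum) limits per monitored hardware parameter.
LIMITS: Dict[str, Tuple[float, float]] = {
    "fan_rpm": (1000.0, 12000.0),
    "cpu_temp_c": (0.0, 85.0),
}


def violated_parameters(data: Dict[str, float]) -> List[str]:
    """Return the sorted names of parameters that are missing or out of range."""
    bad = []
    for name, (low, high) in LIMITS.items():
        value = data.get(name)
        # Assumption: a reading that never arrived counts as a violation.
        if value is None or not (low <= value <= high):
            bad.append(name)
    return sorted(bad)


def meets_preset_requirement(data: Dict[str, float]) -> bool:
    return not violated_parameters(data)


healthy = {"fan_rpm": 5000.0, "cpu_temp_c": 60.0}
faulty = {"fan_rpm": 300.0, "cpu_temp_c": 91.0}
```

Returning the list of violated parameters rather than a bare boolean lets the alert sent to the standby CPU say which hardware reading triggered the failover.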

In the embodiment of the application, the standby server BMC chip receives the data information of the active server sent by the active server BMC chip at the first time, and the standby server BMC chip can read the data information in the first shared memory in real time to realize real-time fault monitoring of the standby server on the active server, so that the fault transfer time is reduced, the fault tolerance of a server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

In the above embodiments, a monitoring system for a server cluster fault is described in detail, and the present application also provides embodiments corresponding to a monitoring device for a server cluster fault. It should be noted that the present application describes the embodiments of the apparatus portion from two perspectives, one from the perspective of the function module and the other from the perspective of the hardware.

Fig. 3 is a structural diagram of a monitoring device for server cluster faults according to an embodiment of the present application. As shown in fig. 3, the apparatus for monitoring server cluster failure includes:

the receiving module 10 is configured to receive the data information sent by the BMC chip of the active server when the BMC chip of the active server obtains the data information of the active server and writes the data information into the first private memory.

The writing module 11 is configured to write the data information into the first shared memory.

The determining module 12 is configured to read data information in the first shared memory in real time, so as to determine whether the active server fails according to the data information.

Based on the above embodiment, as a preferred embodiment, the judging module includes:

the judging unit is used for judging whether the data information meets the preset requirement or not;

and the determining unit is used for determining that the active server fails when the data information does not meet the preset requirement.

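Based on the above embodiment, as a preferred embodiment, the apparatus further includes:

an alarm module, configured to send an alert prompt to the CPU of the standby server so that the CPU starts a state synchronization mechanism to take over the tasks executed by the active server.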

Since the embodiment of the apparatus portion corresponds to the embodiment of the system portion, please refer to the description of the embodiment of the system portion for the embodiment of the apparatus portion, which is not repeated here.

According to the monitoring device for the server cluster fault, when the active server BMC chip acquires the data information of the active server and writes the data information into the first private memory, the data information sent by the active server BMC chip is received; writing the data information into a first shared memory; and reading the data information in the first shared memory in real time so as to judge whether the active server fails according to the data information. In the embodiment of the application, the standby server BMC chip receives the data information of the active server sent by the active server BMC chip at the first time, and the standby server BMC chip can read the data information in the first shared memory in real time to realize real-time fault monitoring of the standby server on the active server, so that the fault transfer time is reduced, the fault tolerance of a server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

Fig. 4 is a structural diagram of another monitoring device for server cluster faults provided in an embodiment of the present application, and as shown in fig. 4, the monitoring device for server cluster faults includes: a memory 20 for storing a computer program;

the processor 21 is configured to implement the steps of the method for monitoring server cluster failure according to the above embodiment when executing the computer program.

The monitoring device for the server cluster fault provided by this embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.

The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 21 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 21 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

The memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 20 is at least used for storing the following computer program 201, wherein after being loaded and executed by the processor 21, the computer program can implement the relevant steps of the monitoring method for server cluster failure disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and the storage manner may be a transient storage manner or a permanent storage manner. Operating system 202 may include, among others, Windows, Unix, Linux, and the like. Data 203 may include, but is not limited to, data information, and the like.

In some embodiments, the device for monitoring server cluster failure may further include a display screen 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.

Those skilled in the art will appreciate that the configuration shown in fig. 4 does not constitute a limitation of the means for monitoring server cluster failures and may include more or fewer components than those shown.

The device for monitoring the server cluster faults comprises a memory and a processor, and the processor can realize the method for monitoring the server cluster faults when executing programs stored in the memory.

In the monitoring device for the server cluster fault provided by the embodiment of the application, when the active server BMC chip acquires the data information of the active server and writes the data information into the first private memory, the data information sent by the active server BMC chip is received; writing the data information into a first shared memory; and reading the data information in the first shared memory in real time so as to judge whether the active server fails according to the data information. In the embodiment of the application, the standby server BMC chip receives the data information of the active server sent by the active server BMC chip at the first time, and the standby server BMC chip can read the data information in the first shared memory in real time to realize real-time fault monitoring of the standby server on the active server, so that the fault transfer time is reduced, the fault tolerance of a server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium stores thereon a computer program, which when executed by the processor implements the steps described in the above method embodiments (which may be a method corresponding to the active server BMC chip side, a method corresponding to the standby server BMC chip side, or a method corresponding to the active server BMC chip side and the standby server BMC chip side).

It is to be understood that if the method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, in whole or in part, may be embodied in the form of a software product, which is stored in a storage medium and is used to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The monitoring medium for server cluster faults provided by the embodiment of the application can realize the following method: the standby server BMC chip receives the data information of the active server sent by the active server BMC chip at the first time, and the standby server BMC chip can read the data information in the first shared memory in real time to realize real-time fault monitoring of the active server by the standby server, so that the fault transfer time is reduced, the fault-tolerant capability of a server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

The above details describe a system, a method, a device and a medium for monitoring server cluster faults provided by the present application. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The method, the device and the medium disclosed by the embodiment correspond to the system disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the system part for description. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A monitoring system for server cluster faults is characterized by comprising an active server and a standby server;

2. The system of claim 1, wherein the active server BMC chip further comprises a second shared memory and the standby server BMC chip further comprises a second private memory.

3. A method for monitoring server cluster failures, applied to the active server BMC chip of claim 1 or 2, comprising:

acquiring data information of an active server;

4. A method for monitoring server cluster failure, applied to the standby server BMC chip of claim 1 or 2, comprising:

writing the data information into a first shared memory;

5. The method for monitoring server cluster faults according to claim 4, wherein the determining whether the active server fails according to the data information includes:

judging whether the data information meets a preset requirement or not;

and if not, determining that the active server fails.

6. The method for monitoring server cluster failure according to claim 5, further comprising, after determining that the active server fails:

7. A device for monitoring server cluster faults is characterized by comprising:

8. The apparatus for monitoring server cluster failure according to claim 7, further comprising:

9. A server cluster failure monitoring apparatus comprising a memory for storing a computer program;

a processor for implementing the steps of the method for monitoring of server cluster failures according to any of claims 3 to 6 when executing said computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method for monitoring a server cluster failure according to any one of claims 3 to 6.