CN114189429B - Monitoring system, method, device and medium for server cluster faults

Info

Publication number: CN114189429B
Application number: CN202111415524.3A
Authority: CN
Inventors: 苏康; 郭芬; 满宏涛; 李拓
Original assignee: Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Current assignee: Shandong Yunhai Guochuang Cloud Computing Equipment Industry Innovation Center Co Ltd
Priority date: 2021-11-25
Filing date: 2021-11-25
Publication date: 2024-07-26
Anticipated expiration: 2041-11-25
Also published as: CN114189429A

Abstract

The application discloses a monitoring system, a method, a device and a medium for server cluster faults, wherein the monitoring system comprises an active server and a standby server; the active server BMC chip is in communication connection with the standby server BMC chip; the active server BMC chip comprises a first private memory, and the standby server BMC chip comprises a first shared memory; the active server BMC chip is used for writing the data information of the active server into the first private memory and sending the data information to the standby server BMC chip; the standby server BMC chip is used for writing the data information into the first shared memory so as to judge whether the active server fails according to the data information. Through the interconnection between the active server BMC chip and the standby server BMC chip, real-time fault monitoring of the active server by the standby server is realized, the failover time is reduced, the fault tolerance of the server cluster is enhanced, and the loss caused by the fault of the active server is reduced.

Description

Monitoring system, method, device and medium for server cluster faults

Technical Field

The present application relates to the field of big data processing technologies, and in particular, to a method, a system, an apparatus, and a medium for monitoring a server cluster fault.

Background

As businesses grow and data accumulates, even a single high-performance server cannot handle large data volumes and concentrated access by many concurrent users. The fault tolerance of a single server is also very limited: when the server fails, losses such as forced service interruption and data loss can occur. Server clusters were created to improve the overall computing power and fault tolerance of servers. A server cluster can use multiple computers for parallel computation to obtain high computing speed, and can also use multiple computers for backup, so that the whole system still operates normally when any one server fails. Currently, failover clusters are designed for applications with long-running in-memory state or with large, frequently updated data state; typical application areas include file servers, print servers, and database servers. Such clusters are mainly used to build high-availability architectures. Multiple cluster servers (called nodes) are connected by physical cables and software, and if one node fails, another node takes over and begins to provide services through the failover process.

The first step of the failover process is to determine that the active server is no longer functioning properly. Typically, the system uses a heartbeat mechanism for this: either the active server sends a specified signal to the standby server at defined intervals, or the standby server sends a request to the active server and waits for a response. Detecting a failure with a heartbeat mechanism takes a certain amount of time, and to be sure that the active server has really failed, the standby server may need to wait for an even longer interval for the signal or response. Furthermore, when certain hardware parameters of the active server (such as fan speed or chassis temperature) exceed their thresholds, the system can still operate normally for a period of time and the CPU does not learn of the fault immediately. During this period the standby server still receives normal signals from the active server and cannot learn the true operating condition of the active server in time.
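The detection delay described above can be illustrated with a minimal sketch. The class name `HeartbeatMonitor`, the 1-second interval, and the tolerance of three missed signals are assumptions chosen for illustration only; the application does not specify them.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Standby-side view of a heartbeat link (hypothetical names and values)."""
    interval_s: float          # how often the active server sends its signal
    missed_allowed: int        # consecutive misses tolerated before declaring failure
    last_seen_s: float = 0.0   # time of the most recent heartbeat

    def on_heartbeat(self, now_s: float) -> None:
        self.last_seen_s = now_s

    def is_failed(self, now_s: float) -> bool:
        # Failure is declared only after the whole timeout window has elapsed.
        return now_s - self.last_seen_s > self.interval_s * self.missed_allowed


monitor = HeartbeatMonitor(interval_s=1.0, missed_allowed=3)
monitor.on_heartbeat(now_s=10.0)                  # last signal arrives at t = 10 s
# Suppose the active server stops responding right after t = 10 s.
detected_at_11 = monitor.is_failed(now_s=11.0)    # still inside the timeout window
detected_at_14 = monitor.is_failed(now_s=14.0)    # 3 s window exceeded
```

In this sketch the standby side cannot report the failure until more than three seconds after the last signal, and a hardware parameter that is out of range while heartbeats keep arriving is never reported at all.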

Therefore, how to improve the timeliness of server cluster fault monitoring to effectively reduce the loss caused by the fault of the active server is a problem that needs to be solved by those skilled in the art.

Disclosure of Invention

The application aims to provide a monitoring system, a method, a device and a medium for server cluster faults, which are used for improving the timeliness of server cluster fault monitoring so as to effectively reduce the loss caused by active server faults.

In order to solve the technical problems, the application provides a monitoring system for server cluster faults, which comprises an active server and a standby server;

The active server BMC chip is in communication connection with the standby server BMC chip;

the active server BMC chip comprises a first private memory, and the standby server BMC chip comprises a first shared memory;

The active server BMC chip is used for writing the data information of the active server into the first private memory and simultaneously sending the data information to the standby server BMC chip;

The standby server BMC chip is used for writing the data information into the first shared memory so as to read the data information in the first shared memory in real time, and judging whether the active server fails according to the data information.

Preferably, the active server BMC chip further includes a second shared memory, and the standby server BMC chip further includes a second private memory.

The application also provides a monitoring method of server cluster faults, which is applied to the BMC chip of the active server and comprises the following steps:

Acquiring data information of the active server;

And writing the data information into a first private memory, and simultaneously sending the data information to a standby server BMC chip so that the standby server BMC chip writes the data information into the first shared memory, and reading the data information in the first shared memory in real time to judge whether the active server fails according to the data information.

The application also provides a monitoring method of server cluster faults, which is applied to the standby server BMC chip and comprises the following steps:

when the BMC chip of the active server acquires data information of the active server and writes the data information into a first private memory, receiving the data information sent by the BMC chip of the active server;

writing the data information into a first shared memory;

And reading the data information in the first shared memory in real time to judge whether the active server fails according to the data information.

Preferably, the determining whether the active server fails according to the data information includes:

Judging whether the data information meets a preset requirement or not;

if not, determining that the active server fails.

Preferably, after determining that the active server fails, the method further comprises:

and sending an alarm prompt to the CPU of the standby server so that the CPU starts a state synchronization mechanism to take over tasks executed by the active server.

The application also provides a device for monitoring server cluster faults, which comprises:

The receiving module is used for receiving the data information sent by the active server BMC chip when the active server BMC chip acquires the data information of the active server and writes the data information into the first private memory;

the writing module is used for writing the data information into a first shared memory;

And the judging module is used for reading the data information in the first shared memory in real time so as to judge whether the active server fails according to the data information.

Preferably, the device further comprises:

and the alarm module is used for sending an alarm prompt to the CPU of the standby server so that the CPU starts a state synchronization mechanism to take over the task executed by the active server.

The application also provides a monitoring device for server cluster faults, which comprises a memory for storing a computer program;

and the processor is used for realizing the steps of the server cluster fault monitoring method when executing the computer program.

The application also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor implements the steps of the method for monitoring server cluster faults.

The application provides a monitoring system for server cluster faults, which comprises an active server and a standby server; the active server BMC chip is in communication connection with the standby server BMC chip; the active server BMC chip comprises a first private memory, and the standby server BMC chip comprises a first shared memory; the active server BMC chip is used for writing the data information of the active server into the first private memory and simultaneously sending the data information to the standby server BMC chip; the standby server BMC chip is used for writing the data information into the first shared memory so as to read the data information in the first shared memory in real time and judging whether the active server fails according to the data information. According to the application, through interconnection between the BMC chips of the active server and the BMC chips of the standby server, real-time fault monitoring of the standby server on the active server is realized, so that the fault transfer time is reduced, the fault tolerance of a server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

The method, the device and the medium for monitoring the server cluster faults correspond to the system, and the effect is as above.

Drawings

To describe the embodiments of the present application more clearly, the drawings required by the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a block diagram of a monitoring system for server cluster failure according to an embodiment of the present application;

Fig. 2 is a flowchart of a method for monitoring a server cluster fault according to an embodiment of the present application;

Fig. 3 is a block diagram of a monitoring device for server cluster failure according to an embodiment of the present application;

Fig. 4 is a block diagram of another monitoring device for server cluster failure according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.

The application provides a monitoring system, a method, a device and a medium for server cluster faults.

To help those skilled in the art better understand the solution of the present application, the application is described in further detail below with reference to the accompanying drawings and specific embodiments. The application mainly relates to the baseboard management controller (Baseboard Management Controller, BMC), which can monitor system temperature, voltage, fans, power supply, and so on. The BMC also records information and logs for various hardware, which are used to notify the user and to locate problems later; it has other functions as well, which are not listed here.

Fig. 1 is a block diagram of a monitoring system for server cluster failure according to an embodiment of the present application. As shown in fig. 1, the monitoring system for server cluster failure includes an active server 1 and a standby server 2, where the active server 1 is provided with an active server BMC chip 3, and the standby server 2 is provided with a standby server BMC chip 4, and the active server BMC chip 3 includes a first private memory 5 and a second shared memory 6, and the standby server BMC chip 4 includes a second private memory 7 and a first shared memory 8. The active server BMC chip 3 and the standby server BMC chip 4 are in communication connection; the active server BMC chip 3 is used for writing the data information of the active server 1 into the first private memory 5 and simultaneously sending the data information to the standby server BMC chip 4; the standby server BMC chip 4 is configured to write data information into the first shared memory 8, so as to read the data information in the first shared memory 8 in real time, and determine whether the active server 1 fails according to the data information.

In the embodiment of the present application, the communication connection between the active server BMC chip 3 and the standby server BMC chip 4 may be wired or wireless; the embodiment does not limit the connection mode. The active server BMC chip 3 is the BMC chip provided in the active server 1, and the standby server BMC chip 4 is the BMC chip provided in the standby server 2. The data information of the active server 1 may be the hardware information of the active server 1 monitored by the active server BMC chip 3, including the fan speed, the central processing unit (Central Processing Unit, CPU) temperature, the power supply condition, and so on; the embodiment does not specifically limit the data information. The active server BMC chip 3 writes the data information into its own first private memory 5 for recording and at the same time sends the data information to the standby server BMC chip 4, where it is written into the first shared memory 8. The standby server BMC chip 4 reads the data information in the first shared memory 8 in real time. Once a hardware parameter in the data information exceeds its threshold, the standby server BMC chip 4 determines that the active server 1 has failed and alerts the CPU of the standby server 2, which immediately starts a state synchronization mechanism to take over the tasks executed by the active server 1, so that the failover process is completed at the earliest possible moment.
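The data flow of this embodiment can be sketched in a few lines of Python. This is only an illustrative model, not BMC firmware: the class names `ActiveBmc` and `StandbyBmc`, the dictionary used as "memory", the direct method call standing in for the communication connection, and the 90 °C limit are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict

Reading = Dict[str, float]  # hardware parameter name -> measured value


@dataclass
class StandbyBmc:
    """Model of the standby server BMC chip (hypothetical names)."""
    limits: Dict[str, float] = field(default_factory=dict)   # assumed upper limits
    shared_memory: Reading = field(default_factory=dict)     # first shared memory
    cpu_alerted: bool = False

    def receive(self, data: Reading) -> None:
        # Write the received data into the first shared memory, then inspect it.
        self.shared_memory = dict(data)
        self.check()

    def check(self) -> None:
        for name, value in self.shared_memory.items():
            limit = self.limits.get(name)
            if limit is not None and value > limit:
                # Stands in for alerting the standby server CPU.
                self.cpu_alerted = True


@dataclass
class ActiveBmc:
    """Model of the active server BMC chip (hypothetical names)."""
    peer: StandbyBmc
    private_memory: Reading = field(default_factory=dict)    # first private memory

    def publish(self, data: Reading) -> None:
        self.private_memory = dict(data)   # record locally
        self.peer.receive(data)            # and send to the standby BMC chip


standby = StandbyBmc(limits={"cpu_temp_c": 90.0})
active = ActiveBmc(peer=standby)

active.publish({"cpu_temp_c": 65.0, "fan_rpm": 8000.0})
alerted_when_normal = standby.cpu_alerted

active.publish({"cpu_temp_c": 97.0, "fan_rpm": 8000.0})
alerted_when_hot = standby.cpu_alerted
```

In the sketch the check runs on every write, which approximates the real-time reading of the first shared memory; an actual BMC could equally poll the memory on a timer.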

Similarly, while the standby server 2 is running, the standby server BMC chip 4 writes the monitored data information of the standby server 2 into its own second private memory 7 and at the same time sends that data information to the active server 1. In this case the active server 1 takes the monitoring role that the standby server 2 plays above: the active server BMC chip 3 writes the data information of the standby server 2 into its own second shared memory 6 and reads it from the second shared memory 6 in real time to determine whether the standby server 2 has failed. If the standby server 2 fails, the CPU of the active server 1 immediately starts the state synchronization mechanism to take over the tasks executed by the standby server 2, so that the failover process is completed at the earliest possible moment.
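Because the monitoring is symmetric, both BMC chips can be modeled by one node type that holds a private memory for its own data and a shared memory for the peer's data. The following is a minimal sketch under that assumption; the names `BmcNode` and `link` and the 90 °C limit are hypothetical and not taken from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BmcNode:
    """One BMC chip that both reports its own data and monitors its peer."""
    name: str
    temp_limit_c: float = 90.0                                     # assumed threshold
    private_memory: Dict[str, float] = field(default_factory=dict)
    shared_memory: Dict[str, float] = field(default_factory=dict)
    peer: Optional["BmcNode"] = field(default=None, repr=False, compare=False)
    takeover_started: bool = False

    def report(self, data: Dict[str, float]) -> None:
        # Record own data locally and send the same data to the peer.
        self.private_memory = dict(data)
        if self.peer is not None:
            self.peer.observe(data)

    def observe(self, data: Dict[str, float]) -> None:
        # Store the peer's data in shared memory and judge it immediately.
        self.shared_memory = dict(data)
        if data.get("cpu_temp_c", 0.0) > self.temp_limit_c:
            # Stands in for the local CPU starting state synchronization.
            self.takeover_started = True


def link(a: BmcNode, b: BmcNode) -> None:
    """Connect two BMC nodes so that each monitors the other."""
    a.peer, b.peer = b, a


active = BmcNode("active")
standby = BmcNode("standby")
link(active, standby)

active.report({"cpu_temp_c": 60.0})    # active server is healthy
standby.report({"cpu_temp_c": 95.0})   # standby server overheats
```

After these two reports, only the active side has started a takeover, since it is the standby server's data that violated the limit.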

Based on the monitoring system for server cluster faults in the above embodiment, the embodiment of the present application provides a method for monitoring server cluster faults, where the method is applied to an active server BMC chip and includes: acquiring data information of the active server; and writing the data information into the first private memory while sending the data information to the standby server BMC chip, so that the standby server BMC chip writes the data information into the first shared memory and reads the data information in the first shared memory in real time to judge whether the active server fails according to the data information.

Since the embodiments of the method portion correspond to those of the system portion, the embodiments of the method portion are described with reference to the embodiments of the system portion, which are not repeated herein.

According to the embodiment of the application, the active server BMC chip acquires the data information of the active server, writes it into its own first private memory, and at the same time sends it to the standby server BMC chip; the standby server BMC chip reads the data information in the first shared memory in real time, realizing real-time fault monitoring of the active server by the standby server, so that the failover time is reduced, the fault tolerance of the server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

Based on the monitoring system for server cluster faults in the above embodiment, the embodiment of the application also provides a method for monitoring server cluster faults, which is applied to the standby server BMC chip. Fig. 2 is a flowchart of a method for monitoring server cluster faults according to an embodiment of the present application. As shown in fig. 2, the method includes:

S10: when the active server BMC chip acquires the data information of the active server and writes the data information into the first private memory, the data information sent by the active server BMC chip is received.

S11: and writing the data information into the first shared memory.

S12: and reading the data information in the first shared memory in real time.

S13: judging whether the data information meets the preset requirement or not; if not, go to step S14.

S14: it is determined that the active server is malfunctioning.

S15: an alert prompt is sent to the CPU of the standby server so that the CPU initiates a state synchronization mechanism to take over tasks performed by the active server.

In the embodiment of the application, whether the data information meets the preset requirement can be judged by checking whether hardware parameters such as the fan speed and the CPU temperature exceed a threshold. If a hardware parameter exceeds its threshold, it is determined that the active server has failed, and an alarm prompt is sent to the CPU of the standby server so that the CPU starts a state synchronization mechanism to take over the tasks executed by the active server.
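Steps S11 to S15 together with the threshold check can be sketched as a single handler that runs each time data arrives (S10). The function names `violations` and `handle_data`, the parameter names, and the limit values are hypothetical. Treating the preset requirement as an allowed range, with a lower bound as well as an upper bound, is also an assumption made here so that a stalled fan is caught; the application itself only speaks of exceeding a threshold.

```python
from typing import Callable, Dict, List, Tuple

Limits = Dict[str, Tuple[float, float]]  # parameter -> (min, max), assumed values


def violations(data: Dict[str, float], limits: Limits) -> List[str]:
    """Return the names of parameters outside their allowed range (S13)."""
    bad = []
    for name, (low, high) in limits.items():
        if name in data and not (low <= data[name] <= high):
            bad.append(name)
    return bad


def handle_data(
    data: Dict[str, float],
    shared_memory: Dict[str, float],
    limits: Limits,
    alert_cpu: Callable[[List[str]], None],
) -> bool:
    """Process one batch of received data; return True if a fault is determined."""
    shared_memory.clear()
    shared_memory.update(data)          # S11: write into the first shared memory
    current = dict(shared_memory)       # S12: read it back
    bad = violations(current, limits)   # S13: preset requirement met?
    if bad:                             # S14: active server is judged faulty
        alert_cpu(bad)                  # S15: alert the standby server CPU
        return True
    return False


limits = {"fan_rpm": (2000.0, 12000.0), "cpu_temp_c": (0.0, 90.0)}
alerts: List[List[str]] = []
shared: Dict[str, float] = {}

healthy = handle_data({"fan_rpm": 6000.0, "cpu_temp_c": 55.0}, shared, limits, alerts.append)
faulty = handle_data({"fan_rpm": 500.0, "cpu_temp_c": 55.0}, shared, limits, alerts.append)
```

Passing the alert as a callback keeps the check independent of how the standby CPU is actually notified, which the application leaves open.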

According to the embodiment of the application, the standby server BMC chip receives the data information of the active server from the active server BMC chip immediately and can read the data information in the first shared memory in real time, realizing real-time fault monitoring of the active server by the standby server, so that the failover time is reduced, the fault tolerance of the server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

The above embodiments describe the monitoring system for server cluster faults in detail; the application further provides corresponding embodiments of the monitoring device for server cluster faults. It should be noted that the application describes the device embodiments from two perspectives: one based on functional modules and the other based on hardware.

Fig. 3 is a block diagram of a monitoring device for server cluster faults according to an embodiment of the present application. As shown in fig. 3, the monitoring device for server cluster failure includes:

The receiving module 10 is configured to receive, when the active server BMC chip obtains the data information of the active server and writes the data information into the first private memory, the data information sent by the active server BMC chip.

The writing module 11 is configured to write data information into the first shared memory.

The judging module 12 is configured to read the data information in the first shared memory in real time, so as to judge whether the active server fails according to the data information.

Based on the above embodiments, as a preferred embodiment, the judging module includes:

the judging unit is used for judging whether the data information meets the preset requirement;

And the determining unit is used for determining that the active server fails when the data information does not meet the preset requirement.

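Based on the above embodiment, as a preferred embodiment, the device further comprises an alarm module, which is used for sending an alarm prompt to the CPU of the standby server so that the CPU starts a state synchronization mechanism to take over the tasks executed by the active server.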

Since the embodiments of the device portion correspond to those of the system portion, reference is made to the description of the embodiments of the system portion, and thus the description thereof is omitted herein.

With the monitoring device for server cluster faults provided by the embodiment of the application, when the active server BMC chip acquires data information of the active server and writes it into the first private memory, the data information sent by the active server BMC chip is received; the data information is written into the first shared memory; and the data information in the first shared memory is read in real time to judge whether the active server fails. The standby server BMC chip thus receives the data information of the active server from the active server BMC chip immediately and can read it from the first shared memory in real time, realizing real-time fault monitoring of the active server by the standby server, so that the failover time is reduced, the fault tolerance of the server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

Fig. 4 is a block diagram of another monitoring device for server cluster failure according to an embodiment of the present application, where, as shown in fig. 4, the monitoring device for server cluster failure includes: a memory 20 for storing a computer program;

a processor 21 for implementing the steps of the method for monitoring a server cluster failure according to the above embodiment when executing a computer program.

The monitoring device for server cluster faults provided in this embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like.

Processor 21 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 21 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 21 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 21 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.

Memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 20 is at least used for storing a computer program 201, which, after being loaded and executed by the processor 21, can implement the relevant steps of the method for monitoring server cluster faults disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may further include an operating system 202, data 203, and the like, and the storage may be transient or permanent. The operating system 202 may include Windows, Unix, Linux, and the like. The data 203 may include, but is not limited to, the data information.

In some embodiments, the monitoring device for server cluster faults may further include a display 22, an input/output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.

Those skilled in the art will appreciate that the architecture shown in fig. 4 is not limiting of a server cluster failure monitoring apparatus and may include more or fewer components than shown.

The device for monitoring the server cluster faults comprises a memory and a processor, wherein the processor can realize a method for monitoring the server cluster faults when executing a program stored in the memory.

With the monitoring device for server cluster faults provided by the embodiment of the application, when the active server BMC chip acquires the data information of the active server and writes it into the first private memory, the data information sent by the active server BMC chip is received; the data information is written into the first shared memory; and the data information in the first shared memory is read in real time to judge whether the active server fails. The standby server BMC chip thus receives the data information of the active server from the active server BMC chip immediately and can read it from the first shared memory in real time, realizing real-time fault monitoring of the active server by the standby server, so that the failover time is reduced, the fault tolerance of the server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer readable storage medium stores a computer program, and when executed by a processor, the computer program realizes the steps described in the above method embodiments (the method may be a method corresponding to the active server BMC chip side, a method corresponding to the standby server BMC chip side, or a method corresponding to the active server BMC chip side and the standby server BMC chip side).

It will be appreciated that the methods of the above embodiments, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored on a computer readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and performs all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

The computer readable storage medium provided by the embodiment of the application can implement the following method: the standby server BMC chip receives the data information of the active server from the active server BMC chip immediately and can read the data information in the first shared memory in real time, realizing real-time fault monitoring of the active server by the standby server, so that the failover time is reduced, the fault tolerance of the server cluster is effectively enhanced, and the loss caused by the fault of the active server is reduced.

The system, the method, the device and the medium for monitoring server cluster faults provided by the application are described in detail above. The embodiments in the description are presented in a progressive manner, each focusing on its differences from the others, so identical or similar parts of the embodiments can be understood by cross-reference. Because the method, device and medium disclosed in the embodiments correspond to the system disclosed in the embodiments, they are described more briefly, and the relevant points can be found in the description of the system. It should be noted that those skilled in the art can make various modifications and adaptations of the application without departing from its principles, and these modifications and adaptations also fall within the scope of the application as defined in the following claims.

It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

Claims

1. The monitoring system for the server cluster faults is characterized by comprising an active server and a standby server;

The standby server BMC chip is used for writing the data information into the first shared memory so as to read the data information in the first shared memory in real time and judging whether the active server fails according to the data information; when a hardware parameter in the data information exceeds a threshold value, the standby server BMC chip determines that the active server fails and alerts the CPU of the standby server, and the CPU immediately starts a state synchronization mechanism;

the active server BMC chip further comprises a second shared memory, and the standby server BMC chip further comprises a second private memory; when the standby server runs, the standby server BMC chip writes the monitored data information of the standby server into its own second private memory and at the same time sends the data information of the standby server to the active server; the active server then takes the monitoring role of the standby server, writes the data information of the standby server into its own second shared memory, and reads the data information of the standby server in the second shared memory in real time so as to judge whether the standby server fails; if the standby server fails, the CPU of the active server immediately starts a state synchronization mechanism to take over the tasks executed by the standby server, so that the failover process is completed at the earliest possible moment.

2. A method for monitoring server cluster faults, which is applied to the active server BMC chip of claim 1, and comprises the following steps:

acquiring data information of the active server;

writing the data information into a first private memory and simultaneously sending the data information to the standby server BMC chip, so that the standby server BMC chip writes the data information into the first shared memory and reads the data information in the first shared memory in real time to judge whether the active server has failed according to the data information;

3. A method for monitoring server cluster faults, characterized by being applied to the standby server BMC chip of claim 1 and comprising the following steps:

writing the data information into a first shared memory;

reading the data information in the first shared memory in real time to judge whether the active server fails according to the data information;

the active server BMC chip further comprises a second shared memory, and the standby server BMC chip further comprises a second private memory; while the standby server is running, the standby server BMC chip writes the monitored data information of the standby server into the second private memory of the standby server and simultaneously sends that data information to the active server; the active server, acting in the role of a standby server, writes the data information of the standby server into the second shared memory and reads it from the second shared memory in real time to judge whether the standby server has failed; if the standby server has failed, the CPU of the active server immediately starts the state synchronization mechanism and takes over the tasks executed by the standby server, so that the failover process is completed promptly;

the step of judging whether the active server fails according to the data information comprises the following steps:

judging whether the data information meets a preset requirement; when a hardware parameter in the data information exceeds a threshold value, the standby server BMC chip determines that the active server has failed;

after determining that the active server has failed, the method further comprises:

4. A device for monitoring a server cluster failure, comprising:

a judging module, configured to read the data information in the first shared memory in real time so as to judge whether the active server has failed according to the data information; when a hardware parameter in the data information exceeds a threshold value, the standby server BMC chip determines that the active server has failed and raises an alarm to the CPU of the standby server, and the CPU immediately starts a state synchronization mechanism;

the active server BMC chip further comprises a second shared memory, and the standby server BMC chip further comprises a second private memory; while the standby server is running, the standby server BMC chip writes the monitored data information of the standby server into the second private memory of the standby server and simultaneously sends that data information to the active server; the active server, acting in the role of a standby server, writes the data information of the standby server into the second shared memory and reads it from the second shared memory in real time to judge whether the standby server has failed; if the standby server has failed, the CPU of the active server immediately starts the state synchronization mechanism and takes over the tasks executed by the standby server, so that the failover process is completed promptly;

the device further comprises:

5. A monitoring device for a server cluster failure, comprising a memory for storing a computer program;

and a processor for implementing the steps of the method for monitoring a server cluster failure according to claim 2 or 3 when executing the computer program.

6. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method for monitoring a server cluster failure according to claim 2 or 3.
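
As a non-limiting illustration only, the one-directional monitoring flow recited in claims 2 and 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the class names `ActiveBMC` and `StandbyBMC`, the parameter names and limits in `THRESHOLDS`, the use of dictionaries to stand in for the private and shared memories, and the direct method call standing in for the communication link between the two BMC chips are all hypothetical and do not appear in the source.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical hardware parameters and limits; the source names none.
THRESHOLDS: Dict[str, float] = {"cpu_temp_c": 95.0, "fan_fault_count": 0.0}


@dataclass
class StandbyBMC:
    """Stands in for the standby server BMC chip (assumed model)."""
    on_alarm: Callable[[List[str]], None]  # alarm path to the standby CPU
    shared_memory: Dict[str, float] = field(default_factory=dict)

    def receive(self, data: Dict[str, float]) -> bool:
        # Write the received data information into the first shared memory.
        self.shared_memory.update(data)
        return self.check()

    def check(self) -> bool:
        # Read the shared memory and compare each parameter to its threshold.
        # A parameter that has not been reported yet is treated as in range.
        exceeded = [name for name, limit in THRESHOLDS.items()
                    if self.shared_memory.get(name, limit) > limit]
        if exceeded:
            # Alarm the standby CPU, which would start state synchronization.
            self.on_alarm(exceeded)
        return bool(exceeded)


@dataclass
class ActiveBMC:
    """Stands in for the active server BMC chip (assumed model)."""
    peer: StandbyBMC
    private_memory: Dict[str, float] = field(default_factory=dict)

    def report(self, data: Dict[str, float]) -> bool:
        # Write the data information into the first private memory and,
        # at the same time, send a copy to the standby server BMC chip.
        self.private_memory.update(data)
        return self.peer.receive(dict(data))
```

In this sketch, `report` returns `True` when the standby side judges that the active server has failed. The reverse direction, in which the active server monitors the standby server through the second private memory and second shared memory, could be modeled by instantiating the same two classes with the roles swapped.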