CN109684128B - Cluster overall fault recovery method of message middleware, server and storage medium - Google Patents

Cluster overall fault recovery method of message middleware, server and storage medium

Info

Publication number
CN109684128B
CN109684128B (application CN201811379792.2A)
Authority
CN
China
Prior art keywords
cluster
message
nodes
messages
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811379792.2A
Other languages
Chinese (zh)
Other versions
CN109684128A (en)
Inventor
陈子文
陈滨
李玉龙
邓硕灵
彭世雄
俞瑾
郭未
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN STOCK EXCHANGE
Original Assignee
SHENZHEN STOCK EXCHANGE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN STOCK EXCHANGE
Priority to CN201811379792.2A
Publication of CN109684128A
Application granted
Publication of CN109684128B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1479 Generic software techniques for error detection or fault masking
    • G06F 11/1482 Generic software techniques for error detection or fault masking by means of middleware or OS functionality

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a cluster overall fault recovery method of message middleware, which comprises the following steps: when the cluster of the message middleware suffers an overall failure, searching the nodes of the cluster for a candidate node to be recovered; taking the candidate node to be recovered as a first node and processing the locally persisted historical messages; after the historical messages have been processed, receiving real-time input messages so that the first node completes failure recovery; and adding the other nodes back into the cluster in sequence, so that the cluster completes overall failure recovery. The invention also discloses a server and a computer-readable storage medium. The invention makes it possible to recover the cluster function after an overall cluster failure while still meeting the high-performance and low-latency requirements of the message middleware, thereby ensuring the availability of the system.

Description

Cluster overall fault recovery method of message middleware, server and storage medium
Technical Field
The invention relates to the technical field of internet finance, in particular to a cluster overall fault recovery method of message middleware, a server and a computer readable storage medium.
Background
Message middleware exchanges messages across platforms by means of an efficient and reliable message-passing mechanism, and is an important basic facility for communication between application components in a distributed environment. A cluster is a computer system composed of a set of loosely coupled software and hardware components that cooperate closely to perform computing work; in a certain sense it can be regarded as a single computer. A message middleware cluster is a group of application programs, or nodes, that receive the same messages through the message middleware; the nodes act as primary and standby for one another and execute the same processing logic to obtain the same results. Assuming N redundant nodes, the failure of any N-1 nodes in the cluster does not affect the function of the cluster as a whole. An overall cluster failure means that all nodes in the cluster have failed; all copies of the application are then down and the cluster function is interrupted.
Conventional message middleware (e.g., Kafka, RabbitMQ) typically has a central message server (a broker, i.e., a proxy node). Both the sender and the receiver of a message connect to the broker and rely on it for message distribution rather than communicating with each other directly. The broker decouples senders from receivers; several brokers may form a broker cluster to increase the availability of the message middleware, and senders and receivers may likewise form clusters to improve application availability. The main problem with broker-based message middleware is that every message must be forwarded through the broker, which introduces high latency and cannot meet the requirements of ultra-low-latency environments such as securities trading systems.
To address this problem, the existing scheme adopts brokerless message middleware (such as the brokerless mode of ZeroMQ), in which applications communicate with each other directly without going through a proxy node, making it well suited to ultra-low-latency scenarios. However, because there is no broker, such message middleware can only form clusters at the sender and the receiver respectively to improve availability. In the scenario of an overall cluster failure (for example, all processes exiting because of software defects, configuration errors, dirty data, machine power failure, or disk damage), the whole system becomes unavailable, which is unacceptable for high-reliability application scenarios such as a stock exchange system.
Disclosure of Invention
The main object of the present invention is to provide a cluster overall fault recovery method of message middleware, a server, and a computer-readable storage medium, so that the cluster function can be recovered after an overall cluster failure while the high-performance and low-latency requirements of the message middleware are still met, thereby ensuring the availability of the system.
In order to achieve the above object, the present invention provides a method for recovering a cluster overall failure of a message middleware, comprising the following steps:
when the cluster of the message middleware suffers an overall failure, searching the nodes of the cluster for a candidate node to be recovered;
taking the candidate node to be recovered as a first node and processing the locally persisted historical messages;
after the historical messages have been processed, receiving real-time input messages so that the first node completes failure recovery;
and adding the other nodes back into the cluster in sequence, so that the cluster completes overall failure recovery.
Preferably, the step of searching the nodes of the cluster for a candidate node to be recovered when the cluster of the message middleware suffers an overall failure includes:
acquiring the number of received messages of every node in the cluster, and recording the number of nodes from which the message count was successfully obtained;
and when the number of nodes from which the message count was successfully obtained equals the preset number of configured nodes, taking the node with the largest number of received messages as the candidate node to be recovered.
Preferably, the step of acquiring the number of received messages of every node in the cluster and recording the number of nodes from which the message count was successfully obtained further includes:
when the number of nodes from which the message count was successfully obtained is not equal to the preset number of configured nodes, acquiring the number of messages sent by the cluster and the number of messages received by the cluster downstream of the current cluster;
determining whether the number of sent messages is greater than or equal to the number of received messages;
and if so, taking the node with the largest number of received messages as the candidate node to be recovered.
Preferably, the step of taking the candidate node to be recovered as the first node and processing the locally persisted historical messages comprises:
taking the candidate node to be recovered as the first node, reading the historical messages and the acknowledged message sequence number of each sending topic from local persistent storage, and delivering the historical messages to the application;
acquiring the outgoing message sequence number of each message submitted by the application;
and if the outgoing message sequence number is less than or equal to the acknowledged message sequence number of the corresponding sending topic, discarding the message with that sequence number.
Preferably, the step of acquiring the outgoing message sequence number of each message submitted by the application further includes:
sending the message with that sequence number if the outgoing message sequence number is greater than the acknowledged message sequence number of the corresponding sending topic.
Preferably, after the step of receiving real-time input messages once the historical messages have been processed, the method further includes:
asynchronously persisting the real-time input messages;
and delivering the asynchronously persisted real-time input messages to the application.
To achieve the above object, the present invention further provides a server, which includes a processor and a cluster overall fault recovery program of message middleware stored in a memory and executable on the processor, wherein the cluster overall fault recovery program of message middleware, when executed by the processor, implements the steps of the cluster overall fault recovery method of message middleware as described above.
In order to achieve the above object, the present invention also provides a server, including:
a searching module, configured to search the nodes of the cluster for a candidate node to be recovered when the cluster of the message middleware suffers an overall failure;
a processing module, configured to take the candidate node to be recovered as a first node and process the locally persisted historical messages;
a receiving module, configured to receive real-time input messages after the historical messages have been processed, so that the first node completes failure recovery;
and an adding module, configured to add the other nodes back into the cluster in sequence, so that the cluster completes overall failure recovery.
To achieve the above object, the present invention further provides a computer-readable storage medium on which a cluster overall fault recovery program of message middleware is stored, the program being executed by a processor to implement the steps of the cluster overall fault recovery method of message middleware as described above.
According to the cluster overall fault recovery method of message middleware, the server, and the computer-readable storage medium, when the cluster of the message middleware suffers an overall failure, a candidate node to be recovered is found among the nodes of the cluster; the candidate node is then taken as the first node to process the locally persisted historical messages; after the historical messages have been processed, real-time input messages are received so that the first node completes failure recovery; and the other nodes in the cluster are added back in sequence, so that the cluster completes overall failure recovery. In this way the cluster function can be recovered after an overall cluster failure while the high-performance and low-latency requirements of the message middleware are still met, and the availability of the system is ensured.
Drawings
FIG. 1 is a schematic diagram of a server in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a method for recovering a cluster global failure in message middleware according to the present invention;
FIG. 3 is a detailed flowchart of step S1 in FIG. 2;
FIG. 4 is a detailed flowchart of step S2 in FIG. 2;
FIG. 5 is a functional module diagram of a server according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, the server of the present invention includes: a processor 1001 (such as a CPU), a user interface 1002, a memory 1003, and a communication bus 1004. The communication bus 1004 is used to enable connection and communication between these components. The user interface 1002 may include a display screen (Display) and an input unit. The memory 1003 may be a high-speed RAM or a non-volatile memory (e.g., a disk memory). The memory 1003 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the server architecture shown in FIG. 1 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in FIG. 1, the memory 1003, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a cluster overall fault recovery program of message middleware.
In the server shown in FIG. 1, the user interface 1002 is mainly used for receiving user instructions triggered by the user touching the display screen or entering instructions through the input unit. The sender and the receiver of the message middleware of the server are each provided with a shared-memory-based asynchronous persistence component, and the processor 1001 is configured to invoke the cluster overall fault recovery program of the message middleware stored in the memory 1003 and execute the following operations:
when the cluster of the message middleware suffers an overall failure, searching the nodes of the cluster for a candidate node to be recovered;
taking the candidate node to be recovered as a first node and processing the locally persisted historical messages;
after the historical messages have been processed, receiving real-time input messages so that the first node completes failure recovery;
and adding the other nodes back into the cluster in sequence, so that the cluster completes overall failure recovery.
Further, the processor 1001 may invoke the cluster overall fault recovery program of message middleware stored in the memory 1003 and also perform the following operations:
acquiring the number of received messages of every node in the cluster, and recording the number of nodes from which the message count was successfully obtained;
and when the number of nodes from which the message count was successfully obtained equals the preset number of configured nodes, taking the node with the largest number of received messages as the candidate node to be recovered.
Further, the processor 1001 may invoke the cluster overall fault recovery program of message middleware stored in the memory 1003 and also perform the following operations:
when the number of nodes from which the message count was successfully obtained is not equal to the preset number of configured nodes, acquiring the number of messages sent by the cluster and the number of messages received by the cluster downstream of the current cluster;
determining whether the number of sent messages is greater than or equal to the number of received messages;
and if so, taking the node with the largest number of received messages as the candidate node to be recovered.
Further, the processor 1001 may invoke the cluster overall fault recovery program of message middleware stored in the memory 1003 and also perform the following operations:
taking the candidate node to be recovered as the first node, reading the historical messages and the acknowledged message sequence number of each sending topic from local persistent storage, and delivering the historical messages to the application;
acquiring the outgoing message sequence number of each message submitted by the application;
and if the outgoing message sequence number is less than or equal to the acknowledged message sequence number of the corresponding sending topic, discarding the message with that sequence number.
Further, the processor 1001 may invoke the cluster overall fault recovery program of message middleware stored in the memory 1003 and also perform the following operations:
sending the message with that sequence number if the outgoing message sequence number is greater than the acknowledged message sequence number of the corresponding sending topic.
Further, the processor 1001 may invoke the cluster overall fault recovery program of message middleware stored in the memory 1003 and also perform the following operations:
asynchronously persisting the real-time input messages;
and delivering the asynchronously persisted real-time input messages to the application.
Referring to FIG. 2, in a first embodiment, the present invention provides a cluster overall fault recovery method of message middleware, comprising the following steps:
Step S1: when the cluster of the message middleware suffers an overall failure, searching the nodes of the cluster for a candidate node to be recovered;
In this embodiment, the message middleware cluster refers to a group of application programs, or nodes, that receive the same messages through the message middleware; the nodes act as primary and standby for one another and execute the same processing logic to obtain the same results. When any N-1 nodes in the cluster fail (assuming N redundant nodes), the function of the cluster as a whole is not affected. An overall cluster failure means that all nodes in the cluster have failed, all copies of the application program are down, and the cluster function is interrupted.
Each node within the cluster includes a receiver and a sender. The receiver is used to receive messages from the upstream cluster, the sender is used to send the processed output messages to the downstream cluster, and each cluster is deployed redundantly on multiple nodes to guarantee its own availability. Overall failure recovery of the cluster includes receiver recovery and sender recovery. When recovering from an overall cluster failure, the sizes of the persisted input-message files of all nodes in the cluster are obtained first, and the node with the largest file is taken as the candidate node to be recovered.
Step S2: taking the candidate node to be recovered as a first node and processing the locally persisted historical messages;
In this embodiment, the candidate node to be recovered is taken as the first node and recovered first. The receiver of this node replays input messages from the historical input-message persistent file kept in local persistent storage, and joins the reception of real-time messages once the local input-message replay is complete. In the cluster upstream of the cluster that failed as a whole, the sender uses acknowledgement-based reliable transmission, resends the messages that were not acknowledged during the overall cluster failure, and finally switches back to real-time message sending.
It can be appreciated that the present invention places shared-memory-based asynchronous persistence components on both the sender side and the receiver side of the message middleware. After the sender in the upstream cluster puts a message onto the network, it places the message into the corresponding shared memory, and the corresponding asynchronous persistence component writes the message to the disk.
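The patent does not give an implementation of this component; the following is a minimal, single-process sketch in which a thread-safe queue stands in for the shared-memory ring buffer between the middleware and the persistence component. All class and method names are illustrative, not taken from the patent.

```python
import queue
import threading

class AsyncPersister:
    """Illustrative stand-in for the shared-memory-based asynchronous persistence component."""

    def __init__(self, path: str):
        self._queue: "queue.Queue[bytes]" = queue.Queue()
        self._path = path
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def persist(self, message: bytes) -> None:
        # Called right after the message has been handed to the network;
        # returns immediately so the send/receive path stays low-latency.
        self._queue.put(message)

    def _drain(self) -> None:
        # Background writer: append queued messages to the local persistent file.
        with open(self._path, "ab") as f:
            while True:
                msg = self._queue.get()
                f.write(len(msg).to_bytes(4, "big") + msg)
                f.flush()

# Usage (illustrative): AsyncPersister("input_messages.log").persist(b"example message")
```

The design point is that the caller never waits on disk I/O; durability is provided asynchronously by the background writer, which is what the persisted historical input-message file used during recovery is built from.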
Step S3: after the historical messages have been processed, receiving real-time input messages so that the first node completes failure recovery;
In this embodiment, real-time input messages are received after the historical messages have been processed; the receiver within the cluster asynchronously persists each real-time input message and then delivers it to the application. It will be appreciated that the receivers within the cluster persist a received message asynchronously before submitting it to the application.
Step S4: adding the other nodes back into the cluster in sequence, so that the cluster completes overall failure recovery.
In this embodiment, once the first node has completed failure recovery, the cluster is, seen from the outside, successfully recovered. The other nodes then rejoin the cluster in sequence and complete their own recovery, so that the cluster completes overall failure recovery. In this way the cluster function is restored and the messages exchanged with the upstream and downstream clusters remain properly connected, without loss, duplication, or reordering.
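A condensed, runnable sketch of the four-step flow (S1 to S4) described above is given below. The Node class and its methods are hypothetical stand-ins for a real middleware node; only the control flow mirrors the text, and the candidate is chosen here simply by the largest persisted message count (the full selection logic appears in the second embodiment below).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    received_count: int                    # messages found in the node's local persistent store
    history: List[bytes] = field(default_factory=list)
    recovered: bool = False

    def replay_history(self) -> None:
        # S2: re-deliver the locally persisted historical messages to the application.
        for _message in self.history:
            pass  # in a real node: deliver(_message)

    def start_realtime_input(self) -> None:
        # S3: switch from replay to receiving (and asynchronously persisting) real-time input.
        self.recovered = True

    def rejoin(self) -> None:
        # S4: a remaining node rejoins the cluster after the first node has recovered.
        self.recovered = True

def recover_cluster(nodes: List[Node]) -> Node:
    # S1: pick the candidate node to be recovered.
    first = max(nodes, key=lambda n: n.received_count)
    first.replay_history()          # S2
    first.start_realtime_input()    # S3: the cluster is now externally recovered
    for node in nodes:              # S4: the other nodes rejoin in sequence
        if node is not first:
            node.rejoin()
    return first

print(recover_cluster([Node("A", 100), Node("B", 90), Node("C", 95)]).name)  # -> A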
The present invention provides a cluster overall fault recovery method of message middleware: when the cluster of the message middleware suffers an overall failure, a candidate node to be recovered is found among the nodes of the cluster; the candidate node is then taken as the first node to process the locally persisted historical messages; after the historical messages have been processed, real-time input messages are received so that the first node completes failure recovery; and the other nodes in the cluster are added back in sequence, so that the cluster completes overall failure recovery. In this way the cluster function can be recovered after an overall cluster failure while the high-performance and low-latency requirements of the message middleware are still met, and the availability of the system is ensured.
Referring to FIG. 3, in a second embodiment, based on the first embodiment, step S1 includes:
Step S11: acquiring the number of received messages of every node in the cluster, and recording the number of nodes from which the message count was successfully obtained;
In this embodiment, it is first determined whether all nodes in the cluster have stopped working; once all nodes are confirmed to have stopped, the number of received messages of each node is obtained. Because the message count cannot be obtained from every node (a node whose hardware has failed cannot report it), the number of nodes from which the count was successfully obtained is recorded.
Step S12: when the number of nodes from which the message count was successfully obtained equals the preset number of configured nodes, taking the node with the largest number of received messages as the candidate node to be recovered.
In this embodiment, when the number of nodes from which the message count was successfully obtained equals the number of nodes configured for the cluster, no node in the cluster has suffered a hardware fault. The node with the largest number of received messages is then taken as the candidate node to be recovered, because that node has received the most messages; if a node with fewer messages were taken as the first candidate node, the primary and standby nodes could become inconsistent. For example, suppose node A in the cluster received 100 messages before the failure and node B received 90. If node B were taken as the first candidate node, node B would receive messages 91 to 100 again, possibly in an order different from the order node A originally saw. Taking the node with the largest number of received messages as the candidate node to be recovered therefore prevents inconsistency between primary and standby messages.
Step S13: when the number of nodes from which the message count was successfully obtained is not equal to the preset number of configured nodes, acquiring the number of messages sent by the cluster and the number of messages received by the cluster downstream of the current cluster;
Step S14: determining whether the number of sent messages is greater than or equal to the number of received messages;
Step S15: if so, taking the node with the largest number of received messages as the candidate node to be recovered.
In this embodiment, when the number of nodes from which the message count was successfully obtained is not equal to the number of nodes configured for the cluster, some node in the cluster has suffered a hardware fault. It is then necessary to further check whether the messages already sent by the cluster to be recovered cover the messages received by the downstream cluster, that is, to determine whether the number of sent messages is greater than or equal to the number of received messages. If so, the node with the largest number of received messages is taken as the candidate node to be recovered; if not, the comparison ends and the overall cluster failure is not recoverable.
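As a concrete illustration of steps S11 to S15, the following sketch assumes that each node either reports a received-message count or cannot report one at all, and that the cluster-level sent count and the downstream received count are available; the function and parameter names are hypothetical.

```python
from typing import Dict, Optional

def select_candidate(received_counts: Dict[str, Optional[int]],
                     configured_node_count: int,
                     cluster_sent_count: int,
                     downstream_received_count: int) -> Optional[str]:
    # S11: counts that could actually be read; a None entry marks a node whose
    # persistent store could not be read (e.g. a hardware fault).
    readable = {node: count for node, count in received_counts.items() if count is not None}

    if len(readable) == configured_node_count:
        # S12: no hardware fault, so take the node with the most received messages.
        return max(readable, key=readable.get)

    # S13/S14: a node has a hardware fault; recovery is only safe if the messages
    # this cluster already sent cover the messages the downstream cluster received.
    if cluster_sent_count >= downstream_received_count:
        return max(readable, key=readable.get)  # S15

    return None  # the overall cluster failure is not recoverable

# Node C's store is unreadable, but the sent messages still cover downstream:
print(select_candidate({"A": 100, "B": 90, "C": None}, 3, 80, 75))  # -> A
```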
Referring to FIG. 4, in a third embodiment, based on any of the above embodiments, step S2 includes:
Step S21: taking the candidate node to be recovered as the first node, reading the historical messages and the acknowledged message sequence number of each sending topic from local persistent storage, and delivering the historical messages to the application;
In this embodiment, the candidate node to be recovered is taken as the first node and recovered first, in full-replay recovery mode: the message middleware delivers to the application the historical input messages accumulated since the system was last started, and switches seamlessly to the real-time input message flow once the historical input message flow catches up with it. The application does not need to distinguish historical messages from real-time messages; it only needs to process incoming messages and generate and send outgoing messages.
At this point the cluster enables its message-sending function but not its message-receiving function. The message middleware reads the historical messages, together with the acknowledged message sequence number of each sending topic, from local persistent storage and submits the historical messages to the application; the sender of the cluster periodically persists the acknowledged message sequence number of each topic.
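The patent does not give an implementation of this periodic checkpointing of acknowledged sequence numbers; the following is a minimal sketch under the assumption that acknowledgements arrive per topic and a JSON file serves as the persistent store. All names are hypothetical.

```python
import json
import threading
from typing import Dict

class AckCheckpointer:
    """Periodically persists the acknowledged message sequence number of each topic (illustrative)."""

    def __init__(self, path: str, interval_seconds: float = 1.0):
        self._path = path
        self._interval = interval_seconds
        self._acked: Dict[str, int] = {}
        self._lock = threading.Lock()
        self._schedule()

    def record_ack(self, topic: str, seq: int) -> None:
        # Called by the sender whenever a downstream acknowledgement arrives.
        with self._lock:
            self._acked[topic] = max(seq, self._acked.get(topic, 0))

    def _schedule(self) -> None:
        self._flush()
        timer = threading.Timer(self._interval, self._schedule)
        timer.daemon = True
        timer.start()

    def _flush(self) -> None:
        # Periodic persistence: on recovery, this file tells the sender where to resume.
        with self._lock, open(self._path, "w") as f:
            json.dump(self._acked, f)

# Usage (illustrative):
# checkpointer = AckCheckpointer("acked_seq.json", interval_seconds=0.5)
# checkpointer.record_ack("orders", 8)
```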
Step S22: acquiring the outgoing message sequence number of each message submitted by the application;
Step S23: if the outgoing message sequence number is less than or equal to the acknowledged message sequence number of the corresponding sending topic, discarding the message with that sequence number.
Step S24: if the outgoing message sequence number is greater than the acknowledged message sequence number of the corresponding sending topic, sending the message with that sequence number.
In this embodiment, the sender of the cluster needs to resend the messages submitted by the application. Because the persistence component has already recorded the acknowledged message sequence number of each topic, the sender does not need to start from the first message; it starts from the acknowledged sequence number plus one, and messages with smaller sequence numbers can simply be discarded. The sender therefore obtains the outgoing message sequence number of each message submitted by the application and compares it with the acknowledged sequence number of the corresponding sending topic: when the outgoing sequence number is less than or equal to the acknowledged sequence number, the message is discarded; correspondingly, when the outgoing sequence number is greater than the acknowledged sequence number, the message is sent.
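The following sketch illustrates this sender-side filtering during replay, assuming the acknowledged sequence numbers per topic have been read back from the persistence component; all function and variable names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def replay_outgoing(messages: Iterable[Tuple[str, int, bytes]],
                    acked_seq_by_topic: Dict[str, int],
                    send: Callable[[str, int, bytes], None]) -> None:
    for topic, seq, payload in messages:
        if seq <= acked_seq_by_topic.get(topic, 0):
            continue                    # already acknowledged downstream: discard
        send(topic, seq, payload)       # effectively resumes from acknowledged sequence number + 1

sent: List[Tuple[str, int]] = []
replay_outgoing(
    [("orders", 7, b"a"), ("orders", 8, b"b"), ("quotes", 3, b"c")],
    {"orders": 7, "quotes": 3},         # persisted acknowledged sequence numbers per topic
    lambda topic, seq, payload: sent.append((topic, seq)),
)
print(sent)  # -> [('orders', 8)]: only the unacknowledged message is resent
```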
It will be appreciated that the persistence component of the sender of the cluster may filter duplicate messages based on message sequence numbers, and the receiver of the downstream cluster may also filter duplicate messages based on message sequence numbers.
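A minimal sketch of this sequence-number-based duplicate filtering on the downstream receiver follows, under the assumption that sequence numbers increase monotonically per topic; the class is illustrative.

```python
from typing import Dict, Optional

class DedupReceiver:
    """Delivers a message only if its sequence number advances past the last one seen on its topic."""

    def __init__(self) -> None:
        self._last_seq: Dict[str, int] = {}

    def on_message(self, topic: str, seq: int, payload: bytes) -> Optional[bytes]:
        if seq <= self._last_seq.get(topic, 0):
            return None                  # duplicate of an already delivered message: drop it
        self._last_seq[topic] = seq
        return payload                   # hand the message to the application

receiver = DedupReceiver()
print([receiver.on_message("orders", s, bytes([s])) for s in (1, 2, 2, 3)])
# -> [b'\x01', b'\x02', None, b'\x03']: the resent message is filtered out
```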
The present invention also provides a server, which includes a processor and a cluster overall fault recovery program of message middleware stored in a memory and operable on the processor, wherein the cluster overall fault recovery program of message middleware, when executed by the processor, implements the steps of the cluster overall fault recovery method of message middleware as described above.
The present invention also provides a server 10. Referring to FIG. 5, in an embodiment, the server 10 includes:
a searching module 101, configured to search the nodes of the cluster for a candidate node to be recovered when the cluster of the message middleware suffers an overall failure;
In this embodiment, the message middleware cluster refers to a group of application programs, or nodes, that receive the same messages through the message middleware; the nodes act as primary and standby for one another and execute the same processing logic to obtain the same results. When any N-1 nodes in the cluster fail (assuming N redundant nodes), the function of the cluster as a whole is not affected. An overall cluster failure means that all nodes in the cluster have failed, all copies of the application program are down, and the cluster function is interrupted.
Each node within the cluster includes a receiver and a sender. The receiver is used to receive messages from the upstream cluster, the sender is used to send the processed output messages to the downstream cluster, and each cluster is deployed redundantly on multiple nodes to guarantee its own availability. Overall failure recovery of the cluster includes receiver recovery and sender recovery. When recovering from an overall cluster failure, the sizes of the persisted input-message files of all nodes in the cluster are obtained first, and the node with the largest file is taken as the candidate node to be recovered.
a processing module 102, configured to take the candidate node to be recovered as a first node and process the locally persisted historical messages;
In this embodiment, the candidate node to be recovered is taken as the first node and recovered first. The receiver of this node replays input messages from the historical input-message persistent file kept in local persistent storage, and joins the reception of real-time messages once the local input-message replay is complete. In the cluster upstream of the cluster that failed as a whole, the sender uses acknowledgement-based reliable transmission, resends the messages that were not acknowledged during the overall cluster failure, and finally switches back to real-time message sending.
It can be appreciated that the present invention places shared-memory-based asynchronous persistence components on both the sender side and the receiver side of the message middleware. After the sender in the upstream cluster puts a message onto the network, it places the message into the corresponding shared memory, and the corresponding asynchronous persistence component writes the message to the disk.
a receiving module 103, configured to receive real-time input messages after the historical messages have been processed, so that the first node completes failure recovery;
In this embodiment, real-time input messages are received after the historical messages have been processed; the receiver within the cluster asynchronously persists each real-time input message and then delivers it to the application. It will be appreciated that the receivers within the cluster persist a received message asynchronously before submitting it to the application.
and an adding module 104, configured to add the other nodes back into the cluster in sequence, so that the cluster completes overall failure recovery.
In this embodiment, once the first node has completed failure recovery, the cluster is, seen from the outside, successfully recovered. The other nodes then rejoin the cluster in sequence and complete their own recovery, so that the cluster completes overall failure recovery. In this way the cluster function is restored and the messages exchanged with the upstream and downstream clusters remain properly connected, without loss, duplication, or reordering.
According to the server provided by the invention, when the cluster of the message middleware suffers an overall failure, a candidate node to be recovered is found among the nodes of the cluster; the candidate node is then taken as the first node to process the locally persisted historical messages; after the historical messages have been processed, real-time input messages are received so that the first node completes failure recovery; and the other nodes in the cluster are added back in sequence, so that the cluster completes overall failure recovery. In this way the cluster function can be recovered after an overall cluster failure while the high-performance and low-latency requirements of the message middleware are still met, and the availability of the system is ensured.
The present invention also provides a computer-readable storage medium having stored thereon a cluster overall fault recovery program of message middleware, which is executed by a processor to implement the steps of the cluster overall fault recovery method of message middleware as described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes several instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, a controlled terminal, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. A cluster overall fault recovery method of message middleware, characterized by comprising the following steps:
when the cluster of the message middleware suffers an overall failure, searching the nodes of the cluster for a candidate node to be recovered, including: acquiring the number of received messages of every node in the cluster, and recording the number of nodes from which the message count was successfully obtained; and when the number of nodes from which the message count was successfully obtained equals the preset number of configured nodes, taking the node with the largest number of received messages as the candidate node to be recovered;
taking the candidate node to be recovered as a first node and processing the locally persisted historical messages;
after the historical messages have been processed, receiving real-time input messages so that the first node completes failure recovery;
and adding the other nodes back into the cluster in sequence, so that the cluster completes overall failure recovery.
2. The cluster overall fault recovery method of message middleware according to claim 1, wherein the step of acquiring the number of received messages of every node in the cluster and recording the number of nodes from which the message count was successfully obtained further comprises:
when the number of nodes from which the message count was successfully obtained is not equal to the preset number of configured nodes, acquiring the number of messages sent by the cluster and the number of messages received by the cluster downstream of the current cluster;
determining whether the number of sent messages is greater than or equal to the number of received messages;
and if so, taking the node with the largest number of received messages as the candidate node to be recovered.
3. The cluster overall fault recovery method of message middleware according to claim 1 or 2, wherein the step of taking the candidate node to be recovered as the first node and processing the locally persisted historical messages comprises:
taking the candidate node to be recovered as the first node, reading the historical messages and the acknowledged message sequence number of each sending topic from local persistent storage, and delivering the historical messages to the application;
acquiring the outgoing message sequence number of each message submitted by the application;
and if the outgoing message sequence number is less than or equal to the acknowledged message sequence number of the corresponding sending topic, discarding the message with that sequence number.
4. The cluster overall fault recovery method of message middleware according to claim 3, wherein the step of acquiring the outgoing message sequence number of each message submitted by the application is further followed by:
sending the message with that sequence number if the outgoing message sequence number is greater than the acknowledged message sequence number of the corresponding sending topic.
5. The cluster overall fault recovery method of message middleware according to claim 1, wherein after the step of receiving real-time input messages once the historical messages have been processed, the method further comprises:
asynchronously persisting the real-time input messages;
and delivering the asynchronously persisted real-time input messages to the application.
6. A server, characterized in that the server comprises a processor and a cluster overall fault recovery program of message middleware stored in a memory and operable on the processor, wherein the cluster overall fault recovery program of message middleware, when executed by the processor, implements the steps of the cluster overall fault recovery method of message middleware according to any one of claims 1 to 5.
7. A server, characterized in that the server comprises:
a searching module, configured to search the nodes of the cluster for a candidate node to be recovered when the cluster of the message middleware suffers an overall failure, including: acquiring the number of received messages of every node in the cluster, and recording the number of nodes from which the message count was successfully obtained; and when the number of nodes from which the message count was successfully obtained equals the preset number of configured nodes, taking the node with the largest number of received messages as the candidate node to be recovered;
a processing module, configured to take the candidate node to be recovered as a first node and process the locally persisted historical messages;
a receiving module, configured to receive real-time input messages after the historical messages have been processed, so that the first node completes failure recovery;
and an adding module, configured to add the other nodes back into the cluster in sequence, so that the cluster completes overall failure recovery.
8. A computer-readable storage medium, on which a cluster overall fault recovery program of message middleware is stored, the cluster overall fault recovery program of message middleware being executed by a processor to implement the steps of the cluster overall fault recovery method of message middleware according to any one of claims 1 to 5.
CN201811379792.2A 2018-11-16 2018-11-16 Cluster overall fault recovery method of message middleware, server and storage medium Active CN109684128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811379792.2A CN109684128B (en) 2018-11-16 2018-11-16 Cluster overall fault recovery method of message middleware, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811379792.2A CN109684128B (en) 2018-11-16 2018-11-16 Cluster overall fault recovery method of message middleware, server and storage medium

Publications (2)

Publication Number Publication Date
CN109684128A CN109684128A (en) 2019-04-26
CN109684128B true CN109684128B (en) 2020-12-08

Family

ID=66185362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811379792.2A Active CN109684128B (en) 2018-11-16 2018-11-16 Cluster overall fault recovery method of message middleware, server and storage medium

Country Status (1)

Country Link
CN (1) CN109684128B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111338848B (en) * 2020-02-24 2021-11-19 深圳华锐金融技术股份有限公司 Failure application copy processing method and device, computer equipment and storage medium
CN111800303A (en) * 2020-09-09 2020-10-20 杭州朗澈科技有限公司 Method, device and system for guaranteeing number of available clusters in mixed cloud scene
CN112822238B (en) * 2020-12-29 2023-05-26 深圳市金证科技股份有限公司 Main node switching method and computer readable storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005301436A (en) * 2004-04-07 2005-10-27 Hitachi Ltd Cluster system and failure recovery method for it
US8683258B2 (en) * 2011-09-30 2014-03-25 Symantec Corporation Fast I/O failure detection and cluster wide failover
CN102368700B (en) * 2011-10-25 2014-10-22 曙光信息产业(北京)有限公司 Transmission method of massages in distributed system
CN103297396B (en) * 2012-02-28 2016-05-18 国际商业机器公司 The apparatus and method that in cluster system, managing failures shifts
CN103019889A (en) * 2012-12-21 2013-04-03 曙光信息产业(北京)有限公司 Distributed file system and failure processing method thereof
CN103577546B (en) * 2013-10-12 2017-06-09 北京奇虎科技有限公司 A kind of method of data backup, equipment and distributed cluster file system
US9461901B2 (en) * 2014-10-09 2016-10-04 Dell Products L.P. System and method for detection of elephant flows
CN106302569B (en) * 2015-05-14 2019-06-18 华为技术有限公司 Handle the method and computer system of cluster virtual machine
CN108572976A (en) * 2017-03-10 2018-09-25 华为软件技术有限公司 Data reconstruction method, relevant device and system in a kind of distributed data base
CN107678883A (en) * 2017-09-22 2018-02-09 郑州云海信息技术有限公司 A kind of cluster recovery method and apparatus based on storage system

Also Published As

Publication number Publication date
CN109684128A (en) 2019-04-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant