CN112445809A

CN112445809A - Distributed database node survival state detection module and method

Info

Publication number: CN112445809A
Application number: CN202011334370.0A
Authority: CN
Inventors: 王尧; 王瀚墨; 陈磊; 孙思清
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2020-11-25
Filing date: 2020-11-25
Publication date: 2021-03-05

Abstract

The invention discloses a distributed database node survival state detection module and relates to the technical field of computer communication. To address the shortcomings of existing node state detection, a monitoring unit of the state detection module monitors and acquires the health state and update time of each node, and a processing unit of the state detection module responds with data according to the topology mode of the Gossip protocol, analyzes and counts the survival state of the nodes, and extracts an inactive node list. The invention further discloses a method for detecting the survival state of distributed database nodes: in DRDB, the Inspur Cloud Sea distributed database, messages are broadcast based on the Gossip protocol, so that each node in the cluster receives and sends data packets containing the update messages of other nodes within a heartbeat cycle, and the state detection module communicates with each node in the cluster so that an inactive node list can be extracted. The invention can quickly identify whether a node is capable of normally performing functions such as replica processing, memory copying, and message transmission.

Description

Distributed database node survival state detection module and method

Technical Field

The invention relates to the technical field of computer communication, in particular to a distributed database node survival state detection module and a method.

Background

The Raft protocol is commonly used in distributed networks to keep multiple replicas consistent. Referring to fig. 1, a replica is in one of three states: Leader, Follower, or Candidate. The Leader processes all requests. A Follower is a passive receiver of requests: it receives update requests from the Leader and writes them to its local log file. A Candidate is a node standing for election. In a Raft cluster there is only one Leader within a bounded time. While the Leader operates normally, a node server holds exactly one of the two roles Leader and Follower; a Follower can become a Candidate only after the Leader fails, which triggers the election process.

Typically a Raft group has 3 replicas, and the Leader is responsible for communicating with the other 2 Followers, which do not communicate with each other. The Leader copies the log of client write requests to the Followers and maintains a heartbeat with them. Each Follower has a timeout (typically 150 ms to 300 ms) that is reset whenever a heartbeat is received. That is, each Raft group has an expiration threshold: within the same "epoch", when a Follower finds that the Leader's heartbeat has expired, the Raft election is performed again, and after a new Leader node is elected the sequence number of the "epoch" is increased by 1.
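The Follower timeout described above can be illustrated with a minimal sketch. The class name FollowerTimer, its method names, and the use of plain floating-point seconds are assumptions made for illustration only; they are not taken from DRDB or from any particular Raft library.

```python
import random


class FollowerTimer:
    """Hypothetical election timer for one Follower (times in seconds)."""

    def __init__(self, now, min_timeout=0.150, max_timeout=0.300, rng=None):
        self._rng = rng or random.Random()
        self._min = min_timeout
        self._max = max_timeout
        self.epoch = 0  # sequence number of the current "epoch"
        self.reset(now)

    def reset(self, now):
        # Called whenever a heartbeat from the Leader arrives:
        # pick a fresh timeout in the 150 ms to 300 ms range.
        self.deadline = now + self._rng.uniform(self._min, self._max)

    def expired(self, now):
        # True once no heartbeat has arrived before the deadline.
        return now >= self.deadline

    def start_election(self, now):
        # The Follower becomes a Candidate; in Raft the epoch (term) is
        # incremented for the election, so it is one higher once a new
        # Leader has been chosen.
        self.epoch += 1
        self.reset(now)
        return self.epoch


timer = FollowerTimer(now=0.0, rng=random.Random(1))
timer.reset(now=0.1)           # heartbeat received at t = 0.1 s
if timer.expired(now=0.5):     # no heartbeat for 0.4 s, beyond any timeout
    timer.start_election(now=0.5)
```

Randomizing the timeout within the range makes it unlikely that several Followers time out at the same instant, which is the usual reason Raft implementations give for a range rather than a fixed value.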

When 2 heartbeats are delivered per second, N Raft groups propagate 2N heartbeats per second through the network. When N ≥ 1000, a single node generates at least 2000 heartbeats per second, and if the cluster has M nodes (M ≥ 1000), the whole cluster handles at least 2000 × M, that is, millions of requests per second. An ordinary cluster can hardly sustain this request frequency.
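The estimate above can be checked with a few lines of arithmetic. The function name and parameters below are illustrative only; the figures of 2 heartbeats per second, 1000 Raft groups per node, and 1000 nodes come from the paragraph above.

```python
def heartbeats_per_second(raft_groups, nodes, beats_per_group=2):
    """Return (heartbeats per node, heartbeats for the whole cluster)."""
    per_node = beats_per_group * raft_groups   # 2N for one node
    return per_node, per_node * nodes          # 2N * M for the cluster


per_node, cluster = heartbeats_per_second(raft_groups=1000, nodes=1000)
print(per_node, cluster)  # 2000 2000000
```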

To cope with such ultra-large-scale QPS (Queries Per Second), vendors have used various methods. One is to add hardware capacity and improve network card bandwidth and speed. Another, used in CockroachDB, replaces the mechanism in which many Raft groups report heartbeats together with one that only transfers the survival state of each node in the cluster, which greatly reduces QPS. However, because the update timestamp of every node changes constantly, the keep-alive messages transferred in one period still carry the update time of every active node in the whole cluster. Hundreds of nodes each need to store locally an active-state record table for the other nodes in the cluster, and this table guides each Raft group in its subsequent processing. Although this approach somewhat relieves the burden of millions of requests, it still delivers thousands of heartbeat requests per second.

Disclosure of Invention

The processing logic of the nodes in a cluster is complex, and node state changes are unpredictable. The stability of the cluster can be guaranteed continuously only if node health is observed and collected correctly. Although conventional inter-node heartbeats are highly reliable, in a relatively stable system each heartbeat delivered by an active node is of little use to the receiver. If all nodes in the cluster are assumed to be normal by default, then the notifications that matter most to a receiving node are those about nodes that are not behaving normally. On this basis, the invention provides a distributed database node survival state detection module and method that can quickly identify whether a node is capable of normally performing functions such as replica processing, memory copying, and message transmission.

Firstly, the invention provides a distributed database node survival state detection module, and the technical scheme adopted for solving the technical problems is as follows:

a distributed database node survival status detection module communicatively connecting each node in a cluster, comprising:

a monitoring unit for monitoring and acquiring the health status and update time of each node,

and a processing unit for responding with data according to the topology mode of the Gossip protocol, analyzing and counting the survival state of the nodes, and extracting an inactive node list.

Furthermore, the state detection module provides writable and read-only functions; each node in the cluster communicates with the monitoring unit in an orderly, periodic manner and writes its own health state and update time into the monitoring unit.
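One way to picture the writable and read-only functions is a small registry that nodes write into and that readers can only view. This is a sketch under stated assumptions: the names MonitoringUnit, NodeRecord, write, and snapshot are hypothetical, and the patent does not specify how the two access modes are implemented.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class NodeRecord:
    healthy: bool        # health state reported by the node
    updated_at: float    # update time reported by the node


class MonitoringUnit:
    """Hypothetical registry holding the latest report of every node."""

    def __init__(self):
        self._records = {}

    def write(self, node_id, healthy, updated_at):
        # Writable function: each node periodically stores its own
        # health state and update time.
        self._records[node_id] = NodeRecord(healthy, updated_at)

    def snapshot(self):
        # Read-only function: a view over a copy that callers cannot modify.
        return MappingProxyType(dict(self._records))


unit = MonitoringUnit()
unit.write("node-1", healthy=True, updated_at=100.0)
unit.write("node-2", healthy=False, updated_at=98.5)
view = unit.snapshot()
print(view["node-2"].healthy)  # False
```

Attempting to assign into the returned view raises TypeError, so consumers such as the processing unit cannot alter what the nodes reported.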

Further, after each timeout period, the processing unit uses the Gossip protocol in topology mode to transmit, in a single pass, the list of nodes whose health state has changed in the current cluster, together with their final state information, to the whole cluster. A node's health state change is either normal to abnormal or abnormal to normal.
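The idea of sending only the nodes whose state changed, rather than every node's timestamp, can be sketched as follows. The function name detect_changes, the dictionary layouts, and the rule that a stale report counts as abnormal are assumptions for illustration; nodes without a previously known state are treated as normal, following the default stated in the description.

```python
def detect_changes(previous, reports, now, timeout):
    """Compare the latest reports with the last broadcast state.

    previous: dict node_id -> bool, state broadcast in the last period
    reports:  dict node_id -> (healthy, updated_at), latest node reports
    Returns (changes, current); changes maps node_id -> final state.
    """
    current = {}
    for node_id, (healthy, updated_at) in reports.items():
        fresh = (now - updated_at) <= timeout  # assumed staleness rule
        current[node_id] = healthy and fresh
    # Nodes with no known previous state are assumed normal by default.
    changes = {n: s for n, s in current.items() if previous.get(n, True) != s}
    return changes, current


previous = {"a": True, "b": True, "c": False}
reports = {"a": (True, 9.5), "b": (True, 5.0), "c": (True, 9.8)}
changes, current = detect_changes(previous, reports, now=10.0, timeout=1.0)
inactive = sorted(n for n, ok in current.items() if not ok)
print(changes)   # {'b': False, 'c': True}
print(inactive)  # ['b']
```

In this example only two entries would be disseminated: node b went from normal to abnormal because its report is stale, and node c recovered. Node a, whose state did not change, generates no traffic.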

Furthermore, when the state detection module performs copy timeout detection:

each Follower node in the cluster maintains a timer for checking whether the state of the Leader node of its Raft group is normal;

the cluster state monitoring module periodically scans the cluster state results through cluster state monitoring nodes, and the cluster state monitoring nodes store a health association table between each node and its corresponding Leader node:

(a) when the Leader node state is normal, increasing the lease period by 1, and continuously monitoring the Leader node state in the next lease period;

(b) when the state of the Leader node is abnormal, the Follower node actively applies for changing into the Candidate node, and the Raft group performs re-election operation.
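Steps (a) and (b) can be sketched as a check that a Follower runs at the end of each lease period. The class LeaderLeaseWatcher, the shape of the health table (node id mapped to a boolean), and the default of treating an unlisted Leader as normal are assumptions made for illustration.

```python
class LeaderLeaseWatcher:
    """Hypothetical per-Follower view of the Leader's lease."""

    def __init__(self, leader_id):
        self.leader_id = leader_id
        self.lease = 0            # number of lease periods granted so far
        self.role = "Follower"

    def on_lease_expired(self, health_table):
        # health_table: dict node_id -> bool, as read from the
        # cluster state monitoring node (assumed layout).
        if health_table.get(self.leader_id, True):
            # (a) Leader is normal: extend the lease by 1 and keep watching.
            self.lease += 1
        else:
            # (b) Leader is abnormal: apply to become a Candidate so the
            # Raft group can run a new election.
            self.role = "Candidate"
        return self.role


watcher = LeaderLeaseWatcher(leader_id="node-1")
watcher.on_lease_expired({"node-1": True})    # lease becomes 1
watcher.on_lease_expired({"node-1": False})   # role becomes "Candidate"
print(watcher.lease, watcher.role)  # 1 Candidate
```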

Secondly, the invention provides a method for detecting the survival state of nodes in a distributed database, which adopts the following technical scheme for solving the technical problems:

a distributed database node survival state detection method comprises the following implementation processes:

in DRDB, the Inspur Cloud Sea distributed database, messages are broadcast based on the Gossip protocol, and each node in the cluster receives and sends data packets containing the update messages of other nodes within a heartbeat period,

the state detection module communicates with each node in the cluster to obtain the health state and update time of each node, responds with data according to the topology mode of the Gossip protocol, analyzes and counts the survival state of the nodes, and extracts an inactive node list.
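How a message spreads through a cluster under a Gossip protocol can be illustrated with a small simulation. This is a generic push-style gossip sketch, not the DRDB implementation: the function name gossip_rounds, the fanout of 3, and the random peer selection are all assumptions.

```python
import random


def gossip_rounds(node_ids, origin, fanout=3, rng=None, max_rounds=1000):
    """Count the rounds until every node has received the message.

    In each round, every informed node forwards the message to
    `fanout` randomly chosen peers (push-style gossip).
    """
    rng = rng or random.Random()
    informed = {origin}
    rounds = 0
    while len(informed) < len(node_ids) and rounds < max_rounds:
        rounds += 1
        newly_informed = set()
        for sender in informed:
            peers = [n for n in node_ids if n != sender]
            newly_informed.update(rng.sample(peers, min(fanout, len(peers))))
        informed |= newly_informed
    return rounds


nodes = list(range(100))
rounds = gossip_rounds(nodes, origin=0, fanout=3, rng=random.Random(42))
print(f"all {len(nodes)} nodes informed after {rounds} rounds")
```

Because the number of informed nodes can grow by up to a factor of fanout + 1 per round, the message typically reaches the whole cluster in a number of rounds that grows logarithmically with cluster size, and no node needs a configured list of who forwards to whom.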

Optionally, the state detection module provides writable and read-only functions; each node in the cluster communicates with the state detection module in an orderly, periodic manner and writes its own health state and update time into the state detection module.

Further optionally, after each timeout period, the state detection module uses the Gossip protocol in topology mode to transmit, in a single pass, the list of nodes whose health state has changed in the current cluster, together with their final state information, to the whole cluster. A node's health state change is either normal to abnormal or abnormal to normal.

Further optionally, when the state detection module performs copy timeout detection:

Compared with the prior art, the distributed database node survival state detection module and the method have the beneficial effects that:

(1) the invention can quickly identify whether a node is capable of normally performing functions such as replica processing, memory copying, and message transmission; it reduces the reporting and transmission of node survival messages by a factor of hundreds, significantly lowers the QPS, reduces storage occupation by a factor of tens, and ensures the performance and stability of the nodes in the network;

(2) the invention requires no configuration file: all topologies are generated automatically by the Gossip protocol, no redundant messages are transmitted, and each node ultimately receives only the health state results of the members of its own Raft groups. The nodes in the cluster periodically update their latest timestamps and transmit their health state to the state detection module in an orderly, periodic manner, and the state detection module analyzes and counts the survival state of the nodes and extracts an inactive node list.

Drawings

FIG. 1 is a Raft replica state transition diagram;

FIG. 2 is a schematic diagram of node viability detection of the present invention;

FIG. 3 is a first schematic diagram of copy timeout detection according to the present invention;

FIG. 4 is a second schematic diagram of copy timeout detection according to the present invention.

Detailed Description

To make the technical solution, the technical problems solved, and the technical effects of the present invention clearer, the technical solution is described clearly and completely below with reference to specific embodiments.

The first embodiment is as follows:

Referring to fig. 2, this embodiment provides a distributed database node survival status detection module, which is communicatively connected to each node in a cluster and includes:

In this embodiment, the status detection module provides writable and read-only functions; each node in the cluster communicates with the monitoring unit in an orderly, periodic manner and writes two items of information, its health status and its update time, into the monitoring unit.

In this embodiment, after each timeout period, the processing unit uses the Gossip protocol in topology mode to transmit, in a single pass, the list of nodes whose health state has changed in the current cluster, together with their final state information, to the whole cluster. A node's health state change is one of two types: normal to abnormal, or abnormal to normal.

In this embodiment, referring to fig. 3 and fig. 4, when the state detection module performs copy timeout detection:

The second embodiment is as follows:

Referring to fig. 2, this embodiment provides a method for detecting the node survival status of a distributed database, where the method includes:

In this embodiment, the status detection module provides writable and read-only functions; each node in the cluster communicates with the status detection module in an orderly, periodic manner and writes two items of information, its health status and its update time, into the status detection module.

In this embodiment, after each timeout period, the state detection module uses the Gossip protocol in topology mode to transmit, in a single pass, the list of nodes whose health state has changed in the current cluster, together with their final state information, to the whole cluster. A node's health state change is either normal to abnormal or abnormal to normal.

In summary, the module and method for detecting the survival state of distributed database nodes make it possible to quickly identify whether a node is capable of normally performing functions such as replica processing, memory copying, and message transmission, and ensure the performance and stability of the nodes in the cluster.

The principles and embodiments of the present invention have been described in detail using specific examples, which are provided only to aid in understanding the core technical content of the present invention. Any improvements and modifications made by those skilled in the art on the basis of the above embodiments, without departing from the principle of the present invention, shall fall within the protection scope of the present invention.

Claims

1. A distributed database node survival status detection module communicatively coupled to each node in a cluster, comprising:

2. The survival status detection module of distributed database nodes according to claim 1, wherein the status detection module provides writable and read-only functions, each node in the cluster communicates with the monitoring unit orderly and periodically, and writes the information of both the health status and the update time of the node into the monitoring unit.

3. The module of claim 2, wherein after each timeout period, the processing unit transmits a node list of health status changes in the current cluster and corresponding final status information to the entire cluster at one time in a topology mode by using Gossip protocol, wherein the health status changes of each node in the cluster include normal to abnormal and abnormal to normal.

4. The distributed database node survival state detection module of claim 2, wherein when the state detection module performs copy timeout detection:

5. A distributed database node survival state detection method is characterized in that the realization process comprises the following steps:

6. The method according to claim 5, wherein the status detection module provides writable and read-only functions, and each node in the cluster communicates with the status detection module orderly and periodically and writes both its health status and update time into the status detection module.

7. The method as claimed in claim 6, wherein after each timeout period, the status detection module transmits the node list of health status change in the current cluster and the corresponding final status information to the whole cluster at one time in a topology mode by using Gossip protocol,

wherein the health state changes of each node in the cluster include normal to abnormal and abnormal to normal.

8. The method for detecting the survival status of distributed database nodes according to any one of claims 5 to 7, wherein when the status detection module performs copy timeout detection: