CN112445809A - Distributed database node survival state detection module and method - Google Patents
Distributed database node survival state detection module and method
- Publication number
- CN112445809A
- Authority
- CN
- China
- Prior art keywords
- node
- state
- cluster
- detection module
- nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2358—Change logging, detection, and notification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Abstract
The invention discloses a distributed database node survival state detection module and relates to the technical field of computer communication. To address the shortcomings of existing node state detection, a monitoring unit of the state detection module monitors and acquires the health state and update time of each node, and a processing unit of the state detection module responds with data according to the topology mode of the Gossip protocol, analyzes and counts the survival state of the nodes, and extracts an inactive node list. The invention further discloses a distributed database node survival state detection method: in DRDB, the Inspur Yunhai distributed database, messages are broadcast based on the Gossip protocol, each node in the cluster receives and sends data packets containing the update messages of other nodes within one heartbeat period, and the state detection module communicates with each node in the cluster so that an inactive node list can be extracted. The invention can quickly identify whether a node is capable of normally performing functions such as replica processing, memory copy, and message transmission.
Description
Technical Field
The invention relates to the technical field of computer communication, in particular to a distributed database node survival state detection module and a method.
Background
The Raft protocol is commonly used in distributed networks to maintain consistency among multiple replicas. Referring to fig. 1, a replica has three possible states: Leader, Follower, and Candidate. The Leader is the processor of all requests; a Follower is a passive receiver of requests that receives update requests from the Leader and writes them to its local log file; a Candidate is a replica standing for election. In a Raft cluster there is only one Leader within a bounded time. While the Leader operates normally, a node server holds only one of the two roles Leader and Follower; a Follower can become a Candidate only after the Leader fails, which triggers the election process.
Typically a Raft group has 3 replicas, and the Leader is responsible for communicating with the other 2 Followers, which do not communicate with each other. The Leader copies the log of client write requests to the Followers and maintains a heartbeat with them. Each Follower has a timeout (typically 150 ms to 300 ms) that is reset whenever a heartbeat is received. In other words, each Raft group has an expiration time threshold: within the same "epoch", when a Follower finds that the Leader heartbeat has expired, the Raft election is performed again, and after a new Leader node is elected the sequence number of the "epoch" is increased by 1.
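The Follower timeout described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class name `FollowerTimer`, its method names, and the use of a uniformly random timeout between 150 ms and 300 ms are assumptions made for the example, and the "epoch" counter is incremented after a new Leader is elected, as the text describes.

```python
import random


class FollowerTimer:
    """Hypothetical heartbeat timer kept by one Follower (times in milliseconds)."""

    def __init__(self, now_ms, min_ms=150.0, max_ms=300.0):
        self.min_ms = min_ms
        self.max_ms = max_ms
        self.epoch = 0          # sequence number of the current "epoch"
        self.deadline_ms = 0.0
        self.on_heartbeat(now_ms)

    def on_heartbeat(self, now_ms):
        # Receiving a Leader heartbeat resets the timeout to a fresh value
        # drawn from [min_ms, max_ms] (randomization is an assumption here).
        self.deadline_ms = now_ms + random.uniform(self.min_ms, self.max_ms)

    def expired(self, now_ms):
        # True once no heartbeat has arrived before the deadline.
        return now_ms >= self.deadline_ms

    def on_leader_elected(self, now_ms):
        # After a new Leader is elected, the epoch number increases by 1
        # and the timer starts again.
        self.epoch += 1
        self.on_heartbeat(now_ms)
        return self.epoch
```

A Follower that sees `expired(now_ms)` return true would start the re-election; once it completes, `on_leader_elected` records the new epoch.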
When 2 heartbeats are delivered per second, N Raft groups deliver 2N heartbeats per second into the network. When N >= 1000, a single node generates at least 2000 heartbeats per second, and if the cluster contains M nodes (M >= 1000), the whole cluster carries at least ten thousand requests per second. An ordinary cluster can hardly sustain such a request rate.
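The arithmetic behind this estimate can be checked with a short sketch. The function name and parameters are hypothetical, and the model is the simple one used in the text: every Raft group on a node contributes a fixed number of heartbeats per second, and every node hosts the same number of groups.

```python
def heartbeats_per_second(raft_groups, nodes=1, beats_per_group=2):
    """Heartbeats per second when each Raft group on each node sends
    beats_per_group heartbeats per second (simplified model from the text)."""
    return beats_per_group * raft_groups * nodes


per_node = heartbeats_per_second(1000)                  # 2000 per node
whole_cluster = heartbeats_per_second(1000, nodes=1000)  # 2,000,000 per cluster
```

With 1000 Raft groups per node and 1000 nodes, the model gives 2,000,000 heartbeats per second, which matches the "millions of requests" scale mentioned for the prior art.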
To cope with requests at such an ultra-large QPS (Queries Per Second) scale, vendors have used various methods. One is to increase hardware performance and improve network card transmission bandwidth and speed. Another, used in CockroachDB, is to replace the mechanism in which multiple Raft groups report heartbeats together with one that only transfers the survival state of each node in the cluster, which greatly reduces the QPS. However, since the update timestamp of each node keeps changing, the keep-alive messages transferred in one period still carry the update time of every active node in the whole cluster. Hundreds of nodes each need to locally store an active state record table of the other nodes in the cluster, and this table guides each Raft group in its subsequent processing. Although this approach somewhat alleviates the problem of millions of requests, it still delivers thousands of heartbeat requests per second.
Disclosure of Invention
The processing logic of the nodes in a cluster is complex, and node state changes are unpredictable. The stability of the cluster can be continuously guaranteed only if node health is correctly observed and collected. Although conventional inter-node heartbeats are highly reliable, in a relatively stable system the heartbeat that an active node delivers each time is of little use to the receiver. If all nodes in the cluster are assumed to be normal by default, then what the receiving node cares about most is the notification that a node is not normal. Based on this, the invention designs a distributed database node survival state detection module and method that can quickly identify whether a node is capable of normally performing functions such as replica processing, memory copy, and message transmission.
Firstly, the invention provides a distributed database node survival state detection module, and the technical scheme adopted for solving the technical problems is as follows:
a distributed database node survival status detection module communicatively connecting each node in a cluster, comprising:
a monitoring unit for monitoring and acquiring the health status and update time of each node,
and the processing unit is used for responding data according to a topological mode in the Gossip protocol, analyzing and counting the survival state of the nodes and extracting an inactive node list.
Furthermore, the state detection module provides write and read-only functions; each node in the cluster communicates with the monitoring unit in order and at regular intervals, and writes its own health state and update time into the monitoring unit.
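A minimal sketch of such a monitoring unit is given below. The names `NodeRecord`, `MonitoringUnit`, `write`, `read`, and `inactive_nodes`, as well as the rule that a node counts as inactive when it reports itself unhealthy or its update time is older than a timeout, are assumptions made for illustration; the patent text does not fix these details.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    healthy: bool       # health state written by the node itself
    updated_at: float   # update time reported by the node, in seconds


class MonitoringUnit:
    """Hypothetical store offering a write path and a read-only path."""

    def __init__(self):
        self._records = {}

    def write(self, node_id, healthy, updated_at):
        # Each node periodically writes its own health state and update time.
        self._records[node_id] = NodeRecord(healthy, updated_at)

    def read(self, node_id):
        # Read-only lookup; returns None for an unknown node.
        return self._records.get(node_id)

    def inactive_nodes(self, now, timeout):
        # A node is treated as inactive if it reported itself unhealthy
        # or has not updated its record within the timeout (assumed rule).
        return sorted(
            node_id
            for node_id, rec in self._records.items()
            if not rec.healthy or now - rec.updated_at > timeout
        )
```

For example, after `write("n1", True, 10.0)` and `write("n2", True, 2.0)`, calling `inactive_nodes(now=11.0, timeout=5.0)` lists only `"n2"`, whose record is stale.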
Further, after each timeout time, the processing unit transmits a node list of health state changes in the current cluster and corresponding final state information to the whole cluster at one time in a topology mode by using a Gossip protocol, wherein the health state changes of each node in the cluster include normal to abnormal and abnormal to normal.
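The list of health state changes sent after each timeout can be illustrated with a small sketch. The function name `health_changes` and the dictionary representation are hypothetical; nodes that were not seen before are assumed to have been normal, in line with the default assumption stated earlier that all nodes are normal unless reported otherwise.

```python
def health_changes(previous, current):
    """Return {node_id: final_state} for nodes whose health state flipped.

    previous and current map node ids to True (normal) or False (abnormal).
    Only changes are reported, so unchanged nodes add no traffic.
    """
    changes = {}
    for node_id, healthy in current.items():
        # A node absent from the previous scan is assumed to have been normal.
        if previous.get(node_id, True) != healthy:
            changes[node_id] = "normal" if healthy else "abnormal"
    return changes
```

The resulting dictionary covers both directions of change (normal to abnormal and abnormal to normal) and is what would be handed to the Gossip layer in one transmission.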
Furthermore, when the related state detection module performs copy timeout detection:
each Follower node in the cluster maintains a timer for checking whether the state of the Leader node of its Raft group is normal;
the cluster state monitoring module periodically scans cluster state results through the cluster state monitoring nodes, and the cluster state monitoring nodes store a health association table between the node and corresponding Leader nodes:
(a) when the Leader node state is normal, increasing the lease period by 1, and continuously monitoring the Leader node state in the next lease period;
(b) when the state of the Leader node is abnormal, the Follower node actively applies for changing into the Candidate node, and the Raft group performs re-election operation.
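Steps (a) and (b) can be sketched as a timer callback on the Follower. The class name `FollowerLease`, the shape of the health association table (a mapping from node id to a boolean), and the choice to treat a Leader missing from the table as normal are assumptions made for this example.

```python
class FollowerLease:
    """Hypothetical lease bookkeeping kept by one Follower."""

    def __init__(self, leader_id):
        self.leader_id = leader_id
        self.lease_period = 0
        self.role = "Follower"

    def on_timer(self, health_table):
        """health_table maps node id -> True (normal) / False (abnormal)."""
        if health_table.get(self.leader_id, True):
            # (a) Leader is normal: extend the lease by one period and
            # keep monitoring in the next period.
            self.lease_period += 1
        else:
            # (b) Leader is abnormal: apply to become a Candidate so that
            # the Raft group can run a re-election.
            self.role = "Candidate"
        return self.role
```

Because the Follower consults the health table rather than waiting for a per-group heartbeat, one table lookup per timer tick replaces the per-group heartbeat traffic.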
Secondly, the invention provides a method for detecting the survival state of nodes in a distributed database, which adopts the following technical scheme for solving the technical problems:
a distributed database node survival state detection method comprises the following implementation processes:
in DRDB, the Inspur Yunhai distributed database, messages are broadcast based on the Gossip protocol, and each node in the cluster receives and sends data packets containing the update messages of other nodes within one heartbeat period,
the state detection module is communicated with each node in the cluster to obtain the health state and the updating time of each node, and carries out data response according to a topological mode in a Gossip protocol, analyzes and counts the survival state of the nodes and extracts an inactive node list.
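How a message such as the inactive node list spreads through a cluster by gossip can be illustrated with a small simulation. The text does not specify the fan-out or peer selection, so the sketch assumes plain push gossip in which every informed node forwards the message to `fanout` randomly chosen peers per round; the function name and parameters are hypothetical.

```python
import random


def gossip_rounds(node_ids, origin, fanout=2, seed=0):
    """Count push-gossip rounds until every node holds the message."""
    rng = random.Random(seed)
    informed = {origin}
    rounds = 0
    while len(informed) < len(node_ids):
        rounds += 1
        newly_informed = set()
        # Iterate in sorted order so a given seed reproduces the same run.
        for sender in sorted(informed):
            peers = [n for n in node_ids if n != sender]
            newly_informed.update(rng.sample(peers, min(fanout, len(peers))))
        informed |= newly_informed
    return rounds
```

With a fan-out of 2 the informed set can at most triple per round, so the number of rounds grows roughly logarithmically with cluster size, which is why a single change list can reach the whole cluster without every node contacting every other node.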
Optionally, the related state detection module provides writable and read-only functions, each node in the cluster communicates with the state detection module orderly and periodically, and writes two items of information of the health status and the update time of the node into the state detection module.
Further optionally, after each timeout time, the state detection module transmits the node list of the health state change in the current cluster and the corresponding final state information to the whole cluster at one time in a topology mode by using the Gossip protocol, where the health state change of each node in the cluster includes normal to abnormal and abnormal to normal.
Further optionally, when the related state detection module performs copy timeout detection:
each Follower node in the cluster maintains a timer for checking whether the state of the Leader node of its Raft group is normal;
the cluster state monitoring module periodically scans cluster state results through the cluster state monitoring nodes, and the cluster state monitoring nodes store a health association table between the node and corresponding Leader nodes:
(a) when the Leader node state is normal, increasing the lease period by 1, and continuously monitoring the Leader node state in the next lease period;
(b) when the state of the Leader node is abnormal, the Follower node actively applies for changing into the Candidate node, and the Raft group performs re-election operation.
Compared with the prior art, the distributed database node survival state detection module and the method have the beneficial effects that:
(1) the invention can quickly identify whether a node is capable of normally performing functions such as replica processing, memory copy, and message transmission; it reduces the number of node survival reports transmitted by a factor of hundreds, markedly lowers the QPS, reduces storage occupation by a factor of tens, and ensures the performance and stability of the nodes in the network;
(2) the invention requires no configuration file; all topologies are generated automatically by the Gossip protocol and no redundant messages are transmitted. Each node ultimately receives only the health state results of the members of its own Raft groups. The nodes in the cluster periodically update their latest timestamps and transmit their health state to the state detection module in order and at regular intervals, and the state detection module analyzes and counts the survival state of the nodes and extracts an inactive node list.
Drawings
FIG. 1 is a Raft replica state transition diagram;
FIG. 2 is a schematic diagram of node viability detection of the present invention;
FIG. 3 is a first schematic diagram of copy timeout detection according to the present invention;
FIG. 4 is a second schematic diagram of copy timeout detection according to the present invention.
Detailed Description
To make the technical scheme, the technical problems to be solved, and the technical effects of the present invention clearer, the technical scheme of the present invention is described clearly and completely below with reference to specific embodiments.
Example one:
referring to fig. 2, the present embodiment provides a distributed database node survival status detection module, which is communicatively connected to each node in a cluster, and includes:
a monitoring unit for monitoring and acquiring the health status and update time of each node,
and the processing unit is used for responding data according to a topological mode in the Gossip protocol, analyzing and counting the survival state of the nodes and extracting an inactive node list.
In this embodiment, the related status detection module provides writable and read-only functions, and each node in the cluster communicates with the monitoring unit in order and periodically, and writes two items of information, i.e., the health status and the update time of the node in the status detection module.
In this embodiment, after each timeout period, the processing unit uses the Gossip protocol to transmit, in topology mode and in a single pass, the list of nodes whose health state has changed in the current cluster together with their final state information to the whole cluster. The health state change of a node in the cluster is one of two types: normal to abnormal, or abnormal to normal.
In this embodiment, referring to fig. 3 and 4, when the related state detection module performs copy timeout detection:
each Follower node in the cluster maintains a timer for checking whether the state of the Leader node of its Raft group is normal;
the cluster state monitoring module periodically scans cluster state results through the cluster state monitoring nodes, and the cluster state monitoring nodes store a health association table between the node and corresponding Leader nodes:
(a) when the Leader node state is normal, increasing the lease period by 1, and continuously monitoring the Leader node state in the next lease period;
(b) when the state of the Leader node is abnormal, the Follower node actively applies for changing into the Candidate node, and the Raft group performs re-election operation.
Example two:
referring to fig. 2, the embodiment provides a method for detecting a node survival status of a distributed database, where the method includes:
in DRDB, the Inspur Yunhai distributed database, messages are broadcast based on the Gossip protocol, and each node in the cluster receives and sends data packets containing the update messages of other nodes within one heartbeat period,
the state detection module is communicated with each node in the cluster to obtain the health state and the updating time of each node, and carries out data response according to a topological mode in a Gossip protocol, analyzes and counts the survival state of the nodes and extracts an inactive node list.
In this embodiment, the related status detection module provides writable and read-only functions, and each node in the cluster communicates with the status detection module orderly and periodically, and writes two items of information, i.e., the health status and the update time of the node in the status detection module.
In this embodiment, after each timeout period, the state detection module uses the Gossip protocol to transmit, in topology mode and in a single pass, the list of nodes whose health state has changed in the current cluster together with their final state information to the whole cluster. The health state change of a node in the cluster is either normal to abnormal or abnormal to normal.
In this embodiment, referring to fig. 3 and 4, when the related state detection module performs copy timeout detection:
each Follower node in the cluster maintains a timer for checking whether the state of the Leader node of its Raft group is normal;
the cluster state monitoring module periodically scans cluster state results through the cluster state monitoring nodes, and the cluster state monitoring nodes store a health association table between the node and corresponding Leader nodes:
(a) when the Leader node state is normal, increasing the lease period by 1, and continuously monitoring the Leader node state in the next lease period;
(b) when the state of the Leader node is abnormal, the Follower node actively applies for changing into the Candidate node, and the Raft group performs re-election operation.
In summary, the distributed database node survival state detection module and method make it possible to quickly identify whether a node is capable of normally performing functions such as replica processing, memory copy, and message transmission, and ensure the performance and stability of the nodes in the cluster.
The principles and embodiments of the present invention have been described in detail using specific examples, which are provided only to aid in understanding the core technical content of the present invention. Any improvements and modifications made by those skilled in the art on the basis of the above embodiments, without departing from the principle of the present invention, shall fall within the protection scope of the present invention.
Claims (8)
1. A distributed database node survival status detection module communicatively coupled to each node in a cluster, comprising:
a monitoring unit for monitoring and acquiring the health status and update time of each node,
and the processing unit is used for responding data according to a topological mode in the Gossip protocol, analyzing and counting the survival state of the nodes and extracting an inactive node list.
2. The survival status detection module of distributed database nodes according to claim 1, wherein the status detection module provides writable and read-only functions, each node in the cluster communicates with the monitoring unit orderly and periodically, and writes the information of both the health status and the update time of the node into the monitoring unit.
3. The module of claim 2, wherein after each timeout period, the processing unit transmits a node list of health status changes in the current cluster and corresponding final status information to the entire cluster at one time in a topology mode by using Gossip protocol, wherein the health status changes of each node in the cluster include normal to abnormal and abnormal to normal.
4. The distributed database node survival state detection module of claim 2, wherein when the state detection module performs copy timeout detection:
each Follower node in the cluster maintains a timer for checking whether the state of the Leader node of its Raft group is normal;
the cluster state monitoring module periodically scans cluster state results through the cluster state monitoring nodes, and the cluster state monitoring nodes store a health association table between the node and corresponding Leader nodes:
(a) when the Leader node state is normal, increasing the lease period by 1, and continuously monitoring the Leader node state in the next lease period;
(b) when the state of the Leader node is abnormal, the Follower node actively applies for changing into the Candidate node, and the Raft group performs re-election operation.
5. A distributed database node survival state detection method is characterized in that the realization process comprises the following steps:
in DRDB, the Inspur Yunhai distributed database, messages are broadcast based on the Gossip protocol, and each node in the cluster receives and sends data packets containing the update messages of other nodes within one heartbeat period,
the state detection module is communicated with each node in the cluster to obtain the health state and the updating time of each node, and carries out data response according to a topological mode in a Gossip protocol, analyzes and counts the survival state of the nodes and extracts an inactive node list.
6. The method according to claim 5, wherein the status detection module provides writable and read-only functions, and each node in the cluster communicates with the status detection module orderly and periodically and writes both its health status and update time into the status detection module.
7. The method as claimed in claim 6, wherein after each timeout period, the status detection module transmits the node list of health status change in the current cluster and the corresponding final status information to the whole cluster at one time in a topology mode by using Gossip protocol,
wherein the health status changes of each node in the cluster include normal to abnormal and abnormal to normal.
8. The method for detecting the survival status of distributed database nodes according to any one of claims 5 to 7, wherein when the status detection module performs copy timeout detection:
each Follower node in the cluster maintains a timer for checking whether the state of the Leader node of its Raft group is normal;
the cluster state monitoring module periodically scans cluster state results through the cluster state monitoring nodes, and the cluster state monitoring nodes store a health association table between the node and corresponding Leader nodes:
(a) when the Leader node state is normal, increasing the lease period by 1, and continuously monitoring the Leader node state in the next lease period;
(b) when the state of the Leader node is abnormal, the Follower node actively applies for changing into the Candidate node, and the Raft group performs re-election operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011334370.0A CN112445809A (en) | 2020-11-25 | 2020-11-25 | Distributed database node survival state detection module and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112445809A (en) | 2021-03-05 |
Family
ID=74738941
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011334370.0A Pending CN112445809A (en) | 2020-11-25 | 2020-11-25 | Distributed database node survival state detection module and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112445809A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102769673A (en) * | 2012-07-25 | 2012-11-07 | 楚云汉智武汉网络存储系统有限公司 | Failure detection method suitable to large-scale storage cluster |
CN104933132A (en) * | 2015-06-12 | 2015-09-23 | 广州巨杉软件开发有限公司 | Distributed database weighted voting method based on operating sequence number |
CN106656624A (en) * | 2017-01-04 | 2017-05-10 | 合肥康捷信息科技有限公司 | Optimization method based on Gossip communication protocol and Raft election algorithm |
CN110855737A (en) * | 2019-09-24 | 2020-02-28 | 中国科学院软件研究所 | Consistency level controllable self-adaptive data synchronization method and system |
US20200145283A1 (en) * | 2017-07-12 | 2020-05-07 | Huawei Technologies Co.,Ltd. | Intra-cluster node troubleshooting method and device |
Non-Patent Citations (2)
Title |
---|
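Zhao Chunyang et al.: "Application of consensus protocols in distributed database systems" (一致性协议在分布式数据库系统中的应用), Journal of East China Normal University (Natural Science) (《华东师范大学学报(自然科学版)》) * |
Chen Lu et al.: "An improved Raft consensus algorithm and its research" (改进的Raft一致性算法及其研究), Journal of Jiangsu University of Science and Technology (Natural Science Edition) (《江苏科技大学学报(自然科学版)》) * |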
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115022157A (en) * | 2022-08-09 | 2022-09-06 | 深圳竹云科技股份有限公司 | Method, device, equipment and medium for node failover in cluster |
CN115022157B (en) * | 2022-08-09 | 2022-11-15 | 深圳竹云科技股份有限公司 | Method, device, equipment and medium for node failover in cluster |
CN115801626A (en) * | 2023-02-08 | 2023-03-14 | 华南师范大学 | Large-scale wide-area distributed cluster member failure detection method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Rost et al. | Memento: A health monitoring system for wireless sensor networks | |
WO2021128915A1 (en) | Smart device monitoring method and apparatus | |
CN106598762A (en) | Message synchronization method and system | |
US20060259525A1 (en) | Recovery method using extendible hashing-based cluster logs in shared-nothing spatial database cluster | |
US7539150B2 (en) | Node discovery and communications in a network | |
GB2410406A (en) | Status generation and heartbeat signalling for a node of a high-availability cluster | |
CN112118174B (en) | Software defined data gateway | |
CN110677282B (en) | Hot backup method of distributed system and distributed system | |
CN112445809A (en) | Distributed database node survival state detection module and method | |
CN106202075A (en) | A kind of method and device of data base's active-standby switch | |
JP2005301975A (en) | Heartbeat apparatus via remote mirroring link on multi-site and its use method | |
US20090177743A1 (en) | Device, Method and Computer Program Product for Cluster Based Conferencing | |
CN113282604B (en) | High-availability time sequence database cluster system realized based on message queue | |
CN107623703A (en) | Global transaction identifies GTID synchronous method, apparatus and system | |
CN108512753B (en) | Method and device for transmitting messages in cluster file system | |
EP3570169B1 (en) | Method and system for processing device failure | |
CN107465706B (en) | Distributed data object storage device based on wireless communication network | |
US9043274B1 (en) | Updating local database and central database | |
CN105339906A (en) | Data writing control method for persistent storage device | |
US20220358118A1 (en) | Data synchronization in edge computing networks | |
CN110716827B (en) | Hot backup method suitable for distributed system and distributed system | |
CN113905054A (en) | Kudu cluster data synchronization method, device and system based on RDMA | |
CN114328638A (en) | Service message pushing system based on database polling | |
CN114490691B (en) | Distributed system data consistency method | |
CN111464579A (en) | Message processing method and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210305 |