CN107426051B - Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system - Google Patents
Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system
- Publication number
- CN107426051B (application CN201710591183.2A)
- Authority
- CN
- China
- Prior art keywords
- node
- nodes
- network connection
- cluster system
- distributed cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
- H04L43/0811—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/50—Testing arrangements
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Environmental & Geological Engineering (AREA)
- Health & Medical Sciences (AREA)
- Cardiology (AREA)
- General Health & Medical Sciences (AREA)
- Multi Processors (AREA)
- Debugging And Monitoring (AREA)
Abstract
Embodiments of the present invention provide a method, apparatus and system for monitoring the working state of nodes in a distributed cluster system. The method includes: obtaining, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection; selecting, from the nodes, the node with the highest count; obtaining the network connection state of the selected node; when the network connection state of the selected node is reachable, judging the selected node to be a pseudo-dead (hung) node; and when the network connection state of the selected node is disconnected, generating the judgment result that the selected node is a truly dead node. The invention can identify pseudo-dead nodes promptly, effectively, reliably and quickly, which improves the stability of the cluster.
Description
Technical field
The present invention relates to the field of distributed systems, and in particular to a method, apparatus and system for monitoring the working state of nodes in a distributed cluster system.
Background
As cloud computing is widely adopted across fields and data volumes grow, distributed file systems face very high demands on scale, performance and reliability. In a large cluster, low-probability events occur frequently. Node pseudo-death (a node that hangs while still appearing alive) is one of the problems that must be solved. If a pseudo-dead node is not identified effectively and promptly, it can seriously degrade the stability and performance of the whole cluster and make upper-layer applications briefly unavailable. Pseudo-dead nodes are hard to detect, however, and an unsuitable method can also produce misjudgments.
Summary of the invention
Embodiments of the present invention provide a method, apparatus and system for monitoring the working state of nodes in a distributed cluster system, which can identify the working state of a node promptly and effectively.
To achieve this objective, the present invention adopts the following technical solutions.
A method for monitoring the working state of nodes in a distributed cluster system, including:
obtaining, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection;
selecting, from the nodes, the node with the highest count;
obtaining the network connection state of the selected node;
when the network connection state of the selected node is reachable, generating the judgment result that the selected node is a pseudo-dead node;
when the network connection state of the selected node is disconnected, generating the judgment result that the selected node is a truly dead node.
An apparatus for monitoring the working state of nodes in a distributed cluster system, including:
a first acquisition module, which obtains, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection;
a selection module, which selects, from the nodes, the node with the highest count;
a second acquisition module, which obtains the network connection state of the selected node;
a judgment module, which generates the judgment result that the selected node is a pseudo-dead node when the network connection state of the selected node is reachable, and generates the judgment result that the selected node is a truly dead node when the network connection state of the selected node is disconnected.
A system for monitoring the working state of nodes in a distributed cluster system, including: at least three nodes in the distributed cluster system, and a monitoring apparatus.
The monitoring apparatus is configured to: obtain, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection; select, from the nodes, the node with the highest count; obtain the network connection state of the selected node; generate the judgment result that the selected node is a pseudo-dead node when the network connection state of the selected node is reachable; and generate the judgment result that the selected node is a truly dead node when the network connection state of the selected node is disconnected.
As can be seen from the technical solutions provided by the embodiments above, the embodiments of the present invention solve the problem in the prior art that the working state of a node cannot be judged accurately and quickly.
Additional aspects and advantages of the present invention will be set forth in part in the description that follows; they will become apparent from that description or be learned through practice of the invention.
Description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method provided by the present invention for monitoring the working state of nodes in a distributed cluster system;
Fig. 2 is a connection diagram of an apparatus provided by the present invention for monitoring the working state of nodes in a distributed cluster system;
Fig. 3 is a connection diagram of a system provided by an embodiment of the present invention for monitoring the working state of nodes in a distributed cluster system.
Specific embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting the claims.
As shown in Fig. 1, a method of the present invention for monitoring the working state of nodes in a distributed cluster system includes:
Step 11: obtain, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection.
Step 12: select, from the nodes, the node with the highest count.
Step 13: obtain the network connection state of the selected node. Specifically, the network connection state of the node is tested with the Packet Internet Groper (PING), for example by issuing a PING command.
Step 14: when the network connection state of the selected node is reachable, generate the judgment result that the selected node is a pseudo-dead node.
Step 15: when the network connection state of the selected node is disconnected, generate the judgment result that the selected node is a truly dead node.
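Steps 12 to 15 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names `ping_reachable` and `classify_suspect`, the state strings, and the guard that skips classification when no timeouts were reported are all assumptions introduced here. The `ping` flags shown (`-c` for count, `-W` for the reply timeout in seconds) are those of the Linux iputils `ping`; other platforms differ.

```python
import subprocess
from typing import Callable, Dict, Optional, Tuple


def ping_reachable(host: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo request; True if it was answered (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify_suspect(
    timeout_counts: Dict[str, int],
    is_reachable: Callable[[str], bool] = ping_reachable,
) -> Optional[Tuple[str, str]]:
    """Pick the node with the highest timeout count and classify it by connectivity."""
    if not timeout_counts:
        return None
    # Step 12: node judged timed out most often (ties resolve to the first key seen).
    suspect = max(timeout_counts, key=timeout_counts.get)
    if timeout_counts[suspect] == 0:
        return None  # assumption: nothing to classify if nobody reported a timeout
    # Steps 13-15: reachable means pseudo-dead, unreachable means truly dead.
    state = "pseudo-dead" if is_reachable(suspect) else "truly dead"
    return suspect, state
```

Passing the probe as a parameter keeps the decision logic testable without network access; for example, `classify_suspect(counts, is_reachable=lambda host: True)` exercises the pseudo-dead branch.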
The present invention identifies pseudo-dead nodes promptly, effectively, reliably and quickly, and is relatively simple. Optionally, the method further includes:
Step 16: send the judgment result to the other nodes in the distributed cluster system, that is, all nodes except the selected node, so that those other nodes perform the corresponding handling.
In this way the present invention identifies pseudo-dead nodes promptly, effectively, reliably and quickly and allows the corresponding handling to be carried out, which improves the stability of the cluster.
Specifically, step 16 includes:
Step 161: when the judgment result is that the selected node is a pseudo-dead node, the other nodes stop distributing tasks to the pseudo-dead node, or stop waiting for the pseudo-dead node's feedback messages for tasks already distributed.
Step 162: when the judgment result is that the selected node is a truly dead node, the other nodes disconnect their connections to the truly dead node.
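How a peer might react to the judgment result in steps 161 and 162 can be sketched as below. The class `PeerNode`, its fields and the state strings are hypothetical names chosen for illustration. The text offers stopping task distribution and stopping the wait for feedback as alternatives; this sketch simply does both, which is one possible reading.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PeerNode:
    """Hypothetical peer that receives judgment results from the monitoring apparatus."""
    name: str
    connected: Set[str] = field(default_factory=set)       # open connections
    no_dispatch: Set[str] = field(default_factory=set)     # nodes excluded from new tasks
    awaiting_reply: Set[str] = field(default_factory=set)  # nodes we still expect feedback from

    def on_judgment(self, node: str, state: str) -> None:
        if node == self.name:
            return  # the result is only sent to nodes other than the selected one
        if state == "pseudo-dead":
            # Step 161: stop assigning tasks and stop waiting for feedback.
            self.no_dispatch.add(node)
            self.awaiting_reply.discard(node)
        elif state == "truly dead":
            # Step 162: drop the connection to the dead node.
            self.connected.discard(node)
        else:
            raise ValueError(f"unknown state: {state!r}")
```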
Optionally, step 11 includes:
Step 111: each node in the distributed cluster system sends a predetermined number of consecutive heartbeat requests to the other nodes at fixed intervals. For example, a node sends 2 heartbeat requests to the other nodes every 2 seconds.
Step 112: when a second node among the other nodes does not return a response message to the heartbeat request of the first node that sent it, the first node judges the second node to have timed out on heartbeat detection. In this embodiment, "first node" and "second node" merely distinguish different nodes; both are nodes to be monitored.
Step 113: based on the judgment results of the first nodes, count the number of times the second node was judged to have timed out on heartbeat detection.
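Steps 112 and 113 amount to a per-observer judgment followed by a tally per target. The sketch below assumes each observer records, for every peer, one boolean per heartbeat request in a round (True if a response came back); the function names and this data layout are illustrative assumptions, not taken from the patent.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def judge_timeouts(observer: str, replies: Dict[str, List[bool]]) -> List[Tuple[str, str]]:
    """Step 112: a target is judged timed out when none of the round's requests was answered."""
    return [
        (observer, target)
        for target, answered in replies.items()
        if answered and not any(answered)  # ignore targets that were sent no requests
    ]


def count_timeouts(judgments: Iterable[Tuple[str, str]]) -> Counter:
    """Step 113: number of times each target was judged timed out by other nodes."""
    return Counter(target for _observer, target in judgments)
```

For instance, if A and C each send two unanswered requests to B, `count_timeouts` reports a count of 2 for B.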
As shown in Fig. 2, an apparatus of the present invention for monitoring pseudo-dead nodes in a distributed cluster system includes:
a first acquisition module 21, which obtains, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection;
a selection module 22, which selects, from the nodes, the node with the highest count;
a second acquisition module 23, which obtains the network connection state of the selected node;
a judgment module 24, which generates the judgment result that the selected node is a pseudo-dead node when the network connection state of the selected node is reachable, and generates the judgment result that the selected node is a truly dead node when the network connection state of the selected node is disconnected.
The apparatus further includes:
a sending module 25, which sends the judgment result to the other nodes in the distributed cluster system, that is, all nodes except the selected node, so that those other nodes perform the corresponding handling.
The second acquisition module 23 includes:
a heartbeat timeout detection submodule 231, by which each node in the distributed cluster system sends a predetermined number of consecutive heartbeat requests to the other nodes at fixed intervals;
a judging submodule 232, by which, when a second node among the other nodes does not return a response message to the heartbeat request of the first node that sent it, the first node judges the second node to have timed out on heartbeat detection;
a statistics submodule 233, which counts, based on the judgment results of the first nodes, the number of times the second node was judged to have timed out on heartbeat detection.
As shown in Fig. 3, a system of the present invention for monitoring pseudo-dead nodes in a distributed cluster system includes:
at least three nodes 31 in the distributed cluster system, and a monitoring apparatus 32.
The monitoring apparatus may be deployed on a management node, which is a node other than the at least three nodes 31 to be monitored in the distributed cluster system.
The monitoring apparatus 32 is configured to: obtain, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection; select, from the nodes, the node with the highest count; obtain the network connection state of the selected node; generate the judgment result that the selected node is a pseudo-dead node when the network connection state of the selected node is reachable; and generate the judgment result that the selected node is a truly dead node when the network connection state of the selected node is disconnected.
The at least three nodes 31 in the distributed cluster system are configured to: send a predetermined number of consecutive heartbeat requests to the other nodes at fixed intervals; when a second node among the other nodes does not return a response message to the heartbeat request of the first node that sent it, have the first node judge the second node to have timed out on heartbeat detection; and have the first node send a message carrying the heartbeat-timeout judgment result to the monitoring apparatus.
The monitoring apparatus 32 is further configured to: count, based on the judgment-result messages from the first nodes, the number of times the second node was judged to have timed out on heartbeat detection.
An application scenario of the present invention is described below: a method for judging pseudo-dead nodes in a distributed file system, which can identify pseudo-dead nodes promptly, effectively, reliably and quickly.
The system of the present invention mainly includes the following modules:
Heartbeat timeout detection module: used by the nodes in the distributed cluster to send heartbeat messages to one another. This module is deployed on each node to be monitored.
Heartbeat timeout reporting module: used to report heartbeat timeout information to the management node (that is, the monitoring apparatus described above). This module is deployed on each node to be monitored.
Pseudo-death management module (equivalent to the monitoring apparatus described above): used to collect the reported heartbeat information and judge whether there is a pseudo-dead node.
The method of the present invention is described below.
Step 1: install the heartbeat detection module on each node to be monitored in the cluster. The heartbeat detection module sends a request to every other node in the cluster every 2 seconds. For example, if node A sends 2 consecutive heartbeat requests to node B within 4 seconds and neither is answered, the heartbeat from node A to node B is considered to have timed out.
Step 2: if a heartbeat timeout is detected, the heartbeat timeout reporting module reports it to the pseudo-death management module. The report format can be (A->B timeout).
Step 3: after receiving heartbeat timeout information, the pseudo-death management module makes a decision. It searches the reports received in the most recent 10 seconds to see which node is the target of the most heartbeat timeout reports.
Step 4: if step 3 finds that node B is the target of the most heartbeat timeout reports, the module goes on to check node B's network connection, for which the Packet Internet Groper (PING) can be used. If B's network connection is reachable, B is pseudo-dead.
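The 10-second lookup in step 3 can be pictured as a sliding window over received reports. The sketch below is illustrative only: the class name `TimeoutWindow`, the textual report form `A->B timeout`, and the requirement that reports are added in time order are assumptions made for this example. Timestamps are passed in explicitly so the behaviour does not depend on the wall clock.

```python
import re
from collections import Counter, deque
from typing import Deque, Optional, Tuple

# Assumed textual report form: "<source>-><target> timeout"
REPORT_RE = re.compile(r"^\s*(\S+?)\s*->\s*(\S+)\s+timeout\s*$")


class TimeoutWindow:
    """Keeps heartbeat-timeout reports from the most recent `window_s` seconds."""

    def __init__(self, window_s: float = 10.0) -> None:
        self.window_s = window_s
        self._reports: Deque[Tuple[float, str, str]] = deque()

    def add(self, report: str, now: float) -> None:
        """Record one report; reports must be added in non-decreasing time order."""
        match = REPORT_RE.match(report)
        if match is None:
            raise ValueError(f"unrecognized report: {report!r}")
        self._reports.append((now, match.group(1), match.group(2)))

    def most_reported(self, now: float) -> Optional[str]:
        """Return the node targeted by the most reports still inside the window."""
        # Drop reports that have aged out of the window.
        while self._reports and now - self._reports[0][0] > self.window_s:
            self._reports.popleft()
        counts = Counter(target for _ts, _source, target in self._reports)
        return counts.most_common(1)[0][0] if counts else None
```

The node returned by `most_reported` is the candidate whose network connection would then be checked in step 4.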
The present invention has the following advantages:
It solves the problem in the prior art that pseudo-dead nodes cannot be judged accurately and quickly, so that the corresponding handling can follow and the stability of the system is maintained.
In the present invention, node pseudo-death refers to the situation in which the operating system kernel of the node is still alive but some or all operations on the node respond very slowly. True node death refers to the situation in which the node has lost its network connection or lost power.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of the claims.
Claims (10)
1. A method for monitoring the working state of nodes in a distributed cluster system, characterized by including:
obtaining, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection;
selecting, from the nodes, the node with the highest count;
obtaining the network connection state of the selected node;
when the network connection state of the selected node is reachable, generating the judgment result that the selected node is a pseudo-dead node;
when the network connection state of the selected node is disconnected, generating the judgment result that the selected node is a truly dead node.
2. The method according to claim 1, characterized in that the method further includes:
sending the judgment result to the other nodes in the distributed cluster system, that is, all nodes except the selected node, so that those other nodes perform the corresponding handling.
3. The method according to claim 2, characterized in that the step in which the other nodes except the selected node perform the corresponding handling includes:
when the judgment result is that the selected node is a pseudo-dead node, the other nodes except the selected node stop distributing tasks to the pseudo-dead node, or stop waiting for the pseudo-dead node's feedback messages for tasks already distributed.
4. The method according to claim 2, characterized in that the step in which the other nodes except the selected node perform the corresponding handling includes:
when the judgment result is that the selected node is a truly dead node, the other nodes except the selected node disconnect their connections to the truly dead node.
5. The method according to claim 1, characterized in that the step of obtaining the network connection state of the selected node includes:
testing the network connection state of the node with the Packet Internet Groper (PING) to obtain the network connection state of the selected node.
6. The method according to claim 1, characterized in that the step of obtaining, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection includes:
each node in the distributed cluster system sending a predetermined number of consecutive heartbeat requests to the other nodes at fixed intervals;
when a second node among the other nodes does not return a response message to the heartbeat request of the first node that sent it, the first node judging the second node to have timed out on heartbeat detection;
counting, based on the judgment results of the first nodes, the number of times the second node was judged to have timed out on heartbeat detection.
7. An apparatus for monitoring the working state of nodes in a distributed cluster system, characterized by including:
a first acquisition module, which obtains, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection;
a selection module, which selects, from the nodes, the node with the highest count;
a second acquisition module, which obtains the network connection state of the selected node;
a judgment module, which generates the judgment result that the selected node is a pseudo-dead node when the network connection state of the selected node is reachable, and generates the judgment result that the selected node is a truly dead node when the network connection state of the selected node is disconnected.
8. The apparatus according to claim 7, characterized by further including:
a sending module, which sends the judgment result to the other nodes in the distributed cluster system, that is, all nodes except the selected node, so that those other nodes perform the corresponding handling.
9. A system for monitoring the working state of nodes in a distributed cluster system, characterized by including: at least three nodes in the distributed cluster system, and a monitoring apparatus;
the monitoring apparatus is configured to: obtain, for each node in the distributed cluster system, the number of times within a predetermined duration that other nodes judged it to have timed out on heartbeat detection; select, from the nodes, the node with the highest count; obtain the network connection state of the selected node; generate the judgment result that the selected node is a pseudo-dead node when the network connection state of the selected node is reachable; and generate the judgment result that the selected node is a truly dead node when the network connection state of the selected node is disconnected.
10. The system according to claim 9, characterized in that
the at least three nodes in the distributed cluster system are configured to: send a predetermined number of consecutive heartbeat requests to the other nodes at fixed intervals; when a second node among the other nodes does not return a response message to the heartbeat request of the first node that sent it, have the first node judge the second node to have timed out on heartbeat detection; and have the first node send a message carrying the heartbeat-timeout judgment result to the monitoring apparatus;
the monitoring apparatus is further configured to: count, based on the judgment-result messages from the first nodes, the number of times the second node was judged to have timed out on heartbeat detection.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710591183.2A CN107426051B (en) | 2017-07-19 | 2017-07-19 | Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710591183.2A CN107426051B (en) | 2017-07-19 | 2017-07-19 | Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107426051A CN107426051A (en) | 2017-12-01 |
CN107426051B CN107426051B (en) | 2018-06-05 |
Family
ID=60429811
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710591183.2A Expired - Fee Related CN107426051B (en) | Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system | 2017-07-19 | 2017-07-19 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107426051B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109803024B (en) * | 2019-01-28 | 2021-12-21 | 北京中科晶上科技股份有限公司 | Method for cluster node network |
CN111698120B (en) * | 2020-06-02 | 2022-10-18 | 浙江大华技术股份有限公司 | Storage node isolation method and device |
CN114584489A (en) * | 2022-03-08 | 2022-06-03 | 浪潮云信息技术股份公司 | Ssh channel-based remote environment information and configuration detection method and system |
CN116684256B (en) * | 2023-08-01 | 2023-11-03 | 苏州浪潮智能科技有限公司 | Node fault monitoring method, device and system, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102510343A (en) * | 2011-11-16 | 2012-06-20 | 广东新支点技术服务有限公司 | Highly available cluster system feign death solution based on both remote detection and power management |
CN102521060A (en) * | 2011-11-16 | 2012-06-27 | 广东新支点技术服务有限公司 | Pseudo halt solving method of high-availability cluster system based on watchdog local detecting technique |
CN104038366A (en) * | 2014-05-05 | 2014-09-10 | 深圳市中博科创信息技术有限公司 | Cluster node failure detection method and system |
CN105656996A (en) * | 2015-12-25 | 2016-06-08 | 北京奇虎科技有限公司 | Data node survival detection method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE60200530T2 (en) * | 2001-04-04 | 2004-09-23 | Alcatel | Mechanism and method for determining and quickly restoring a minimum capacity in a meshed network |
Also Published As
Publication number | Publication date |
---|---|
CN107426051A (en) | 2017-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107426051B (en) | The monitoring method of the working condition of distributed cluster system interior joint, apparatus and system | |
CN102111310B (en) | Method and system for monitoring content delivery network (CDN) equipment status | |
CN113038122B (en) | Fault positioning system and method based on video image diagnosis data | |
CN112398680B (en) | Fault delimiting method and equipment | |
EP2795841B1 (en) | Method and arrangement for fault analysis in a multi-layer network | |
CN110716842B (en) | Cluster fault detection method and device | |
CN101656013A (en) | Vehicle-mounted monitoring alarm terminal, system and alarm method | |
CN102740112B (en) | Method for controlling equipment polling based on video monitoring system | |
US10447561B2 (en) | BFD method and apparatus | |
CN107294767B (en) | Live broadcast network transmission fault monitoring method and system | |
CN115118581B (en) | Internet of things data all-link monitoring and intelligent guaranteeing system based on 5G | |
CN107104838A (en) | A kind of information processing method, server and terminal | |
CN108234161A (en) | For the access detection method and system of on-line off-line multitiered network framework | |
CN106789239A (en) | Towards the information application system failure trend prediction method and device of power business | |
CN103763143A (en) | Method and system for equipment abnormality alarming based on storage server | |
CN111865667A (en) | Network connectivity fault root cause positioning method and device | |
CN101252477B (en) | Determining method and analyzing apparatus of network fault root | |
US20150271045A1 (en) | Method, apparatus and system for detecting network element load imbalance | |
CN109840366B (en) | Municipal bridge state detection device | |
CN111049703A (en) | Network equipment detection method and system | |
CN111988172B (en) | Network information management platform, device and security management method | |
CN105991305A (en) | Method and device of identifying link abnormity | |
CN109699041A (en) | A kind of RRU channel failure diagnosis processing method and RRU device | |
KR102359985B1 (en) | A system and method for data collection | |
CN106803795A (en) | Video monitoring system Fault Identification based on detection frame, positioning and warning system and its method |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
PP01 | Preservation of patent right | Effective date of registration: 20181115; granted publication date: 20180605 |
PD01 | Discharge of preservation of patent | Date of cancellation: 20211115; granted publication date: 20180605 |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20180605; termination date: 20190719 |