CN107426051B - Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system - Google Patents

Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system

Info

Publication number
CN107426051B
Authority
CN
China
Prior art keywords
node
nodes
network connection
cluster system
distributed cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710591183.2A
Other languages
Chinese (zh)
Other versions
CN107426051A (en)
Inventor
张俊峰
游峰
李纲彬
金鑫鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Internet Science And Technology Ltd Of Cloud Of China
Original Assignee
Beijing Internet Science And Technology Ltd Of Cloud Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Internet Science And Technology Ltd Of Cloud Of China
Priority to CN201710591183.2A
Publication of CN107426051A
Application granted
Publication of CN107426051B
Expired - Fee Related
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0811 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking connectivity
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/50 Testing arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multi Processors (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Embodiments of the present invention provide a method, an apparatus and a system for monitoring the working state of nodes in a distributed cluster system. The monitoring method includes: obtaining, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection; selecting, from the nodes, the node with the highest count; obtaining the network connection state of the selected node; when the network connection state of the selected node is connected, judging the selected node to be a seemingly dead node; and when the network connection state of the selected node is disconnected, generating the judgment result that the selected node is a truly dead node. The invention can identify seemingly dead nodes in a timely, effective, reliable and fast manner, which improves the stability of the cluster.

Description

Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system
Technical field
The present invention relates to the field of distributed systems, and in particular to a method, an apparatus and a system for monitoring the working state of nodes in a distributed cluster system.
Background art
With the widespread application of cloud computing in many fields and the growth of data volumes, high demands are placed on the scale, performance and reliability of distributed file systems. In a large cluster, low-probability events occur frequently. Nodes becoming seemingly dead is one of the problems to be solved. If a seemingly dead node is not identified effectively and promptly, it can seriously affect the stability and performance of the whole cluster and make upper-layer applications briefly unavailable. Seemingly dead nodes are hard to detect, however, and an unsuitable method can produce misjudgments.
Summary of the invention
Embodiments of the present invention provide a method, an apparatus and a system for monitoring the working state of nodes in a distributed cluster system, which can identify the working state of a node promptly and effectively.
To achieve this object, the present invention adopts the following technical solutions.
A method for monitoring the working state of nodes in a distributed cluster system includes:
obtaining, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection;
selecting, from the nodes, the node with the highest count;
obtaining the network connection state of the selected node;
when the network connection state of the selected node is connected, generating the judgment result that the selected node is a seemingly dead node;
when the network connection state of the selected node is disconnected, generating the judgment result that the selected node is a truly dead node.
An apparatus for monitoring the working state of nodes in a distributed cluster system includes:
a first acquisition module, which obtains, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection;
a selecting module, which selects, from the nodes, the node with the highest count;
a second acquisition module, which obtains the network connection state of the selected node;
a judgment module, which, when the network connection state of the selected node is connected, generates the judgment result that the selected node is a seemingly dead node, and, when the network connection state of the selected node is disconnected, generates the judgment result that the selected node is a truly dead node.
A system for monitoring the working state of nodes in a distributed cluster system includes: at least three nodes in the distributed cluster system, and a monitoring apparatus.
The monitoring apparatus is used to: obtain, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection; select, from the nodes, the node with the highest count; obtain the network connection state of the selected node; when the network connection state of the selected node is connected, generate the judgment result that the selected node is a seemingly dead node; and when the network connection state of the selected node is disconnected, generate the judgment result that the selected node is a truly dead node.
As can be seen from the technical solutions provided by the above embodiments, the embodiments of the present invention solve the prior-art problem that the working state of a node cannot be judged accurately and quickly.
Additional aspects and advantages of the invention will be set forth in part in the following description, and will in part become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a processing flowchart of a method for monitoring the working state of nodes in a distributed cluster system provided by the present invention;
Fig. 2 is a connection diagram of an apparatus for monitoring the working state of nodes in a distributed cluster system provided by the present invention;
Fig. 3 is a connection diagram of a system for monitoring the working state of nodes in a distributed cluster system provided by an embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the invention, and are not to be construed as limiting the claims.
As shown in Fig. 1, a method for monitoring the working state of nodes in a distributed cluster system according to the present invention includes:
Step 11: obtain, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection;
Step 12: select, from the nodes, the node with the highest count;
Step 13: obtain the network connection state of the selected node. Specifically, the network connection state of the node is tested with the Packet Internet Groper (PING) to obtain the network connection state of the selected node. For example, the PING command may be used.
Step 14: when the network connection state of the selected node is connected, generate the judgment result that the selected node is a seemingly dead node;
Step 15: when the network connection state of the selected node is disconnected, generate the judgment result that the selected node is a truly dead node.
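Steps 12 to 15 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `NodeState`, `ping_probe` and `classify_suspect` are hypothetical, the `ping -c 1 -W 1` flags assume a Linux iputils `ping`, and the tie-breaking rule (first node with the highest count) is an assumption, since the text does not say how ties are resolved.

```python
import subprocess
from enum import Enum
from typing import Callable, Mapping, Optional, Tuple


class NodeState(Enum):
    SEEMINGLY_DEAD = "seemingly dead"  # network connected, heartbeats timing out
    TRULY_DEAD = "truly dead"          # network connection disconnected


def ping_probe(host: str) -> bool:
    """Return True if a single echo request is answered.

    Assumption: a Linux iputils `ping` is on PATH (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify_suspect(
    timeout_counts: Mapping[str, int],
    is_reachable: Callable[[str], bool] = ping_probe,
) -> Optional[Tuple[str, NodeState]]:
    """Pick the node with the most timeout judgments, then probe its network."""
    if not timeout_counts:
        return None  # no timeouts were reported, so there is nothing to judge
    # Step 12: node with the highest count (ties go to the first one seen).
    suspect = max(timeout_counts, key=timeout_counts.get)
    # Steps 13-15: connected -> seemingly dead, disconnected -> truly dead.
    if is_reachable(suspect):
        return suspect, NodeState.SEEMINGLY_DEAD
    return suspect, NodeState.TRULY_DEAD
```

Passing the reachability check in as a callable keeps the decision logic testable without sending real packets; for example, `classify_suspect({"A": 1, "B": 5}, is_reachable=lambda host: True)` selects node B and judges it seemingly dead.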
The invention identifies seemingly dead nodes in a timely, effective, reliable and fast manner, and is relatively simple. Optionally, the method further includes:
Step 16: send the judgment result to the other nodes in the distributed cluster system, excluding the selected node, so that those other nodes perform corresponding processing.
The invention can thus identify seemingly dead nodes in a timely, effective, reliable and fast manner and handle them accordingly, which improves the stability of the cluster.
Step 16 is specifically:
Step 161: when the judgment result is that the selected node is a seemingly dead node, the other nodes stop assigning tasks to the seemingly dead node; alternatively, they stop waiting for the seemingly dead node's feedback messages on tasks already assigned.
Step 162: when the judgment result is that the selected node is a truly dead node, the other nodes disconnect their connections to the truly dead node.
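The reaction of the other nodes in Steps 161 and 162 could look something like the sketch below. The `PeerView` structure and its field names are hypothetical. The text presents the two reactions to a seemingly dead node as alternatives; this sketch applies both, which is one possible reading rather than a requirement.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PeerView:
    """State one healthy node keeps about its peers (hypothetical structure)."""
    schedulable: Set[str] = field(default_factory=set)     # peers that may receive new tasks
    awaiting_reply: Set[str] = field(default_factory=set)  # peers with outstanding task feedback
    connected: Set[str] = field(default_factory=set)       # peers with an open connection

    def on_judgment(self, node: str, state: str) -> None:
        """Apply a judgment result received from the monitoring apparatus."""
        if state == "seemingly dead":
            # Step 161: stop assigning tasks, and stop waiting for feedback.
            self.schedulable.discard(node)
            self.awaiting_reply.discard(node)
        elif state == "truly dead":
            # Step 162: disconnect from the truly dead node.
            self.connected.discard(node)
        else:
            raise ValueError(f"unknown judgment result: {state!r}")
```

Note that a seemingly dead node stays in `connected`: its network is still up, so only scheduling and waiting are affected.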
Optionally, Step 11 includes:
Step 111: at fixed intervals, each node in the distributed cluster system sends a predetermined number of consecutive heartbeat requests to the other nodes; for example, a node sends 2 heartbeat requests to the other nodes every 2 seconds.
Step 112: when a second node among the other nodes does not return a response to the heartbeat requests to the first node that sent them, the first node judges the second node to have timed out on heartbeat detection. In this embodiment, "first node" and "second node" are used only to distinguish different nodes; both are nodes to be monitored.
Step 113: according to the judgment results of the first nodes, count the number of times the second node was judged to have timed out on heartbeat detection.
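A rough sketch of Steps 112 and 113 follows, assuming the "predetermined number" is 2 as in the example. `HeartbeatObserver` and `count_timeouts` are hypothetical names, and resetting the miss counter after each timeout judgment is an assumption the text does not spell out.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

REQUESTS_PER_ROUND = 2  # the "predetermined number" of consecutive requests (example value)


class HeartbeatObserver:
    """Runs on one node (a 'first node') and judges its peers ('second nodes')."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.missed: Dict[str, int] = defaultdict(int)  # consecutive unanswered requests per peer

    def record(self, peer: str, replied: bool) -> bool:
        """Record one heartbeat outcome; return True when a timeout is judged."""
        if replied:
            self.missed[peer] = 0
            return False
        self.missed[peer] += 1
        if self.missed[peer] >= REQUESTS_PER_ROUND:
            self.missed[peer] = 0  # assumption: start a fresh round after each judgment
            return True
        return False


def count_timeouts(judgments: List[Tuple[str, str]]) -> Counter:
    """Step 113: count, per judged node, how many timeout judgments it received.

    Each judgment is an (observer, target) pair.
    """
    return Counter(target for _observer, target in judgments)
```

With these definitions, two unanswered requests in a row from observer A to peer B produce one timeout judgment, and `count_timeouts([("A", "B"), ("C", "B")])` yields a count of 2 for node B.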
As shown in Fig. 2, an apparatus for monitoring seemingly dead nodes in a distributed cluster system according to the present invention includes:
a first acquisition module 21, which obtains, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection;
a selecting module 22, which selects, from the nodes, the node with the highest count;
a second acquisition module 23, which obtains the network connection state of the selected node;
a judgment module 24, which, when the network connection state of the selected node is connected, generates the judgment result that the selected node is a seemingly dead node, and, when the network connection state of the selected node is disconnected, generates the judgment result that the selected node is a truly dead node.
The apparatus further includes:
a sending module 25, which sends the judgment result to the other nodes in the distributed cluster system, excluding the selected node, so that those other nodes perform corresponding processing.
The second acquisition module 23 includes:
a heartbeat timeout detection submodule 231, by which each node in the distributed cluster system sends a predetermined number of consecutive heartbeat requests to the other nodes at fixed intervals;
a judging submodule 232, by which, when a second node among the other nodes does not return a response to the heartbeat requests to the first node that sent them, the first node judges the second node to have timed out on heartbeat detection;
a statistics submodule 233, which counts, according to the judgment results of the first nodes, the number of times the second node was judged to have timed out on heartbeat detection.
As shown in Fig. 3, a system for monitoring seemingly dead nodes in a distributed cluster system according to the present invention includes: at least three nodes 31 in the distributed cluster system, and a monitoring apparatus 32.
The monitoring apparatus may be deployed on a management node, which is a node other than the at least three monitored nodes 31 in the distributed cluster system.
The monitoring apparatus 32 is used to: obtain, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection; select, from the nodes, the node with the highest count; obtain the network connection state of the selected node; when the network connection state of the selected node is connected, generate the judgment result that the selected node is a seemingly dead node; and when the network connection state of the selected node is disconnected, generate the judgment result that the selected node is a truly dead node.
The at least three nodes 31 in the distributed cluster system are used to: send a predetermined number of consecutive heartbeat requests to the other nodes at fixed intervals; when a second node among the other nodes does not return a response to the heartbeat requests to the first node that sent them, have the first node judge the second node to have timed out on heartbeat detection; and have the first node send a message carrying the heartbeat timeout judgment result to the monitoring apparatus.
The monitoring apparatus 32 is further used to: count, according to the judgment result messages from the first nodes, the number of times the second node was judged to have timed out on heartbeat detection.
An application scenario of the present invention is described below: a method for judging seemingly dead nodes in a distributed file system, which can identify seemingly dead nodes in a timely, effective, reliable and fast manner.
The system of the present invention mainly includes the following modules:
a heartbeat timeout detection module, through which the nodes in the distributed cluster send heartbeat messages to one another; this module is deployed on each node to be monitored;
a heartbeat timeout reporting module, which reports heartbeat timeout information to the management node (that is, the monitoring apparatus described above); this module is deployed on each node to be monitored;
a seemingly dead node management module (equivalent to the monitoring apparatus described above), which collects the reported heartbeat information and judges whether there is a seemingly dead node.
The method of the present invention is described below.
Step 1: a heartbeat detection module is installed on each node to be monitored in the cluster. The heartbeat detection module sends a request to all other nodes in the cluster every 2 seconds. For example, if node A sends 2 consecutive heartbeat requests to node B within 4 seconds and neither is answered, the heartbeat from node A to node B is considered to have timed out.
Step 2: if a heartbeat timeout is detected, the heartbeat timeout reporting module reports this information to the seemingly dead node management module. The message format may be (A->B timeout).
Step 3: after receiving heartbeat timeout information, the seemingly dead node management module makes a decision. It searches the messages received in the last 10 seconds to find which node is the target of the most heartbeat timeout messages.
Step 4: if the search in Step 3 shows that node B is the target of the most heartbeat timeout messages, the network connection of node B is then checked. The Packet Internet Groper (PING) may be used for this test. If B's network connection is connected, B is seemingly dead.
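The 10-second search in Step 3 can be sketched as a sliding window over timeout reports. Everything named here (`TimeoutWindow`, `parse_report`, the plain-text form `A->B timeout`) is hypothetical; the text gives only the rough message format and the window length. The sketch also assumes that reports arrive in non-decreasing time order.

```python
import re
from collections import Counter, deque
from typing import Deque, Optional, Tuple

WINDOW_SECONDS = 10.0  # "the last 10 seconds" from Step 3
_REPORT = re.compile(r"^\s*(\S+?)\s*->\s*(\S+)\s+timeout\s*$")


def parse_report(text: str) -> Tuple[str, str]:
    """Parse an assumed 'A->B timeout' report into (reporter, target)."""
    match = _REPORT.match(text)
    if match is None:
        raise ValueError(f"not a timeout report: {text!r}")
    return match.group(1), match.group(2)


class TimeoutWindow:
    """Keeps recent timeout reports and finds the most-reported target node."""

    def __init__(self, window: float = WINDOW_SECONDS) -> None:
        self.window = window
        self.reports: Deque[Tuple[float, str]] = deque()  # (arrival time, target)

    def add(self, text: str, now: float) -> None:
        _reporter, target = parse_report(text)
        self.reports.append((now, target))

    def most_reported(self, now: float) -> Optional[str]:
        # Drop reports older than the window before counting.
        while self.reports and now - self.reports[0][0] > self.window:
            self.reports.popleft()
        if not self.reports:
            return None
        counts = Counter(target for _time, target in self.reports)
        return counts.most_common(1)[0][0]
```

The node returned by `most_reported` is only a suspect; per Step 4 its network connection still has to be tested before it is judged seemingly dead.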
The invention has the following advantages:
It solves the prior-art problem that seemingly dead nodes cannot be judged accurately and quickly, so that corresponding processing can follow and the stability of the system is maintained.
In the present invention, a node is seemingly dead when its operating system kernel is alive but some or all of the operations on it respond very slowly. A node is truly dead when it has lost its network connection or lost power.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of the claims.

Claims (10)

1. A method for monitoring the working state of nodes in a distributed cluster system, characterized by including:
obtaining, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection;
selecting, from the nodes, the node with the highest count;
obtaining the network connection state of the selected node;
when the network connection state of the selected node is connected, generating the judgment result that the selected node is a seemingly dead node;
when the network connection state of the selected node is disconnected, generating the judgment result that the selected node is a truly dead node.
2. The method according to claim 1, characterized in that the method further includes:
sending the judgment result to the other nodes in the distributed cluster system, excluding the selected node, so that those other nodes perform corresponding processing.
3. The method according to claim 2, characterized in that the step in which the other nodes, excluding the selected node, perform corresponding processing includes:
when the judgment result is that the selected node is a seemingly dead node, the other nodes stop assigning tasks to the seemingly dead node; alternatively, they stop waiting for the seemingly dead node's feedback messages on tasks already assigned.
4. The method according to claim 2, characterized in that the step in which the other nodes, excluding the selected node, perform corresponding processing includes:
when the judgment result is that the selected node is a truly dead node, the other nodes disconnect their connections to the truly dead node.
5. The method according to claim 1, characterized in that the step of obtaining the network connection state of the selected node includes:
testing the network connection state of the node with the Packet Internet Groper (PING) to obtain the network connection state of the selected node.
6. The method according to claim 1, characterized in that the step of obtaining, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection includes:
each node in the distributed cluster system sending a predetermined number of consecutive heartbeat requests to the other nodes at fixed intervals;
when a second node among the other nodes does not return a response to the heartbeat requests to the first node that sent them, the first node judging the second node to have timed out on heartbeat detection;
counting, according to the judgment results of the first nodes, the number of times the second node was judged to have timed out on heartbeat detection.
7. An apparatus for monitoring the working state of nodes in a distributed cluster system, characterized by including:
a first acquisition module, which obtains, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection;
a selecting module, which selects, from the nodes, the node with the highest count;
a second acquisition module, which obtains the network connection state of the selected node;
a judgment module, which, when the network connection state of the selected node is connected, generates the judgment result that the selected node is a seemingly dead node, and, when the network connection state of the selected node is disconnected, generates the judgment result that the selected node is a truly dead node.
8. The apparatus according to claim 7, characterized by further including:
a sending module, which sends the judgment result to the other nodes in the distributed cluster system, excluding the selected node, so that those other nodes perform corresponding processing.
9. A system for monitoring the working state of nodes in a distributed cluster system, characterized by including: at least three nodes in the distributed cluster system, and a monitoring apparatus;
the monitoring apparatus is used to: obtain, for each node in the distributed cluster system, the number of times within a predetermined period that the node was judged by other nodes to have timed out on heartbeat detection; select, from the nodes, the node with the highest count; obtain the network connection state of the selected node; when the network connection state of the selected node is connected, generate the judgment result that the selected node is a seemingly dead node; and when the network connection state of the selected node is disconnected, generate the judgment result that the selected node is a truly dead node.
10. The system according to claim 9, characterized in that
the at least three nodes in the distributed cluster system are used to: send a predetermined number of consecutive heartbeat requests to the other nodes at fixed intervals; when a second node among the other nodes does not return a response to the heartbeat requests to the first node that sent them, have the first node judge the second node to have timed out on heartbeat detection; and have the first node send a message carrying the heartbeat timeout judgment result to the monitoring apparatus;
the monitoring apparatus is further used to: count, according to the judgment result messages from the first nodes, the number of times the second node was judged to have timed out on heartbeat detection.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710591183.2A CN107426051B (en) 2017-07-19 2017-07-19 Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710591183.2A CN107426051B (en) 2017-07-19 2017-07-19 Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system

Publications (2)

Publication Number Publication Date
CN107426051A CN107426051A (en) 2017-12-01
CN107426051B true CN107426051B (en) 2018-06-05

Family

ID=60429811

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710591183.2A Expired - Fee Related CN107426051B (en) 2017-07-19 2017-07-19 Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system

Country Status (1)

Country Link
CN (1) CN107426051B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109803024B (en) * 2019-01-28 2021-12-21 北京中科晶上科技股份有限公司 Method for cluster node network
CN111698120B (en) * 2020-06-02 2022-10-18 浙江大华技术股份有限公司 Storage node isolation method and device
CN114584489A (en) * 2022-03-08 2022-06-03 浪潮云信息技术股份公司 Ssh channel-based remote environment information and configuration detection method and system
CN116684256B (en) * 2023-08-01 2023-11-03 苏州浪潮智能科技有限公司 Node fault monitoring method, device and system, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102510343A (en) * 2011-11-16 2012-06-20 广东新支点技术服务有限公司 Highly available cluster system feign death solution based on both remote detection and power management
CN102521060A (en) * 2011-11-16 2012-06-27 广东新支点技术服务有限公司 Pseudo halt solving method of high-availability cluster system based on watchdog local detecting technique
CN104038366A (en) * 2014-05-05 2014-09-10 深圳市中博科创信息技术有限公司 Cluster node failure detection method and system
CN105656996A (en) * 2015-12-25 2016-06-08 北京奇虎科技有限公司 Data node survival detection method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE60200530T2 (en) * 2001-04-04 2004-09-23 Alcatel Mechanism and method for determining and quickly restoring a minimum capacity in a meshed network

Also Published As

Publication number Publication date
CN107426051A (en) 2017-12-01

Similar Documents

Publication Publication Date Title
CN107426051B (en) The monitoring method of the working condition of distributed cluster system interior joint, apparatus and system
CN102111310B (en) Method and system for monitoring content delivery network (CDN) equipment status
CN113038122B (en) Fault positioning system and method based on video image diagnosis data
CN112398680B (en) Fault delimiting method and equipment
EP2795841B1 (en) Method and arrangement for fault analysis in a multi-layer network
CN110716842B (en) Cluster fault detection method and device
CN101656013A (en) Vehicle-mounted monitoring alarm terminal, system and alarm method
CN102740112B (en) Method for controlling equipment polling based on video monitoring system
US10447561B2 (en) BFD method and apparatus
CN107294767B (en) Live broadcast network transmission fault monitoring method and system
CN115118581B (en) Internet of things data all-link monitoring and intelligent guaranteeing system based on 5G
CN107104838A (en) A kind of information processing method, server and terminal
CN108234161A (en) For the access detection method and system of on-line off-line multitiered network framework
CN106789239A (en) Towards the information application system failure trend prediction method and device of power business
CN103763143A (en) Method and system for equipment abnormality alarming based on storage server
CN111865667A (en) Network connectivity fault root cause positioning method and device
CN101252477B (en) Determining method and analyzing apparatus of network fault root
US20150271045A1 (en) Method, apparatus and system for detecting network element load imbalance
CN109840366B (en) Municipal bridge state detection device
CN111049703A (en) Network equipment detection method and system
CN111988172B (en) Network information management platform, device and security management method
CN105991305A (en) Method and device of identifying link abnormity
CN109699041A (en) A kind of RRU channel failure diagnosis processing method and RRU device
KR102359985B1 (en) A system and method for data collection
CN106803795A (en) Video monitoring system Fault Identification based on detection frame, positioning and warning system and its method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PP01 Preservation of patent right
Effective date of registration: 20181115
Granted publication date: 20180605
PD01 Discharge of preservation of patent
Date of cancellation: 20211115
Granted publication date: 20180605
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20180605
Termination date: 20190719