CN107426051B - Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system

Info

Publication number: CN107426051B
Application number: CN201710591183.2A
Authority: CN
Inventors: 张俊峰; 游峰; 李纲彬; 金鑫鑫
Original assignee: Beijing Internet Science And Technology Ltd Of Cloud Of China
Current assignee: Beijing Internet Science And Technology Ltd Of Cloud Of China
Priority date: 2017-07-19
Filing date: 2017-07-19
Publication date: 2018-06-05
Anticipated expiration: 2037-07-19
Also published as: CN107426051A

Abstract

Embodiments of the present invention provide a method, apparatus and system for monitoring the working state of nodes in a distributed cluster system. The method includes: obtaining, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection; selecting, from the nodes, the node with the highest count; obtaining the network connection state of the selected node; when the network connection state of the selected node is unobstructed, judging the selected node to be a seemingly dead node; and when the network connection state of the selected node is disconnected, generating the judgment result that the selected node is a truly dead node. The invention can identify seemingly dead nodes in a timely, effective, reliable and rapid manner, improving the stability of the cluster.

Description

Method, apparatus and system for monitoring the working state of nodes in a distributed cluster system

Technical field

The present invention relates to the field of distributed systems, and in particular to a method, apparatus and system for monitoring the working state of nodes in a distributed cluster system.

Background

With the wide application of cloud computing in various fields and the growth of data volume, very high demands are placed on the scale, performance and reliability of distributed file systems. In a large-scale cluster, low-probability events occur frequently. Seeming death of a node is one of the problems that must be solved. If a seemingly dead node cannot be identified effectively and promptly, it can seriously affect the stability and performance of the whole cluster and cause upper-layer applications to become briefly unavailable. Seemingly dead nodes are, however, difficult to detect, and an unsuitable method can produce misjudgments.

Summary of the invention

Embodiments of the present invention provide a method, apparatus and system for monitoring the working state of nodes in a distributed cluster system, which can identify the working state of a node promptly and effectively.

To achieve the above object, the present invention adopts the following technical solutions.

A method for monitoring the working state of nodes in a distributed cluster system, including:

obtaining, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection;

selecting, from the nodes, the node with the highest count;

obtaining the network connection state of the selected node;

when the network connection state of the selected node is unobstructed, generating the judgment result that the selected node is a seemingly dead node;

when the network connection state of the selected node is disconnected, generating the judgment result that the selected node is a truly dead node.

An apparatus for monitoring the working state of nodes in a distributed cluster system, including:

a first acquisition module, which obtains, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection;

a selection module, which selects, from the nodes, the node with the highest count;

a second acquisition module, which obtains the network connection state of the selected node;

a judgment module, which, when the network connection state of the selected node is unobstructed, generates the judgment result that the selected node is a seemingly dead node, and, when the network connection state of the selected node is disconnected, generates the judgment result that the selected node is a truly dead node.

A system for monitoring the working state of nodes in a distributed cluster system, including: at least three nodes in the distributed cluster system, and a monitoring apparatus.

The monitoring apparatus is configured to: obtain, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection; select, from the nodes, the node with the highest count; obtain the network connection state of the selected node; when the network connection state of the selected node is unobstructed, generate the judgment result that the selected node is a seemingly dead node; and when the network connection state of the selected node is disconnected, generate the judgment result that the selected node is a truly dead node.

As can be seen from the technical solutions provided by the above embodiments of the present invention, the embodiments solve the problem in the prior art that the working state of a node cannot be judged accurately and quickly.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and will in part become apparent from that description or be learned through practice of the invention.

Description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for monitoring the working state of nodes in a distributed cluster system provided by the present invention;

Fig. 2 is a connection diagram of an apparatus for monitoring the working state of nodes in a distributed cluster system provided by the present invention;

Fig. 3 is a connection diagram of a system for monitoring the working state of nodes in a distributed cluster system provided by an embodiment of the present invention.

Detailed description

Embodiments of the present invention are described in detail below, examples of which are shown in the drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting the claims.

As shown in Fig. 1, a method for monitoring the working state of nodes in a distributed cluster system according to the present invention includes:

Step 11, obtaining, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection;

Step 12, selecting, from the nodes, the node with the highest count;

Step 13, obtaining the network connection state of the selected node. Specifically, the network connection state of the node is tested with the Packet Internet Groper (PING) to obtain the network connection state of the selected node; for example, the PING command may be used.

Step 14, when the network connection state of the selected node is unobstructed, generating the judgment result that the selected node is a seemingly dead node;

Step 15, when the network connection state of the selected node is disconnected, generating the judgment result that the selected node is a truly dead node.
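Steps 11 to 15 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Verdict`, `judge_node` and `ping_reachable` are hypothetical, the timeout judgments are assumed to arrive as a list of accused node names, and the `ping` flags assume a Linux `ping` binary.

```python
import subprocess
from collections import Counter
from enum import Enum
from typing import Callable, Iterable, Optional, Tuple


class Verdict(Enum):
    SEEMINGLY_DEAD = "seemingly dead"  # network reachable, heartbeats unanswered
    TRULY_DEAD = "truly dead"          # network connection is down


def ping_reachable(host: str) -> bool:
    """Probe a host with one ICMP echo (assumes Linux `ping -c/-W` flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def judge_node(
    accused: Iterable[str],
    is_reachable: Callable[[str], bool] = ping_reachable,
) -> Optional[Tuple[str, Verdict]]:
    """Select the node with the most timeout judgments and classify it."""
    counts = Counter(accused)              # step 11: node -> timeout count
    if not counts:
        return None                        # nothing was reported in the window
    suspect, _ = counts.most_common(1)[0]  # step 12: highest count
    if is_reachable(suspect):              # step 13: probe the network
        return suspect, Verdict.SEEMINGLY_DEAD  # step 14
    return suspect, Verdict.TRULY_DEAD          # step 15
```

For example, `judge_node(["B", "B", "C"], lambda host: True)` returns `("B", Verdict.SEEMINGLY_DEAD)`. The probe is passed as a parameter so that the selection logic can be exercised without network access.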

The present invention can identify seemingly dead nodes in a timely, effective, reliable and rapid manner, and is relatively simple. Optionally, the method further includes:

Step 16, sending the judgment result to the other nodes in the distributed cluster system, other than the selected node, so that those other nodes perform corresponding processing.

The present invention can thus identify seemingly dead nodes in a timely, effective, reliable and rapid manner and handle them accordingly, improving the stability of the cluster.

Step 16 specifically includes:

Step 161, when the judgment result is that the selected node is a seemingly dead node, the other nodes, other than the selected node, stop assigning tasks to the seemingly dead node; or, they stop waiting for the seemingly dead node's feedback messages on tasks already assigned.

Step 162, when the judgment result is that the selected node is a truly dead node, the other nodes, other than the selected node, disconnect from the truly dead node.
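One way the other nodes might act on the judgment result in steps 161 and 162 is sketched below. The `PeerNode` class, its fields and the string verdicts are assumptions made for illustration; the source states only which actions are taken, not how they are implemented.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class PeerNode:
    """Hypothetical view that one node keeps of its peers."""
    schedulable: Set[str] = field(default_factory=set)  # peers that may get tasks
    awaiting: Set[str] = field(default_factory=set)     # peers we wait on for feedback
    connected: Set[str] = field(default_factory=set)    # peers with an open connection

    def handle_verdict(self, node: str, verdict: str) -> None:
        if verdict == "seemingly dead":
            # Step 161: the source lists these two actions as alternatives;
            # this sketch applies both for simplicity.
            self.schedulable.discard(node)  # stop assigning tasks
            self.awaiting.discard(node)     # stop waiting for feedback
        elif verdict == "truly dead":
            # Step 162: drop the connection to the truly dead node.
            self.connected.discard(node)
        else:
            raise ValueError(f"unknown verdict: {verdict!r}")
```

Keeping the three sets separate mirrors the text: a seemingly dead node stays connected but receives no new work, while a truly dead node is disconnected.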

Optionally, step 11 includes:

Step 111, each node in the distributed cluster system continuously sends a predetermined number of heartbeat requests to the other nodes at fixed intervals; for example, a node sends 2 heartbeat requests to the other nodes every 2 seconds.

Step 112, when a second node among the other nodes does not return a response message to the heartbeat request of a first node that sent it, the first node judges the second node to have timed out in heartbeat detection. In this embodiment, "first node" and "second node" serve only to distinguish different nodes; both are nodes to be monitored.

Step 113, according to the judgment results of the first nodes, counting the number of times the second node is judged to have timed out in heartbeat detection.
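Steps 111 to 113 can be illustrated with a small simulation. The function names and the `send` callback are hypothetical; in a real cluster `send` would issue a network request with a timeout and the bursts would be spaced by the fixed interval, which is omitted here so that the example runs instantly.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


def heartbeat_timed_out(
    send: Callable[[str, str], bool],
    first: str,
    second: str,
    requests_per_round: int = 2,
) -> bool:
    """Steps 111-112: `first` sends a burst of heartbeat requests to `second`.

    `send(first, second)` returns True when a response came back. The second
    node is judged timed out only if every request in the burst went unanswered.
    """
    # Send every request in the burst, then inspect the replies.
    replies = [send(first, second) for _ in range(requests_per_round)]
    return not any(replies)


def count_timeouts(
    nodes: Iterable[str],
    send: Callable[[str, str], bool],
) -> Dict[str, int]:
    """Step 113: tally how often each node is judged timed out by its peers."""
    nodes = list(nodes)
    counts: Counter = Counter()
    for first in nodes:
        for second in nodes:
            if first != second and heartbeat_timed_out(send, first, second):
                counts[second] += 1
    return dict(counts)
```

With three nodes where B never answers, `count_timeouts(["A", "B", "C"], lambda a, b: b != "B")` returns `{"B": 2}`, since both A and C judge B to have timed out.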

As shown in Fig. 2, an apparatus for monitoring seemingly dead nodes in a distributed cluster system according to the present invention includes:

a first acquisition module 21, which obtains, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection;

a selection module 22, which selects, from the nodes, the node with the highest count;

a second acquisition module 23, which obtains the network connection state of the selected node;

a judgment module 24, which, when the network connection state of the selected node is unobstructed, generates the judgment result that the selected node is a seemingly dead node, and, when the network connection state of the selected node is disconnected, generates the judgment result that the selected node is a truly dead node.

The apparatus further includes:

a sending module 25, which sends the judgment result to the other nodes in the distributed cluster system, other than the selected node, so that those other nodes perform corresponding processing.

The second acquisition module 23 includes:

a heartbeat timeout detection submodule 231, used by each node in the distributed cluster system to continuously send a predetermined number of heartbeat requests to the other nodes at fixed intervals;

a judgment submodule 232, by which, when a second node among the other nodes does not return a response message to the heartbeat request of a first node that sent it, the first node judges the second node to have timed out in heartbeat detection;

a statistics submodule 233, which, according to the judgment results of the first nodes, counts the number of times the second node is judged to have timed out in heartbeat detection.

As shown in Fig. 3, a system for monitoring seemingly dead nodes in a distributed cluster system according to the present invention includes: at least three nodes 31 in the distributed cluster system, and a monitoring apparatus 32.

The monitoring apparatus may be arranged in a management node, the management node being a node other than the at least three nodes 31 to be monitored in the distributed cluster system.

The monitoring apparatus 32 is configured to: obtain, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection; select, from the nodes, the node with the highest count; obtain the network connection state of the selected node; when the network connection state of the selected node is unobstructed, generate the judgment result that the selected node is a seemingly dead node; and when the network connection state of the selected node is disconnected, generate the judgment result that the selected node is a truly dead node.

The at least three nodes 31 in the distributed cluster system are configured to: continuously send a predetermined number of heartbeat requests to the other nodes at fixed intervals; when a second node among the other nodes does not return a response message to the heartbeat request of a first node that sent it, the first node judges the second node to have timed out in heartbeat detection; and the first node sends a message carrying the heartbeat timeout judgment result to the monitoring apparatus.

The monitoring apparatus 32 is further configured to: according to the judgment result messages from the first nodes, count the number of times the second node is judged to have timed out in heartbeat detection.

An application scenario of the present invention is described below: a method for judging seemingly dead nodes in a distributed file system, which can identify seemingly dead nodes in a timely, effective, reliable and rapid manner.

The system of the present invention mainly includes the following modules:

a heartbeat timeout detection module, used by the nodes in the distributed cluster to send heartbeat messages to one another; this module is arranged on each node to be monitored;

a heartbeat timeout reporting module, used to report heartbeat timeout information to the management node (i.e. the monitoring apparatus described above); this module is arranged on each node to be monitored;

a seemingly dead management module (equivalent to the monitoring apparatus described above), used to collect the reported heartbeat information and judge whether there is a seemingly dead node.

The method of the present invention is described below.

Step 1: A heartbeat detection module is installed on each node to be monitored in the cluster. The heartbeat detection module sends a request to all other nodes in the cluster every 2 seconds. For example, if node A sends 2 consecutive heartbeat requests to node B within 4 seconds and neither is answered, a heartbeat timeout from node A to node B is considered to have occurred.

Step 2: If a heartbeat timeout is detected, the heartbeat timeout reporting module reports this information to the seemingly dead management module. The information format may be (A->B timeout).

Step 3: After receiving heartbeat timeout information, the seemingly dead management module makes a decision. It searches the information received within the most recent 10 seconds to find which node is the target of the most heartbeat timeout reports.

Step 4: If step 3 finds that the most heartbeat timeout reports point to node B, the network connection of node B is then checked. The Packet Internet Groper (PING) may be used for this test. If the network connection of B is unobstructed, B is seemingly dead.
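The decision flow of steps 2 to 4 might look like the following on the management node. The class name `SeeminglyDeadManager`, the report string format `"A->B timeout"`, the explicit `now` timestamps and the injected `is_reachable` probe are assumptions for illustration; the 10-second window comes from the example above.

```python
from collections import Counter, deque
from typing import Callable, Deque, Optional, Tuple


class SeeminglyDeadManager:
    """Hypothetical management module that keeps recent timeout reports."""

    def __init__(self, is_reachable: Callable[[str], bool], window: float = 10.0):
        self.is_reachable = is_reachable
        self.window = window
        # (time, accused node); reports are assumed to arrive in time order.
        self.reports: Deque[Tuple[float, str]] = deque()

    def report(self, message: str, now: float) -> None:
        """Step 2: accept a report such as 'A->B timeout'."""
        pair = message.split()[0]              # 'A->B'
        _reporter, accused = pair.split("->")
        self.reports.append((now, accused))

    def decide(self, now: float) -> Optional[Tuple[str, bool]]:
        """Steps 3-4: return (node, seemingly_dead) for the most-accused node."""
        # Drop reports that fall outside the most recent window.
        while self.reports and now - self.reports[0][0] > self.window:
            self.reports.popleft()
        counts = Counter(accused for _, accused in self.reports)
        if not counts:
            return None
        suspect, _ = counts.most_common(1)[0]
        # Reachable on the network but timing out on heartbeats: seemingly dead.
        return suspect, self.is_reachable(suspect)
```

Timestamps are passed in explicitly rather than read from a clock so that the window behaviour is easy to test; a deployment would more likely use a monotonic clock.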

The present invention has the following beneficial effects:

It solves the problem in the prior art that seemingly dead nodes cannot be judged accurately and quickly, so that corresponding processing can subsequently be carried out and the stability of the system maintained.

A seemingly dead node, as the term is used in the present invention, is one whose operating system kernel is alive but on which the responses of some or all operations have become very slow. A truly dead node is one that has lost its network connection or its power.

The foregoing is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. The protection scope of the present invention shall therefore be subject to the protection scope of the claims.

Claims

1. A method for monitoring the working state of nodes in a distributed cluster system, characterized by including:

obtaining, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection;

selecting, from the nodes, the node with the highest count;

obtaining the network connection state of the selected node;

when the network connection state of the selected node is unobstructed, generating the judgment result that the selected node is a seemingly dead node;

when the network connection state of the selected node is disconnected, generating the judgment result that the selected node is a truly dead node.

2. The method according to claim 1, characterized in that the method further includes:

sending the judgment result to the other nodes in the distributed cluster system, other than the selected node, so that those other nodes perform corresponding processing.

3. The method according to claim 2, characterized in that the step in which the other nodes, other than the selected node, perform corresponding processing includes:

when the judgment result is that the selected node is a seemingly dead node, the other nodes, other than the selected node, stop assigning tasks to the seemingly dead node; or, they stop waiting for the seemingly dead node's feedback messages on tasks already assigned.

4. The method according to claim 2, characterized in that the step in which the other nodes, other than the selected node, perform corresponding processing includes:

when the judgment result is that the selected node is a truly dead node, the other nodes, other than the selected node, disconnect from the truly dead node.

5. The method according to claim 1, characterized in that the step of obtaining the network connection state of the selected node includes:

testing the network connection state of the node with the Packet Internet Groper (PING) to obtain the network connection state of the selected node.

6. The method according to claim 1, characterized in that the step of obtaining, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection includes:

each node in the distributed cluster system continuously sending a predetermined number of heartbeat requests to the other nodes at fixed intervals;

when a second node among the other nodes does not return a response message to the heartbeat request of a first node that sent it, the first node judging the second node to have timed out in heartbeat detection;

according to the judgment results of the first nodes, counting the number of times the second node is judged to have timed out in heartbeat detection.

7. An apparatus for monitoring the working state of nodes in a distributed cluster system, characterized by including:

a first acquisition module, which obtains, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection;

a selection module, which selects, from the nodes, the node with the highest count;

a judgment module, which, when the network connection state of the selected node is unobstructed, generates the judgment result that the selected node is a seemingly dead node, and, when the network connection state of the selected node is disconnected, generates the judgment result that the selected node is a truly dead node.

8. The apparatus according to claim 7, characterized by further including:

a sending module, which sends the judgment result to the other nodes in the distributed cluster system, other than the selected node, so that those other nodes perform corresponding processing.

9. A system for monitoring the working state of nodes in a distributed cluster system, characterized by including: at least three nodes in the distributed cluster system, and a monitoring apparatus;

the monitoring apparatus is configured to: obtain, for each node in the distributed cluster system, the number of times within a predetermined duration that the node is judged by other nodes to have timed out in heartbeat detection; select, from the nodes, the node with the highest count; obtain the network connection state of the selected node; when the network connection state of the selected node is unobstructed, generate the judgment result that the selected node is a seemingly dead node; and when the network connection state of the selected node is disconnected, generate the judgment result that the selected node is a truly dead node.

10. The system according to claim 9, characterized in that:

the at least three nodes in the distributed cluster system are configured to: continuously send a predetermined number of heartbeat requests to the other nodes at fixed intervals; when a second node among the other nodes does not return a response message to the heartbeat request of a first node that sent it, the first node judges the second node to have timed out in heartbeat detection; and the first node sends a message carrying the heartbeat timeout judgment result to the monitoring apparatus;

the monitoring apparatus is further configured to: according to the judgment result messages from the first nodes, count the number of times the second node is judged to have timed out in heartbeat detection.