CN110740064A

CN110740064A - Distributed cluster node fault processing method, device, equipment and storage medium

Info

Publication number: CN110740064A
Application number: CN201911025111.7A
Authority: CN
Inventors: 张大帅
Original assignee: Beijing Inspur Data Technology Co Ltd
Current assignee: Beijing Inspur Data Technology Co Ltd
Priority date: 2019-10-25
Filing date: 2019-10-25
Publication date: 2020-01-31

Abstract

The invention discloses a distributed cluster node fault processing method comprising the following steps: sending multicast requests to the agent services pre-deployed on each node in a distributed storage cluster; when it is determined that a node has not responded to the multicast request, determining that node to be a failed node; and clearing the relevant authentication information of the failed node in the distributed storage cluster.

Description

Distributed cluster node fault processing method, device, equipment and storage medium

Technical Field

The present invention relates to the field of distributed storage technologies, and in particular to a method, an apparatus, a device, and a computer-readable storage medium for handling distributed cluster node faults.

Background

A distributed storage cluster system generally includes a plurality of storage servers that together form a cluster system providing services to the outside; these servers are also referred to as "nodes". Each distributed storage cluster generally has a main monitoring node (called the master node for short), which monitors the state of the storage cluster.

In summary, how to effectively solve the problem that data reconstruction triggered by a node going down degrades the overall performance of the cluster, and thereby affects the normal service operation of clients, is an urgent problem for those skilled in the art.

Disclosure of Invention

The invention aims to provide a distributed cluster node fault processing method that avoids data reconstruction caused by node downtime, greatly reduces the impact on the overall performance of the cluster, and ensures the normal service operation of clients. The invention also aims to provide a distributed cluster node fault processing apparatus, a device, and a computer-readable storage medium.

In order to solve the technical problems, the invention provides the following technical scheme:

A distributed cluster node fault handling method, comprising:

respectively sending multicast requests to agent services pre-deployed by each node in the distributed storage cluster;

determining a node that does not respond to the multicast request as a failed node when it is determined that there is a node that does not respond to the multicast request;

and clearing the relevant authentication information of the fault node in the distributed storage cluster.

In one embodiment of the present invention, after clearing the relevant authentication information of the failed node in the distributed storage cluster, the method further includes:

adding fault identification information after the serial number (SN) corresponding to the failed node.

In one embodiment of the present invention, after adding the fault identification information after the SN corresponding to the failed node, the method further includes:

when a cluster joining request is received, detecting whether the fault identification information exists after the SN of the node to be joined;

and if so, clearing the original cluster service information in the node to be joined, and adding the node to be joined, with its original cluster service information cleared, to the distributed storage cluster.

In one embodiment of the present invention, determining a node that does not respond to the multicast request as a failed node when it is determined that such a node exists includes:

when it is determined that a node has failed to respond to the multicast request for a preset number of consecutive times, determining that node to be a failed node.

A distributed cluster node fault handling apparatus, comprising:

the request sending module is used for respectively sending multicast requests to the agent services pre-deployed by each node in the distributed storage cluster;

a failed node determination module, configured to determine a node that does not respond to the multicast request as a failed node when it is determined that there is a node that does not respond to the multicast request;

and the authentication information clearing module is used for clearing the relevant authentication information of the fault node in the distributed storage cluster.

In one embodiment of the present invention, the apparatus further comprises:

an identification information adding module, configured to add fault identification information after the serial number (SN) corresponding to the failed node, after the relevant authentication information of the failed node in the distributed storage cluster has been cleared.

In one embodiment of the present invention, the apparatus further comprises:

an identification information detection module, configured to detect, when a cluster joining request is received after the fault identification information has been added after the SN corresponding to the failed node, whether the fault identification information exists after the SN of the node to be joined;

and a node adding module, configured to clear the original cluster service information in the node to be joined when it is detected that the fault identification information exists after the SN of the node to be joined, and to add the node to be joined, with its original cluster service information cleared, to the distributed storage cluster.

In one specific embodiment of the present invention, the failed node determination module is specifically a module that, when it is determined that a node has failed to respond to the multicast request for a preset number of consecutive times, determines that node to be a failed node.

A distributed cluster node fault handling device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the distributed cluster node fault handling method as described above when executing the computer program.

A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the distributed cluster node fault handling method as set out above.

With the method provided by the embodiments of the invention, multicast requests are sent to the agent services pre-deployed on each node in the distributed storage cluster; when it is determined that a node has not responded to the multicast request, that node is determined to be a failed node; and the relevant authentication information of the failed node in the distributed storage cluster is cleared. Because an agent service is pre-deployed on each node in the distributed storage cluster, a failed node can be detected promptly from each node's response, through its agent service, to the multicast request, and the relevant authentication information of the failed node can be cleared promptly. The failed node is thereby removed from the distributed storage cluster in time, data reconstruction caused by node downtime is avoided, the impact on the overall performance of the cluster is greatly reduced, and the normal service operation of clients is ensured.

Correspondingly, embodiments of the present invention further provide a distributed cluster node fault processing apparatus, a device, and a computer-readable storage medium corresponding to the distributed cluster node fault processing method, which have the above technical effects and are not described herein again.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of one implementation of the distributed cluster node fault handling method in an embodiment of the present invention;

Fig. 2 is a flowchart of another implementation of the distributed cluster node fault handling method in an embodiment of the present invention;

Fig. 3 is a block diagram of a distributed cluster node fault handling apparatus according to an embodiment of the present invention;

Fig. 4 is a block diagram of a distributed cluster node fault handling device according to an embodiment of the present invention.

Detailed Description

For a better understanding of the present invention, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention.

Embodiment 1:

Referring to Fig. 1, Fig. 1 is a flowchart of one implementation of the distributed cluster node fault handling method in an embodiment of the present invention. The method may include the following steps:

S101: Send multicast requests to the agent services pre-deployed on each node in the distributed storage cluster.

A detection service (master) may be pre-deployed on the master node of the distributed storage cluster, and an agent service (agent) may be pre-deployed on each node of the distributed storage cluster. The detection service may send multicast requests to the agent services in real time or at preset time intervals.

The multicast request may be a handshake request.

It should be noted that, when the detection service sends the multicast request to the agent services at a preset time interval, the interval may be set and adjusted according to the actual situation and is not limited in this embodiment of the present invention; for example, it may be set to 15 s.
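One round of the handshake probe described in S101 could be sketched in Python as follows, assuming the multicast request is carried over UDP. The group address, port, payload, wait time, and function names are illustrative assumptions that do not appear in the patent; only the "OK" reply and the example 15 s interval come from the description.

```python
import socket
import time

# Hypothetical values for illustration only; the patent does not specify them.
MULTICAST_GROUP = "239.1.1.1"
MULTICAST_PORT = 50000
HANDSHAKE = b"HANDSHAKE"
PROBE_INTERVAL_SECONDS = 15  # example interval mentioned in the description


def collect_replies(sock, wait_seconds):
    """Return the source addresses of agents that answered b"OK" in time."""
    responders = set()
    deadline = time.monotonic() + wait_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        sock.settimeout(remaining)
        try:
            data, (addr, _port) = sock.recvfrom(1024)
        except socket.timeout:
            break
        if data.strip() == b"OK":
            responders.add(addr)
    return responders


def probe_once(wait_seconds=2.0):
    """Send one multicast handshake and gather the agents' replies."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        # Keep the datagram on the local network segment.
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.sendto(HANDSHAKE, (MULTICAST_GROUP, MULTICAST_PORT))
        return collect_replies(sock, wait_seconds)
    finally:
        sock.close()
```

A detection service built this way would call `probe_once()` once per interval and compare the returned addresses with the list of known cluster nodes.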

S102: when it is determined that there is a node that does not respond to the multicast request, the node that does not respond to the multicast request is determined as a failed node.

After the multicast request is sent to the agent service pre-deployed on each node in the distributed storage cluster, it can be detected whether each node responds to the multicast request. For example, when a node returns an "OK" reply through its agent service, the node is normal; when a node does not respond to the multicast request, the node has a problem. When it is determined that a node has not responded to the multicast request, that node may be determined to be a failed node.
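The check in S102 amounts to comparing the known cluster nodes with the nodes that replied. A minimal Python sketch, in which the function name and the node identifiers are hypothetical:

```python
def find_unresponsive(cluster_nodes, responders):
    """Return the probed nodes that sent no reply, in sorted order."""
    return sorted(set(cluster_nodes) - set(responders))


# Example: node2 did not return "OK" in this round.
suspects = find_unresponsive(["node1", "node2", "node3"], {"node1", "node3"})
```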

S103: Clear the relevant authentication information of the failed node in the distributed storage cluster.

After the failed node is determined, the node is down and cannot be communicated with, so the cluster services on it, such as the monitor service (MON) that monitors the state of the storage cluster and the Object Storage Device (OSD) data storage service, cannot be cleared on the node itself. Therefore, the relevant authentication information of the failed node in the distributed storage cluster, such as that of MON and OSD, is cleared first. For example, if the failed node is named noden, the relevant authentication information can be cleared through the command cluster auth del. Detecting the failed node automatically also reduces operation and maintenance cost.
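The clearing step can be illustrated with an in-memory stand-in for the cluster's authentication records. This is only a sketch: the dictionary layout, the entity names such as "osd.3", and the function name are assumptions, and a real cluster would use its own command (such as the one named above) rather than this code.

```python
def clear_node_auth(auth_entries, failed_node):
    """Remove every authentication entry that belongs to failed_node.

    auth_entries maps a hypothetical entity name (for example "osd.3")
    to the name of the node hosting it. The dictionary is modified in
    place, and the removed entity names are returned in sorted order.
    """
    removed = sorted(
        entity for entity, host in auth_entries.items() if host == failed_node
    )
    for entity in removed:
        del auth_entries[entity]
    return removed


auth = {"mon.noden": "noden", "osd.3": "noden", "osd.4": "nodem"}
removed = clear_node_auth(auth, "noden")
```

After the call, only the entries of healthy nodes remain, which mirrors the effect of removing the failed node's MON and OSD authentication information from the cluster.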

It should be noted that, based on the above embodiment, the embodiments of the present invention further provide a corresponding improved scheme. Steps in the subsequent embodiment that are the same as or correspond to those in the above embodiment, and the corresponding beneficial effects, may be cross-referenced and are not described again in detail in the improved embodiment below.

Referring to Fig. 2, Fig. 2 is a flowchart of another implementation of the distributed cluster node fault handling method in an embodiment of the present invention. The method may include the following steps:

S201: Send multicast requests to the agent services pre-deployed on each node in the distributed storage cluster.

S202: When it is determined that a node has failed to respond to the multicast request for a preset number of consecutive times, determine that node to be a failed node.

The number of consecutive unanswered multicast requests required before a node is determined to be a failed node may be preset. When it is determined that a node has failed to respond to the multicast request for the preset number of consecutive times, that node is determined to be a failed node. Verifying multiple times avoids misjudgment caused by network jitter and the like.

It should be noted that the preset number of times may be set and adjusted according to actual situations, which is not limited in the embodiment of the present invention, and may be set to 3 times, for example.
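The consecutive-miss rule in S202 can be sketched as a per-node counter that resets whenever the node answers. The class and method names below are hypothetical, and the default threshold of 3 is the example value from the description.

```python
from collections import defaultdict


class MissTracker:
    """Counts consecutive unanswered probes for each node."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.misses = defaultdict(int)

    def record_round(self, cluster_nodes, responders):
        """Update counters for one probe round.

        Returns the nodes whose consecutive miss count reached the
        threshold in this round, so each failure is reported once.
        """
        failed = []
        for node in cluster_nodes:
            if node in responders:
                self.misses[node] = 0  # any reply resets the streak
            else:
                self.misses[node] += 1
                if self.misses[node] == self.threshold:
                    failed.append(node)
        return failed
```

Because a single reply resets the counter, a node that misses one or two probes during brief network jitter is not reported as failed.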

S203: Clear the relevant authentication information of the failed node in the distributed storage cluster.

S204: Add fault identification information after the serial number (SN) corresponding to the failed node.

After the relevant authentication information of the failed node in the distributed storage cluster is cleared, fault identification information, such as a fault/clear identifier, may be added after the SN (i.e., the product serial number) corresponding to the failed node. This indicates that the node was removed from the distributed storage cluster because of a fault, and that the storage services and configuration information on the node, such as MON and OSD, have not yet been completely cleared.
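Marking a failed node as in S204 can be sketched as appending a suffix to the stored SN. The suffix value, the SN format, and the function names are assumptions for illustration; the description only states that fault identification information follows the SN.

```python
# Hypothetical marker; the patent does not define its exact form.
FAULT_SUFFIX = "-fault"


def add_fault_mark(sn):
    """Append the fault marker to an SN unless it is already present."""
    return sn if sn.endswith(FAULT_SUFFIX) else sn + FAULT_SUFFIX


def has_fault_mark(sn):
    """Report whether the SN carries the fault marker."""
    return sn.endswith(FAULT_SUFFIX)


marked = add_fault_mark("SN12345")
```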

S205: When a cluster joining request is received, detect whether fault identification information exists after the SN of the node to be joined; if yes, execute step S206; if not, add the node to be joined directly to the distributed storage cluster.

When a new node, or a node whose fault has been repaired, needs to join the distributed storage cluster, it can send a cluster joining request to the detection service. After receiving the request, the detection service can detect whether fault identification information exists after the SN of the node to be joined, and thereby determine whether the node is one that is re-applying to join the distributed storage cluster after fault repair. If the fault identification information exists after the SN of the node to be joined, the node is re-applying to join after fault repair, and step S206 is executed next. If the fault identification information does not exist after the SN of the node to be joined, the node is a new node applying to join, and it may be added directly to the distributed storage cluster.

S206: Clear the original cluster service information in the node to be joined, and add the node to be joined, with its original cluster service information cleared, to the distributed storage cluster.

When it is determined that the fault identification information exists after the SN of the node to be joined, the original cluster service information in that node can be cleared, and the node can then be added to the distributed storage cluster. Clearing the original cluster service information in the node to be joined prevents isolated services left over from before the last fault from affecting client service access after the node joins the distributed storage cluster.
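The join handling in S205 and S206 can be sketched as a single function. The marker value, the function signature, and the `wipe_old_services` callback are hypothetical; the callback stands in for whatever procedure clears the leftover MON and OSD service information on the returning node.

```python
# Hypothetical marker; the patent does not define its exact form.
FAULT_SUFFIX = "-fault"


def handle_join_request(node_name, sn, cluster, wipe_old_services):
    """Admit a node to the cluster, cleaning it first if it failed before.

    Returns True when the SN carried the fault marker and the old
    cluster service information was cleared, False for a new node.
    The patent does not say whether the marker is removed after the
    node rejoins, so this sketch leaves the SN untouched.
    """
    previously_failed = sn.endswith(FAULT_SUFFIX)
    if previously_failed:
        wipe_old_services(node_name)  # S206: clear leftover services
    cluster.add(node_name)
    return previously_failed


wiped = []
cluster = set()
rejoined = handle_join_request("noden", "SN12345-fault", cluster, wiped.append)
fresh = handle_join_request("nodex", "SN99999", cluster, wiped.append)
```

In this example only `noden` is cleaned before joining, while `nodex` is added directly as a new node.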

Corresponding to the above method embodiment, the embodiment of the present invention further provides a distributed cluster node fault handling apparatus, and the distributed cluster node fault handling apparatus described below and the distributed cluster node fault handling method described above may be referred to correspondingly.

Referring to Fig. 3, Fig. 3 is a block diagram illustrating the structure of a distributed cluster node fault handling apparatus according to an embodiment of the present invention, where the apparatus may include:

a request sending module 31, configured to send a multicast request to a proxy service pre-deployed by each node in the distributed storage cluster;

a failed node determination module 32, configured to determine a node that does not respond to the multicast request as a failed node when it is determined that there is a node that does not respond to the multicast request;

and the authentication information clearing module 33 is configured to clear the relevant authentication information of the failed node in the distributed storage cluster.

With the apparatus provided by the embodiments of the invention, multicast requests are sent to the agent services pre-deployed on each node in the distributed storage cluster; when it is determined that a node has not responded to the multicast request, that node is determined to be a failed node; and the relevant authentication information of the failed node in the distributed storage cluster is cleared. Because an agent service is pre-deployed on each node in the distributed storage cluster, a failed node can be detected promptly from each node's response, through its agent service, to the multicast request, and the relevant authentication information of the failed node can be cleared promptly. The failed node is thereby removed from the distributed storage cluster in time, data reconstruction caused by node downtime is avoided, the impact on the overall performance of the cluster is greatly reduced, and the normal service operation of clients is ensured.

In one embodiment of the present invention, the apparatus may further comprise:

an identification information adding module, configured to add the fault identification information after the serial number (SN) corresponding to the failed node, after the relevant authentication information of the failed node in the distributed storage cluster has been cleared.

In one embodiment of the present invention, the apparatus may further comprise:

a node adding module, configured to clear the original cluster service information in the node to be joined when it is detected that the fault identification information exists after the SN of the node to be joined, and to add the node to be joined, with its original cluster service information cleared, to the distributed storage cluster.

In one embodiment of the present invention, the failed node determination module 32 is specifically a module that, when it is determined that a node has failed to respond to the multicast request for a preset number of consecutive times, determines that node to be a failed node.

Corresponding to the above method embodiment, referring to fig. 4, fig. 4 is a schematic diagram of a distributed cluster node fault handling device provided in the present invention, where the device may include:

a memory 41 for storing a computer program;

a processor 42 which, when executing the computer program stored in the memory 41, may implement the following steps:

respectively sending multicast requests to agent services pre-deployed by each node in the distributed storage cluster; when determining that there is a node which does not respond to the multicast request, determining the node which does not respond to the multicast request as a failed node; and clearing the relevant authentication information of the fault node in the distributed storage cluster.

For the introduction of the device provided by the present invention, please refer to the above method embodiment, which is not described herein again.

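Corresponding to the above method embodiment, the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program may implement the following steps:

sending multicast requests to the agent services pre-deployed on each node in the distributed storage cluster; when it is determined that a node has not responded to the multicast request, determining that node to be a failed node; and clearing the relevant authentication information of the failed node in the distributed storage cluster.

The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.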

For the introduction of the computer-readable storage medium provided by the present invention, please refer to the above method embodiments, which are not described herein again.

The embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. The apparatus, the device, and the computer-readable storage medium disclosed in the embodiments correspond to the method disclosed in the embodiments, so they are described only briefly; for the relevant details, refer to the description of the method.

The principle and the implementation of the present invention are explained in the present application by using specific examples, and the above description of the embodiments is only used to help understanding the technical solution and the core idea of the present invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims

1. A distributed cluster node fault handling method, comprising:

2. The method according to claim 1, wherein after clearing the relevant authentication information of the failed node in the distributed storage cluster, the method further comprises:

3. The method according to claim 2, wherein after adding the fault identification information after the SN serial number corresponding to the failed node, the method further comprises:

4. The distributed cluster node fault handling method of any of claims 1-3, wherein determining a node that does not respond to the multicast request as a failed node, when it is determined that there is a node that does not respond to the multicast request, includes:

5. A distributed cluster node fault handling apparatus, comprising:

6. The distributed cluster node fault handling apparatus of claim 5, further comprising:

7. The distributed cluster node fault handling apparatus of claim 6, further comprising:

8. The distributed cluster node fault handling apparatus as claimed in any of claims 5 to 7, wherein the failed node determination module is specifically a module that, when it is determined that a node has failed to respond to the multicast request for a preset number of consecutive times, determines that node to be a failed node.

9. A distributed cluster node fault handling device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the distributed cluster node fault handling method of any of claims 1-4 when executing the computer program.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, carries out the steps of the distributed cluster node fault handling method according to any of claims 1 to 4.