CN111901422A

CN111901422A - Method, system and device for managing nodes in cluster

Info

Publication number: CN111901422A
Application number: CN202010738723.7A
Authority: CN
Inventors: 李二明
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2020-07-28
Filing date: 2020-07-28
Publication date: 2020-11-06
Anticipated expiration: 2040-07-28
Also published as: CN111901422B

Abstract

The invention discloses a method, a system and a device for managing nodes in a cluster. When a deployment instruction representing a cluster grouping mode is received, the grouping condition of each node in the cluster is determined according to a preset cluster grouping deployment strategy; the configuration file corresponding to each node in the cluster is configured according to the grouping condition of each node; and after the configuration files corresponding to the nodes are configured, the nodes in the cluster are restarted so that the cluster grouping mode takes effect. In the cluster grouping mode, a target node monitors the node states of the remaining nodes of the same group through message passing with those nodes. Therefore, in a large-scale cluster, the nodes can be deployed in the cluster grouping mode, and only the nodes in the same group need to monitor one another, which facilitates stable monitoring of the states among the nodes and makes misjudgment less likely; moreover, the cluster grouping mode helps identify a fault node in the cluster so that the fault node does not affect the service.

Description

Method, system and device for managing nodes in cluster

Technical Field

The present invention relates to the field of cluster node management, and in particular, to a method, a system, and an apparatus for managing nodes in a cluster.

Background

At present, one method for monitoring node states in a cluster is as follows: each node in the cluster establishes a TCP (Transmission Control Protocol) connection with all other nodes, and each node judges whether the states of all other nodes are normal through message passing with them. When the cluster is small, the message traffic among the nodes is modest. When the cluster is large, for example when the number of nodes reaches hundreds or even thousands, the message traffic among the nodes becomes very heavy, which hinders stable monitoring of the states among the nodes and easily causes misjudgment; moreover, in a large-scale cluster it is difficult to identify a failed node, and the consequences are serious if the service is affected.

Therefore, how to provide a solution to the above technical problem is a problem that needs to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to provide a method, a system and a device for managing nodes in a cluster, wherein under a large-scale cluster, the method can deploy each node in the cluster in a cluster grouping mode, and each node in the same group only needs to be monitored mutually, so that the method is favorable for stable monitoring of the state among the nodes and is not easy to cause misjudgment; moreover, the cluster grouping mode is adopted to help identify the fault node in the cluster so as to avoid the fault node from influencing the service.

In order to solve the above technical problem, the present invention provides a method for managing nodes in a cluster, including:

when a deployment instruction representing a cluster grouping mode is received, determining the grouping condition of each node in a cluster according to a preset cluster grouping deployment strategy;

configuring configuration files corresponding to the nodes in the cluster according to the grouping condition of the nodes in the cluster; the target configuration file corresponding to the target node represents a specific node in the same group with the target node; the target node is any node in the cluster;

restarting each node in the cluster after configuration files corresponding to each node in the cluster are configured, so that the cluster grouping mode takes effect; wherein, in the cluster grouping mode, the target node is configured to monitor node states of the remaining nodes through message passing with the remaining nodes of the same group.

Preferably, the process of determining the grouping condition of each node in the cluster according to the preset cluster grouping deployment policy includes:

and determining the grouping condition of each node in the cluster based on a cluster grouping deployment strategy for dividing the nodes belonging to the same network segment and/or corresponding to the same storage pool into the same group.

Preferably, the method for managing nodes in the cluster further includes:

when a node in the cluster fails, electing a master node from the normal nodes in the group of the fault node according to a preset election mechanism;

judging whether the master node holds a distributed lock;

if yes, executing fault processing operation of the fault node;

if not, triggering the master node to send distributed lock acquisition requests to the other normal nodes in the same group, and judging whether the total number of nodes replying to the master node based on the distributed lock acquisition requests is larger than a preset reply number threshold value;

if so, determining that the master node successfully acquires the distributed lock to execute the fault processing operation of the fault node;

and if not, determining that the master node is a false master node, forbidding the false master node to enter a connection state within a preset duration, and re-executing the operation of electing a master node from the normal nodes in the group of the fault node according to the preset election mechanism.

Preferably, the process of performing the fault handling operation of the faulty node includes:

performing data recovery on the database of the fault node, and synchronizing the database content of the normal node of the group of the fault node based on the database of the data recovery; wherein, the database contents of the nodes in the same group are the same;

and releasing the virtual IP of the fault node, and reallocating the virtual IP of the fault node to a normal node of the group where the fault node is located based on a load balancing strategy so that the normal node replaces the fault node to continue processing node tasks.

Preferably, the process of executing the fault handling operation of the faulty node further includes:

and informing all normal nodes of the group of the fault nodes of the fault information of the fault nodes.

Preferably, the method for managing nodes in the cluster further includes:

when a deployment instruction representing a cluster mode is received, all nodes in the cluster are divided into the same group;

configuring a configuration file corresponding to each node in the cluster according to the condition that each node in the cluster belongs to the same group;

restarting each node in the cluster after configuration files corresponding to each node in the cluster are configured, so that the cluster mode takes effect; wherein, in the cluster mode, the target node is configured to monitor node states of the remaining nodes in the cluster through message passing with the remaining nodes.

In order to solve the above technical problem, the present invention further provides a management system for nodes in a cluster, including:

the first grouping module is used for determining the grouping condition of each node in the cluster according to a preset cluster grouping deployment strategy when a deployment instruction for representing a cluster grouping mode is received;

the first configuration module is used for configuring configuration files corresponding to the nodes in the cluster according to the grouping condition of the nodes in the cluster; the target configuration file corresponding to the target node represents a specific node in the same group with the target node; the target node is any node in the cluster;

the first restarting module is used for restarting each node in the cluster after the configuration files corresponding to each node in the cluster are configured, so that the cluster grouping mode takes effect; wherein, in the cluster grouping mode, the target node is configured to monitor node states of the remaining nodes through message passing with the remaining nodes of the same group.

Preferably, the management system of the nodes in the cluster further includes:

the election module is used for electing a master node from the normal nodes of the group where a fault node is located according to a preset election mechanism when a node in the cluster fails;

the judging module is used for judging whether the master node holds the distributed lock; if yes, executing the processing module; if not, executing the acquisition module;

the processing module is used for executing the fault processing operation of the fault node;

the acquisition module is used for triggering the master node to send distributed lock acquisition requests to the other normal nodes in the same group and judging whether the total number of nodes replying to the master node based on the distributed lock acquisition requests is larger than a preset reply number threshold value; if yes, executing the processing module; if not, executing the prohibiting module;

and the prohibiting module is used for determining that the master node is a false master node, prohibiting the false master node from entering a connection state within a preset duration, and re-executing the election module.

Preferably, the management system of the nodes in the cluster further includes:

the second grouping module is used for dividing all nodes in the cluster into the same group when a deployment instruction for representing a cluster mode is received;

the second configuration module is used for configuring configuration files corresponding to the nodes in the cluster according to the condition that the nodes in the cluster belong to the same group;

the second restarting module is used for restarting each node in the cluster after the configuration files corresponding to each node in the cluster are configured, so that the cluster mode takes effect; wherein, in the cluster mode, the target node is configured to monitor node states of the remaining nodes in the cluster through message passing with the remaining nodes.

In order to solve the above technical problem, the present invention further provides a management apparatus for nodes in a cluster, including:

a memory for storing a computer program;

and a processor for implementing, when executing the computer program, the steps of any one of the above methods for managing nodes in a cluster.

The invention provides a management method of nodes in a cluster, which comprises the steps of determining grouping conditions of all nodes in the cluster according to a preset cluster grouping deployment strategy when a deployment instruction representing a cluster grouping mode is received; configuring a configuration file corresponding to each node in the cluster according to the grouping condition of each node in the cluster; the target configuration file corresponding to the target node represents a specific node in the same group with the target node; after configuration files corresponding to the nodes in the cluster are configured, restarting the nodes in the cluster to enable the cluster grouping mode to take effect; wherein, in the cluster grouping mode, the target node is used for monitoring the node states of the other nodes through message transmission with the other nodes in the same group. Therefore, under a large-scale cluster, each node in the cluster can be deployed in a cluster grouping mode, and each node in the same group only needs to be monitored mutually, so that stable monitoring of states among the nodes is facilitated, and misjudgment is not easy to cause; moreover, the cluster grouping mode is adopted to help identify the fault node in the cluster so as to avoid the fault node from influencing the service.

The invention also provides a system and a device for managing the nodes in the cluster, and the system and the device have the same beneficial effects as the management method.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the prior art and the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for managing nodes in a cluster according to an embodiment of the present invention;

Fig. 2 is a deployment scheme of a cluster grouping mode according to an embodiment of the present invention;

Fig. 3 is a flowchart of distributed lock acquisition in a cluster grouping mode according to an embodiment of the present invention;

Fig. 4 is a deployment scheme of a cluster mode according to an embodiment of the present invention.

Detailed Description

The core of the invention is to provide a method, a system and a device for managing nodes in a cluster, under a large-scale cluster, the method can deploy each node in the cluster by adopting a cluster grouping mode, and each node in the same group only needs to be monitored mutually, thereby being beneficial to the stable monitoring of the state among the nodes and not easy to cause misjudgment; moreover, the cluster grouping mode is adopted to help identify the fault node in the cluster so as to avoid the fault node from influencing the service.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for managing nodes in a cluster according to an embodiment of the present invention.

The management method of the nodes in the cluster comprises the following steps:

Step S1: When a deployment instruction representing a cluster grouping mode is received, determining the grouping condition of each node in the cluster according to a preset cluster grouping deployment strategy.

It should be noted that "preset" in the present application means set in advance: a preset item only needs to be set once and need not be reset unless it has to be modified according to the actual situation.

Specifically, the present application provides a cluster grouping mode for the nodes in the cluster (each node in the cluster is able to establish a TCP connection with all other nodes), in which the nodes in the cluster are divided into a plurality of groups, as shown in Fig. 2 (taking nodes n1-n9 as an example). Only the nodes in the same group need to monitor one another, and nodes in different groups are independent of each other.

Based on this, the cluster grouping deployment strategy for guiding the grouping of the nodes in the cluster needs to be set in advance, so that when a deployment instruction representing a cluster grouping mode is received, the nodes in the cluster are grouped and divided according to the set cluster grouping deployment strategy, and thus the grouping condition of each node in the cluster is determined.
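The reduction in monitoring traffic can be illustrated by counting pairwise monitoring links. The sketch below is only an illustration: the helper names are hypothetical, the split of nodes n1-n9 into three groups of three is an assumption (the text does not state the group sizes of Fig. 2), and the 200-node figures reuse the storage-pool example given later in this description.

```python
def full_mesh_links(n: int) -> int:
    """Pairwise monitoring links when every node watches every other node."""
    return n * (n - 1) // 2


def grouped_links(group_sizes: list[int]) -> int:
    """Pairwise monitoring links when nodes only watch peers in their own group."""
    return sum(full_mesh_links(size) for size in group_sizes)


# Nine nodes, assumed here to be split into three groups of three.
print(full_mesh_links(9))        # 36
print(grouped_links([3, 3, 3]))  # 9

# 200 nodes split into five groups of 40.
print(full_mesh_links(200))      # 19900
print(grouped_links([40] * 5))   # 3900
```

Under these assumptions, grouping cuts the number of monitored pairs roughly by the number of groups, which is the effect the cluster grouping mode relies on.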

Step S2: configuring a configuration file corresponding to each node in the cluster according to the grouping condition of each node in the cluster; and the target configuration file corresponding to the target node represents a specific node in the same group with the target node.

It should be noted that the target node in the present application is any node in the cluster.

Specifically, each node in the cluster corresponds to one configuration file, and taking the target node as an example, the role of the configuration file is described as follows: the target configuration file corresponding to the target node indicates specific nodes in the same group as the target node, that is, what specific nodes are monitored by the target node can be known from the target configuration file.

Based on this, the configuration file corresponding to each node in the cluster can be configured according to the grouping condition of each node, so that nodes in the same group subsequently monitor one another while nodes in different groups remain independent.
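A minimal sketch of step S2 follows. The description does not specify the layout of the configuration file, so the `group=`, `self=` and `peer=` keys and the function name `build_node_configs` are purely hypothetical; the only property taken from the text is that each node's file names exactly the nodes in its own group.

```python
def build_node_configs(groups: dict[str, list[str]]) -> dict[str, str]:
    """Return, for every node, the text of a configuration file listing its group peers.

    `groups` maps a group identifier to the names of the nodes in that group.
    The key=value layout is an assumption made for illustration only.
    """
    configs: dict[str, str] = {}
    for group_id, members in groups.items():
        for node in members:
            peers = [m for m in members if m != node]
            lines = [f"group={group_id}", f"self={node}"]
            lines += [f"peer={p}" for p in peers]
            configs[node] = "\n".join(lines) + "\n"
    return configs


configs = build_node_configs({"g1": ["n1", "n2", "n3"], "g2": ["n4", "n5"]})
print(configs["n1"], end="")
# group=g1
# self=n1
# peer=n2
# peer=n3
```

Passing a single group that contains every node would yield files in which each node lists all other nodes, which corresponds to the cluster mode described later.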

Step S3: after configuration files corresponding to the nodes in the cluster are configured, restarting the nodes in the cluster to enable the cluster grouping mode to take effect; wherein, in the cluster grouping mode, the target node is used for monitoring the node states of the other nodes through message transmission with the other nodes in the same group.

Specifically, after the configuration files corresponding to the nodes in the cluster are configured, the nodes in the cluster need to be restarted, and the cluster grouping mode takes effect once the nodes have restarted. In the cluster grouping mode, taking a target node as an example, the principle of monitoring the node state is as follows: the target node monitors the node states of the other nodes in the same group through message passing with those nodes. Specifically, a CTDB (Cluster Trivial Database) service runs on each node in the same group, and node-state monitoring is carried out through the CTDB services running on those nodes.
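The group-scoped monitoring described above can be sketched as follows. This is not the CTDB protocol: the class `GroupMonitor`, the timeout rule and the explicit timestamps are assumptions chosen to show the idea that a node tracks messages only from the peers named in its own configuration file.

```python
class GroupMonitor:
    """Tracks the last message time of each peer in the same group (hypothetical sketch)."""

    def __init__(self, peers: list[str], timeout: float, start: float) -> None:
        self.timeout = timeout
        # Every peer is treated as last seen at start-up time.
        self.last_seen: dict[str, float] = {peer: start for peer in peers}

    def record_message(self, peer: str, now: float) -> None:
        """Record a message from a peer; messages from nodes outside the group are ignored."""
        if peer in self.last_seen:
            self.last_seen[peer] = now

    def failed_peers(self, now: float) -> list[str]:
        """Peers whose last message is older than the timeout are judged abnormal."""
        return sorted(p for p, seen in self.last_seen.items() if now - seen > self.timeout)


monitor = GroupMonitor(peers=["n2", "n3"], timeout=5.0, start=0.0)
monitor.record_message("n2", now=4.0)
monitor.record_message("n9", now=4.0)  # n9 belongs to another group and is ignored
print(monitor.failed_peers(now=6.0))   # ['n3']
```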

On the basis of the above-described embodiment:

As an optional embodiment, the process of determining the grouping condition of each node in the cluster according to the preset cluster grouping deployment policy includes:

Specifically, there are three cluster grouping deployment strategies: 1) The nodes belonging to the same network segment in the cluster are divided into the same group, which facilitates later maintenance. 2) The nodes corresponding to the same storage pool in the cluster are divided into the same group, which avoids data loss caused by discontinuous data storage during subsequent failover; for example, if a cluster has 200 nodes and every 40 nodes share one storage pool, the nodes in the cluster can be divided into five groups, with the nodes corresponding to the same storage pool belonging to the same group. 3) The nodes that both belong to the same network segment and correspond to the same storage pool are divided into the same group.
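The three strategies above can be sketched with one grouping function keyed on network segment, storage pool, or both. The `Node` record, its field names and the sample addresses are assumptions for illustration; the network segment is derived with Python's standard `ipaddress` module.

```python
from collections import defaultdict
from dataclasses import dataclass
from ipaddress import ip_interface


@dataclass(frozen=True)
class Node:
    name: str
    address: str       # address with prefix length, e.g. "10.0.1.11/24"
    storage_pool: str


def group_nodes(nodes, by_segment=True, by_pool=True):
    """Group node names by network segment and/or storage pool."""
    groups = defaultdict(list)
    for node in nodes:
        key = []
        if by_segment:
            key.append(str(ip_interface(node.address).network))
        if by_pool:
            key.append(node.storage_pool)
        groups[tuple(key)].append(node.name)
    return dict(groups)


nodes = [
    Node("n1", "10.0.1.11/24", "pool-a"),
    Node("n2", "10.0.1.12/24", "pool-a"),
    Node("n3", "10.0.2.11/24", "pool-b"),
    Node("n4", "10.0.2.12/24", "pool-a"),
]
print(group_nodes(nodes, by_pool=False))     # strategy 1: same network segment
print(group_nodes(nodes, by_segment=False))  # strategy 2: same storage pool
print(group_nodes(nodes))                    # strategy 3: both criteria
```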

As an optional embodiment, the method for managing nodes in a cluster further includes:

judging whether the master node holds a distributed lock;

if yes, executing fault processing operation of the fault node;

if not, triggering the master node to send distributed lock acquisition requests to the other normal nodes in the same group, and judging whether the total number of nodes replying to the master node based on the distributed lock acquisition requests is greater than a preset reply number threshold value;

and if not, determining that the master node is a false master node, forbidding the false master node to enter a connection state within a preset duration, and re-executing the operation of electing a master node from the normal nodes in the group where the fault node is located according to the preset election mechanism.

Further, referring to fig. 3, when a node in the cluster fails, the present application may select a single master node from normal nodes in the group where the failed node is located according to a preset election mechanism (for example, an election mechanism in which a node with the earliest start time is used as a master node, etc.), and if the selected node is already a true master node before the election, the selected node holds a distributed lock (a locking authority that the true master node has).

Based on this, after a single master node is elected from the normal nodes in the group where the fault node is located, it is judged whether the elected master node holds the distributed lock. If it does, the master node is determined to be a true master node, and the fault processing operation of the fault node can proceed. If it does not, the master node is triggered to send distributed lock acquisition requests to the other normal nodes in the same group; those nodes reply to the master node after receiving the requests, and it is then judged whether the total number of nodes that replied to the master node is greater than a preset reply number threshold (the reply number threshold is generally set to 1/2 of the number of all normal nodes in the group where the master node is located). If the total is greater than the threshold, the master node is determined to have successfully acquired the distributed lock, its role as master node takes effect, and the fault processing operation of the fault node proceeds. If the total is not greater than the threshold, the master node is determined to be a false master node and the election result is invalidated; the false master node is prohibited from entering a connection state within a preset duration (only normal nodes in the group can be elected as the master node, so prohibiting the false master node from entering the connection state means that it cannot be elected as the master node for a period of time), and the election is then restarted, that is, the operation of electing a master node from the normal nodes in the group where the fault node is located according to the preset election mechanism is re-executed until a true master node is elected.
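The election and lock-acquisition loop of Fig. 3 can be sketched as below. Several points are assumptions: the function names are hypothetical, ties in start time are broken by node name, the replies each candidate would receive are supplied as a precomputed `reply_counts` mapping rather than gathered over the network, and the "preset duration" ban is simplified to exclusion for the rest of the call. The earliest-start-time election and the 1/2 threshold come from the text.

```python
from typing import Optional


def elect(start_times: dict[str, float], banned: set[str]) -> Optional[str]:
    """Elect the normal node with the earliest start time that is not currently banned."""
    candidates = [n for n in start_times if n not in banned]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (start_times[n], n))


def resolve_master(start_times: dict[str, float],
                   lock_holder: Optional[str],
                   reply_counts: dict[str, int]) -> Optional[str]:
    """Repeat the election until a true master node is found, or no candidate remains.

    start_times  : start time of every normal node in the faulty node's group
    lock_holder  : node already holding the distributed lock, if any
    reply_counts : number of peers that would reply to each node's lock request
    """
    threshold = len(start_times) / 2  # 1/2 of the normal nodes in the group
    banned: set[str] = set()
    while True:
        master = elect(start_times, banned)
        if master is None:
            return None                 # every candidate turned out to be a false master
        if master == lock_holder:
            return master               # already the true master node
        if reply_counts.get(master, 0) > threshold:
            return master               # lock acquired, master role takes effect
        banned.add(master)              # false master: excluded from re-election


starts = {"n1": 10.0, "n2": 12.0, "n3": 15.0, "n4": 20.0, "n5": 21.0}
print(resolve_master(starts, lock_holder=None, reply_counts={"n1": 2, "n2": 3}))  # n2
```

In this run n1 is elected first but gathers only 2 replies against a threshold of 2.5, so it is treated as a false master node and n2, with 3 replies, becomes the master node.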

As an alternative embodiment, the process of performing the fault handling operation of the faulty node includes:

performing data recovery on the database of the fault node, and synchronizing the database contents of the normal nodes of the group of the fault node based on the database of the data recovery; wherein, the database contents of the nodes in the same group are the same;

and releasing the virtual IP of the fault node, and reallocating the virtual IP of the fault node to a normal node of the group where the fault node is located based on a load balancing strategy so that the normal node replaces the fault node to continue processing the node task.

Specifically, the fault handling operation of the faulty node includes: 1) Database recovery: the database contents of the nodes in the same group are kept synchronized, but when the failed node fails, the nodes of the same group may not have performed a new round of database content synchronization; therefore, data recovery is performed on the database of the failed node, and the database contents of the normal nodes of the group where the failed node is located are synchronized based on the recovered database, thereby ensuring the integrity of the database contents of the nodes in the same group. 2) Virtual IP (Internet Protocol) reallocation: the virtual IP of the failed node is released, a new node to take over the node tasks of the failed node is selected from the normal nodes of the group where the failed node is located based on a load balancing strategy, and the virtual IP of the failed node is reallocated to the selected node, so that the new node continues processing the node tasks in place of the failed node.
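The virtual IP reallocation step can be sketched as follows. The text does not define the load balancing strategy, so this sketch assumes a simple one: each released virtual IP goes to the normal node that currently holds the fewest virtual IPs, with ties broken by node name. The function name and the sample addresses are hypothetical.

```python
def reassign_virtual_ips(assignments: dict[str, list[str]],
                         failed_node: str,
                         normal_nodes: list[str]) -> dict[str, list[str]]:
    """Release the failed node's virtual IPs and hand them to the least-loaded normal nodes."""
    if not normal_nodes:
        raise ValueError("no normal node left in the group to take over")
    # Copy the current assignments of the normal nodes; the failed node is dropped.
    result = {n: list(assignments.get(n, [])) for n in normal_nodes}
    for vip in assignments.get(failed_node, []):
        target = min(result, key=lambda n: (len(result[n]), n))
        result[target].append(vip)
    return result


current = {
    "n1": ["10.0.0.1"],
    "n2": ["10.0.0.2", "10.0.0.3"],
    "n3": ["10.0.0.4"],
}
print(reassign_virtual_ips(current, failed_node="n2", normal_nodes=["n1", "n3"]))
# {'n1': ['10.0.0.1', '10.0.0.2'], 'n3': ['10.0.0.4', '10.0.0.3']}
```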

As an alternative embodiment, the process of performing the fault handling operation of the faulty node further includes:

Further, the fault handling operation of the faulty node also includes fault notification, that is, notifying all normal nodes in the group where the faulty node is located of the fault information of the faulty node.

when a deployment instruction representing a cluster mode is received, all nodes in a cluster are divided into the same group;

after configuration files corresponding to the nodes in the cluster are configured, restarting the nodes in the cluster to enable the cluster mode to take effect; wherein, in the cluster mode, the target node is used for monitoring the node states of the rest nodes through message passing with the rest nodes in the cluster.

Further, the present application also provides a cluster mode for the nodes in the cluster, in which all the nodes in the cluster are divided into the same group and, as shown in Fig. 4, all the nodes in the cluster monitor one another. Based on this, when a deployment instruction representing the cluster mode is received, all nodes in the cluster are divided into the same group, and the configuration files corresponding to the nodes are configured according to the condition that all nodes in the cluster belong to the same group, so that the nodes in the cluster subsequently monitor one another.

After the configuration files corresponding to the nodes in the cluster are configured, the nodes in the cluster need to be restarted, and the cluster mode takes effect once the nodes have restarted. In the cluster mode, taking a target node as an example, the principle of monitoring the node state is as follows: the target node monitors the node states of the other nodes in the cluster through message passing with those nodes. Specifically, a CTDB service runs on each node in the cluster, and node-state monitoring is carried out through the CTDB services running on those nodes.

It should be noted that when the cluster scale is small (the number of nodes is less than a preset node number threshold), the cluster mode is adopted to deploy the nodes in the cluster; when the cluster scale is large (the number of nodes is greater than or equal to the preset node number threshold), the cluster grouping mode is adopted to deploy the nodes in the cluster.
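The mode selection rule above reduces to a single comparison. In the sketch below the function name, the mode labels and the threshold value of 100 are hypothetical; the text only says that the threshold is preset.

```python
def choose_deployment_mode(node_count: int, node_threshold: int) -> str:
    """Pick the cluster mode below the threshold, the cluster grouping mode at or above it."""
    if node_count < node_threshold:
        return "cluster"
    return "cluster-grouping"


print(choose_deployment_mode(9, node_threshold=100))    # cluster
print(choose_deployment_mode(200, node_threshold=100))  # cluster-grouping
```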

The present application further provides a management system for nodes in a cluster, including:

the first configuration module is used for configuring configuration files corresponding to all nodes in the cluster according to the grouping condition of all nodes in the cluster; the target configuration file corresponding to the target node represents a specific node in the same group with the target node; the target node is any node in the cluster;

the first restarting module is used for restarting each node in the cluster after the configuration files corresponding to each node in the cluster are configured so as to enable the cluster grouping mode to take effect; wherein, in the cluster grouping mode, the target node is used for monitoring the node states of the other nodes through message transmission with the other nodes in the same group.

As an optional embodiment, the management system of the nodes in the cluster further includes:

the processing module is used for executing fault processing operation of the fault node;

and the prohibiting module is used for determining that the master node is a false master node, prohibiting the false master node from entering a connection state within a preset duration, and executing the election module again.

the second grouping module is used for dividing all nodes in the cluster into the same group when a deployment instruction for representing the cluster mode is received;

the second restarting module is used for restarting each node in the cluster after the configuration files corresponding to each node in the cluster are configured, so that the cluster mode takes effect; wherein, in the cluster mode, the target node is used for monitoring the node states of the rest nodes through message passing with the rest nodes in the cluster.

For introduction of the management system provided in the present application, please refer to the above-mentioned embodiment of the management method, which is not described herein again.

The present application further provides a management apparatus for nodes in a cluster, including:

a memory for storing a computer program;

and a processor for implementing, when executing the computer program, the steps of any one of the above methods for managing nodes in a cluster.

For an introduction to the management apparatus provided in the present application, please refer to the above embodiments of the management method, which are not repeated here.

It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for managing nodes in a cluster is characterized by comprising the following steps:

2. The method for managing nodes in a cluster according to claim 1, wherein the process of determining the grouping condition of each node in the cluster according to the preset cluster grouping deployment policy comprises:

3. The method for managing nodes in a cluster according to claim 1, wherein the method for managing nodes in a cluster further comprises:

judging whether the master node holds a distributed lock;

if yes, executing fault processing operation of the fault node;

4. The method for managing nodes in a cluster according to claim 3, wherein the process of performing the fault handling operation of the faulty node comprises:

5. The method for managing nodes in a cluster according to claim 4, wherein the process of performing the fault handling operation of the faulty node further comprises:

6. The method for managing nodes in a cluster according to any one of claims 1 to 5, wherein the method for managing nodes in a cluster further comprises:

7. A system for managing nodes in a cluster, comprising:

8. The system for managing nodes in a cluster of claim 7, wherein the system for managing nodes in a cluster further comprises:

9. The management system for nodes in a cluster according to any of claims 7-8, wherein the management system for nodes in a cluster further comprises:

10. An apparatus for managing nodes in a cluster, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the method of managing nodes in a cluster according to any of claims 1 to 6 when executing said computer program.