CN115878361A - Node management method and device for database cluster and electronic equipment - Google Patents

Node management method and device for database cluster and electronic equipment

Info

Publication number
CN115878361A
CN115878361A
Authority
CN
China
Prior art keywords
node
network address
slave
slave node
database cluster
Prior art date
Legal status
Pending
Application number
CN202211711963.3A
Other languages
Chinese (zh)
Inventor
许可
王真
Current Assignee
Hillstone Networks Co Ltd
Original Assignee
Hillstone Networks Co Ltd
Priority date
Filing date
Publication date
Application filed by Hillstone Networks Co Ltd
Priority to CN202211711963.3A
Publication of CN115878361A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Hardware Redundancy (AREA)

Abstract

The application discloses a node management method and device for a database cluster, and an electronic device. The method comprises the following steps: creating a plurality of nodes in a database cluster, wherein the plurality of nodes comprise a master node and at least one slave node; monitoring, through a target container in the master node, whether the master node becomes abnormal during operation; when the master node is abnormal, determining the master node to be a node to be processed, and selecting one slave node from the at least one slave node as a new master node; re-creating the node to be processed, and deleting the network address corresponding to the node to be processed; and after the node to be processed is successfully rebuilt, acquiring a target network address, and determining, according to the target network address, the successfully rebuilt node to be processed as a new slave node. The method and device solve the technical problem in the prior art of low failure recovery efficiency when the master node of a database cluster fails.

Description

Node management method and device of database cluster and electronic equipment
Technical Field
The application relates to the field of database clusters, in particular to a node management method and device of a database cluster and electronic equipment.
Background
With the development of cloud technology, moving services to the cloud has become a major trend. A service deployed on Kubernetes (K8s for short, an open-source system for managing containerized applications across multiple hosts in a cloud platform) can rely on the capabilities provided by Kubernetes to achieve high availability. For example, for a service deployed with a Deployment or StatefulSet, Kubernetes always maintains a fixed number of pods (a pod is the smallest computing unit in a Kubernetes application; one pod corresponds to one node in a database cluster) in the cluster, and when a pod fails, the failed pod is re-created in time, so that high availability of the service is achieved. During operation of the service, data is often stored in databases such as etcd, mongo and clickhouse, and once such a database fails, the service running on Kubernetes becomes abnormal. This requires that databases such as etcd, mongo and clickhouse can also be deployed in a Kubernetes environment with high availability.
However, in the prior art, when the master node of a database cluster becomes abnormal, the master node usually has to be re-created manually before it can be used again, and after the node is re-created its IP address is likely to change. On this basis, if the latest IP address of the newly created node cannot be obtained in time, communication between the nodes cannot be completed normally even after the node is created successfully. Therefore, the manual approach to repairing a failed master node in the prior art leads to low failure repair efficiency for the master node.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the present application provide a node management method and device for a database cluster, and an electronic device, which aim at least to solve the technical problem in the prior art of low failure recovery efficiency when the master node of a database cluster fails.
According to one aspect of the embodiments of the present application, a node management method for a database cluster is provided, comprising: creating a plurality of nodes in a database cluster, wherein the plurality of nodes comprise a master node and at least one slave node, each node corresponds to a network address, the master node is used for providing a data writing service, and the slave nodes are used for providing a data reading service; monitoring, through a target container in the master node, whether the master node becomes abnormal during operation; when the master node is abnormal, determining the master node to be a node to be processed, and selecting one slave node from the at least one slave node as a new master node; re-creating the node to be processed, and deleting the network address corresponding to the node to be processed; and after the node to be processed is successfully rebuilt, acquiring a target network address, and determining, according to the target network address, the successfully rebuilt node to be processed as a new slave node, wherein the target network address is the network address re-allocated to the node to be processed after it is successfully rebuilt.
Further, the node management method of the database cluster further comprises the following steps: creating a plurality of computing units in a database cluster, wherein each computing unit is assigned a network address; randomly selecting one computing unit from the plurality of computing units as a master node, and determining other computing units as slave nodes, wherein the other computing units are all the computing units except the master node.
Further, the node management method of the database cluster further comprises the following steps: after a computing unit is randomly selected from a plurality of computing units to serve as a main node, whether a newly added computing unit exists in a database cluster is detected; under the condition that a newly added computing unit is detected to exist in the database cluster, acquiring a network address corresponding to the newly added computing unit; and determining the newly added computing unit as a new slave node in the database cluster according to the network address corresponding to the newly added computing unit.
Further, the node management method of the database cluster further comprises the following steps: after one computing unit is randomly selected from the plurality of computing units to serve as the master node, monitoring, through a target container on each slave node, whether the database process on each slave node is abnormal; when the database process on any slave node is abnormal, updating the process identifier corresponding to the slave node to a first identifier; and determining the slave node carrying the first identifier as a first slave node, and re-creating the first slave node.
Further, the node management method of the database cluster further comprises the following steps: deleting the network address corresponding to the first slave node; after the first slave node is successfully rebuilt, acquiring a first network address, wherein the first network address is a network address which is re-allocated to the first slave node after the first slave node is successfully rebuilt; and adding the successfully reconstructed first slave node into the database cluster according to the first network address.
Further, the node management method of the database cluster further comprises the following steps: after a computing unit is randomly selected from a plurality of computing units to serve as a master node, monitoring whether communication between each slave node and the master node is abnormal or not through a target container on each slave node; when the communication between any slave node and the master node is abnormal, updating the communication state identifier corresponding to the slave node as a second identifier; determining the slave node carrying the second identifier as a second slave node, and forbidding the second slave node to continue providing data reading service; the second slave node is removed from the database cluster.
Further, the node management method of the database cluster further comprises the following steps: the master node and each slave node store the full data; a target network address is obtained through a target container on the new master node; the successfully rebuilt node to be processed is determined, according to the target network address, to be a newly added target slave node in the database cluster; and all data stored in the new master node is synchronized to the target slave node.
According to another aspect of the embodiments of the present application, there is also provided a node management apparatus for a database cluster, comprising: a node creating module, configured to create a plurality of nodes in a database cluster, wherein the plurality of nodes comprise a master node and at least one slave node, each node corresponds to a network address, the master node is used for providing a data writing service, and the slave nodes are used for providing a data reading service; a monitoring module, configured to monitor, through a target container in the master node, whether the master node becomes abnormal during operation; a first determining module, configured to determine the master node as a node to be processed when the master node is abnormal, and to select one slave node from the at least one slave node as a new master node; a node rebuilding module, configured to re-create the node to be processed and delete the network address corresponding to the node to be processed; and a second determining module, configured to acquire a target network address after the node to be processed is successfully rebuilt, and to determine, according to the target network address, the successfully rebuilt node to be processed as a new slave node, wherein the target network address is the network address re-allocated to the node to be processed after it is successfully rebuilt.
According to another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium, in which a computer program is stored, where the computer program is configured to execute the above-mentioned node management method of a database cluster when running.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, comprising one or more processors and a storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to carry out the above-mentioned node management method of a database cluster.
In the present application, an abnormal master node is automatically rebuilt and one slave node is selected from at least one slave node as a new master node. Specifically, a plurality of nodes are first created in a database cluster, the plurality of nodes comprising one master node and at least one slave node, each node corresponding to a network address, the master node being used to provide a data writing service and the slave nodes being used to provide a data reading service. Whether the master node becomes abnormal during operation is then monitored through a target container in the master node; when the master node is abnormal, the master node is determined to be a node to be processed, and one slave node is selected from the at least one slave node as a new master node. The node to be processed is then re-created, and its corresponding network address is deleted. After the node to be processed is successfully rebuilt, a target network address is acquired, and the successfully rebuilt node to be processed is determined, according to the target network address, to be a new slave node, where the target network address is the network address re-allocated to the node to be processed after it is successfully rebuilt.
According to the method and device, whether the master node becomes abnormal during operation is automatically monitored through the target container in the master node, and the abnormal master node is automatically re-created, so that the re-creation efficiency of the master node is improved. Meanwhile, when the master node is abnormal, one slave node can be selected from the at least one slave node as a new master node, so that interruption of service data processing can be avoided to the greatest extent, and the stability of the database cluster in processing data is improved. In addition, after the failed master node is successfully rebuilt, the target network address is obtained and the successfully rebuilt node is used as a new slave node according to the target network address, so that the purpose of automatic failover is achieved, service interruption caused by an abnormal master node is avoided, and the fault repair efficiency for abnormal nodes is improved.
Therefore, the technical solution of the application achieves the purpose of automatic failover when the master node fails, thereby achieving the technical effect of improving the operational stability of the database service and further solving the technical problem of low fault repair efficiency in the prior art when the master node of a database cluster fails.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of an alternative deployment method for a high availability cluster of a database according to the prior art;
FIG. 2 is a schematic diagram of another alternative deployment method for a high-availability database cluster according to the prior art;
FIG. 3 is a flow chart of an alternative method of node management for a database cluster according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a construction method of an etcd database cluster according to an embodiment of the present application;
FIG. 5 is a block diagram of an alternative database cluster according to an embodiment of the present application;
FIG. 6 is a schematic diagram of an alternative node management apparatus for a database cluster according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
With the development of cloud technology, moving services to the cloud has become a major trend. Services deployed on Kubernetes may rely on the capabilities provided by Kubernetes to achieve high availability. For example, for a service deployed with a Deployment or StatefulSet, Kubernetes will always maintain a fixed number of pods in the cluster, and when a pod fails, the failed pod is killed and re-created in time, thereby achieving high availability of the service. During operation, data is often stored in databases such as etcd, mongo and clickhouse, and once such a database fails, the services running on Kubernetes become abnormal. This requires that databases such as etcd, mongo and clickhouse can also be deployed in a Kubernetes environment with high availability.
Fig. 1 is a schematic diagram of an alternative deployment method for a high-availability database cluster according to the prior art. As shown in fig. 1, the high-availability database cluster is an etcd cluster with three nodes, each node deployed on its own physical host running CentOS 7. In the prior art, information such as the IP address of each node in the database cluster needs to be configured in a configuration file in advance, and this method relies on the capability of the cluster itself to achieve high availability. The specific structure is as follows:
(1) The etcd cluster is deployed on three physical hosts.
(2) The physical host 1 serves as a master node, the physical host 2 serves as a node1 node, and the physical host 3 serves as a node2 node.
(3) And each physical host runs an etcd process to jointly form an etcd cluster.
(4) When a certain physical host goes down, the other two nodes can still operate normally, and the etcd cluster can still provide services externally.
The prior art needs at least the following procedures when deploying the etcd cluster in fig. 1:
1. Installing the etcd service: etcd is installed on each of the three physical hosts with the command 'yum -y install etcd'.
2. Modifying the master node configuration file: after the etcd service is installed, there is a configuration file etcd.conf under the /etc/etcd directory. This file is modified on the master node, and information such as the IP of each node of the cluster is configured in the master node's configuration file.
3. Modifying the node1 and node2 configuration files: the configuration files on node1 and node2 require setting the ETCD_INITIAL_CLUSTER_STATE attribute to "existing" (an illustrative configuration is sketched after step 5 below).
4. Starting the etcd service on the master node: the start commands are as follows:
systemctl daemon-reload
systemctl start etcd
systemctl enable etcd
5. Starting the etcd services on node1 and node2: the start commands are executed on node1 and node2, respectively, to start the etcd service.
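For reference, the following is a minimal sketch of the kind of static etcd.conf configuration that steps 2 and 3 describe; the member names and IP addresses are purely illustrative and are not taken from the patent:

# /etc/etcd/etcd.conf on the master host (illustrative values only)
ETCD_NAME="master"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="http://192.168.1.10:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.1.10:2379,http://127.0.0.1:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.1.10:2380"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.1.10:2379"
# every member's IP must be known and listed here in advance
ETCD_INITIAL_CLUSTER="master=http://192.168.1.10:2380,node1=http://192.168.1.11:2380,node2=http://192.168.1.12:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"   # node1 and node2 set this to "existing"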
Common database types all provide cluster deployment modes based on physical hosts and rely on the cluster to achieve high availability. However, as can be seen from the deployment process described above, such a physical-host-based database deployment has the following disadvantages:
(1) The deployment process is very cumbersome: the IP of every node must be known in advance, configuration information such as the IPs must be configured on each node in advance, and the database service (e.g., the etcd service) must then be started manually in order, the master node first and then the slave nodes.
(2) No automatic failover: when a node fails, it cannot be recovered automatically. The fault must be repaired manually and the database service restarted before the node can continue to provide service normally.
(3) Node expansion is relatively complex: when nodes need to be added, because the IP of each node in the cluster is configured in advance, the configuration files must be modified and the database service restarted before the change takes effect, which is a cumbersome operation.
Fig. 2 is a schematic diagram of another alternative deployment method for a high-availability database cluster according to the prior art. As shown in fig. 2, the method first deploys a pod on Kubernetes, and then deploys and starts a database service (e.g., the etcd service) as a container inside the pod. The pod is the smallest computing unit in a Kubernetes application.
The prior art needs at least the following procedures when deploying the etcd cluster in fig. 2:
1. In a Kubernetes cluster, three physical hosts are deployed, and the etcd service is deployed on physical host 2.
2. A Service is deployed on the Kubernetes cluster, backed by a Deployment whose number of replicas is 1; that is, Kubernetes always maintains one etcd pod in the cluster environment, and when the pod fails it is automatically rebuilt. In addition, a container is arranged in the pod, the start process of etcd is arranged in the container, and after the pod is created the etcd start process is automatically pulled up, so that the etcd service is provided externally (a minimal manifest of this shape is sketched below).
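The following is a minimal sketch of the single-replica deployment described in step 2, expressed as a Kubernetes manifest; the resource names, labels and image tag are illustrative assumptions rather than the patent's exact manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd
spec:
  replicas: 1                  # Kubernetes keeps exactly one etcd pod alive
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd
        image: quay.io/coreos/etcd:v3.5.0    # illustrative image
        command: ["etcd", "--data-dir=/var/lib/etcd"]
        ports:
        - containerPort: 2379  # client port
        - containerPort: 2380  # peer port
---
apiVersion: v1
kind: Service
metadata:
  name: etcd
spec:
  selector:
    app: etcd
  ports:
  - name: client
    port: 2379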
It should be noted that, according to the method in fig. 2, the etcd service is deployed in a pod of Kubernetes. Although the configuration is simplified and the pod can be managed using the capabilities of Kubernetes, the following problems still exist:
(1) It is difficult to build a cluster: when the pod fails during operation, its IP changes after Kubernetes rebuilds it, so the IP information configured in advance becomes wrong. Because of this problem a cluster is difficult to establish, and only a single-node etcd service can be deployed. Moreover, with a single-node etcd deployment, the etcd service is unavailable while the pod is being rebuilt, so real high availability cannot be achieved.
(2) No automatic capacity expansion: as the amount of stored data grows and more nodes are needed to store data, this deployment cannot realize automatic capacity expansion.
As can be seen from the two prior-art deployment methods for a high-availability database cluster, the following problems often occur when a database cluster is deployed on Kubernetes in the prior art:
(1) The IP address of the pod changes frequently: because the IP of a pod on Kubernetes changes when it is rebuilt, and the IPs of the database cluster members need to be configured when the database cluster is deployed, communication between the cluster nodes becomes abnormal after a pod is rebuilt.
(2) Lack of a health check mechanism: when a certain node fails, the failure cannot be detected in time, and automatic failover and node recovery cannot be realized.
(3) No automatic capacity expansion: as the amount of stored data grows, more nodes are needed to store the data, but after the database is deployed on Kubernetes, nodes cannot be expanded in time according to need.
In order to solve the above-mentioned problems, the embodiments of the present application provide an embodiment of a node management method for a database cluster, and it should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system such as a set of computer executable instructions, and although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in an order different from the order shown.
Fig. 3 is a flowchart of an alternative node management method for a database cluster according to an embodiment of the present application, and as shown in fig. 3, the method includes the following steps:
step S301, a plurality of nodes are created in the database cluster.
In step S301, the plurality of nodes includes a master node and at least one slave node, each node corresponding to a network address, the master node being configured to provide a data write service, and the slave node being configured to provide a data read service.
Specifically, the database cluster may be various types of database clusters, such as an etcd database cluster, a mongo database cluster, a clickhouse database cluster, and the like, which are determined according to the type of database provided. For convenience of description, the following description is made by taking the etcd database cluster as an example.
In the present application, Kubernetes is used as the execution subject of the node management method of the database cluster according to the embodiments of the present application. First, a plurality of computing-unit pods are created in the etcd database cluster through Kubernetes, where each pod is located on a physical host, each pod is a node, and each pod is assigned an independent IP address. In addition, it should also be noted that each pod internally consists of two containers. One is the etcd container, used for managing the etcd service process, for example for starting the etcd service on the node. The other container is the sidecar container, which is responsible for dynamic configuration management, health status monitoring, node expansion and failover of the whole etcd database cluster.
Step S302, whether the main node is abnormal or not in the running process is monitored through a target container in the main node.
In step S302, through the sidecar container on the pods, Kubernetes selects one pod from the created plurality of pods as the master node according to a certain master-selection rule, and uses the remaining pods as slave nodes. It should be noted that the target container is the sidecar container in the node. Through the sidecar container in the master node, Kubernetes monitors whether the master node becomes abnormal during operation, for example whether the etcd service on the master node becomes abnormal.
Step S303, when the master node is abnormal, the master node is determined to be a node to be processed, and one slave node is selected from at least one slave node to be used as a new master node.
In step S303, when the sidecar container in the master node detects that the master node is abnormal during operation, it reports the abnormality information of the master node to Kubernetes, and Kubernetes then determines the abnormal master node as the node to be processed and kills and rebuilds this node. Meanwhile, in order to ensure that the database cluster can continue normal service processing without being affected, Kubernetes relies on the master-selection rule of the database cluster itself to select one slave node from the at least one slave node as the new master node, thereby achieving the purpose of automatic failover.
And step S304, recreating the nodes to be processed and deleting the network addresses corresponding to the nodes to be processed.
In step S304, Kubernetes kills and rebuilds the node to be processed, and deletes the network address corresponding to the node to be processed in the apiserver component, where the apiserver component is configured to store the network address corresponding to each node. The network address in this application is an IP address.
Step S305, after the nodes to be processed are successfully rebuilt, the target network address is obtained, and the nodes to be processed after being successfully rebuilt are determined to be a new slave node according to the target network address.
In step S305, the target network address is a network address to which the node to be processed is reallocated after the reestablishment is successful.
It should be noted that, after the node to be processed is successfully rebuilt, a new IP address is usually allocated to it by Kubernetes. In order to ensure that the successfully rebuilt node to be processed can communicate with the new master node, Kubernetes obtains the target network address through detection by the sidecar container of the new master node, and then, according to the target network address, uses the successfully rebuilt node to be processed as a new slave node in the database cluster, thereby restoring the number of slave nodes to the same number as before.
It is easily understood that, under normal conditions, the master node of the etcd cluster provides the write service and the slave nodes provide the read service; if the master node goes down and cannot provide service, the database cluster becomes unavailable. In the present application, when the master node fails, the database cluster reselects a new master node, and the pod where the original master node was located is killed and rebuilt, so that the reselected master node can detect the change of the pod's IP address, remove the IP address of the old pod and add the IP address of the new pod, thereby realizing automatic failover of the master node.
Based on the contents of steps S301 to S305 above, the present application adopts a manner of automatically rebuilding an abnormal master node and selecting one slave node from at least one slave node as the new master node. First, a plurality of nodes are created in a database cluster, the plurality of nodes comprising one master node and at least one slave node, each node corresponding to a network address, the master node being used to provide a data writing service and the slave nodes being used to provide a data reading service. Then, whether the master node becomes abnormal during operation is monitored through a target container in the master node; when the master node is abnormal, the master node is determined to be a node to be processed, and one slave node is selected from the at least one slave node as a new master node. Next, the node to be processed is re-created, and the network address corresponding to the node to be processed is deleted. After the node to be processed is successfully rebuilt, a target network address is acquired, and the successfully rebuilt node to be processed is determined, according to the target network address, to be a new slave node, where the target network address is the network address re-allocated to the node to be processed after it is successfully rebuilt.
According to the method and device, whether the master node becomes abnormal during operation is automatically monitored through the target container in the master node, and the abnormal master node is automatically re-created, so that the re-creation efficiency of the master node is improved. Meanwhile, when the master node is abnormal, one slave node can be selected from the at least one slave node as a new master node, so that interruption of service data processing can be avoided to the greatest extent, and the stability of the database cluster in processing data is improved. In addition, after the failed master node is successfully rebuilt, the target network address is obtained and the successfully rebuilt node is used as a new slave node according to the target network address, so that the purpose of automatic failover is achieved, service interruption caused by an abnormal master node is avoided, and the fault repair efficiency for abnormal nodes is improved.
Therefore, the technical solution of the application achieves the purpose of automatic failover when the master node fails, thereby achieving the technical effect of improving the operational stability of the database service and further solving the technical problem of low fault repair efficiency in the prior art when the master node of a database cluster fails.
In an alternative embodiment, to create the plurality of nodes in the database cluster, Kubernetes first creates a plurality of computing units in the database cluster, where each computing unit is assigned a network address. Then, Kubernetes randomly selects one computing unit from the plurality of computing units as the master node, and determines the other computing units as slave nodes, where the other computing units are all the computing units except the master node.
Optionally, the computing unit is a pod. Fig. 4 is a schematic diagram of a construction method of an etcd database cluster according to an embodiment of the present application. As shown in fig. 4, when the etcd database cluster is deployed, Kubernetes first creates three etcd pods in the cluster environment, where the IP addresses of the three pods are all stored in the apiserver component.
Further, the sidecar container inside each pod queries the IP address of its own pod from the apiserver component. Then, through the sidecar container on the pods, Kubernetes selects one pod from the three pods as the master node according to a certain master-selection rule, and the sidecar container on the master node pulls up the etcd process in the master node, thereby completing the initialization of the master node.
Subsequently, the master node adds one pod of the remaining two pods to the database cluster as a slave node, and the slave node added to the database cluster pulls up the etcd process in its pod, thereby completing the initialization of that slave node. Finally, the master node adds the last remaining pod to the database cluster, that node pulls up the etcd process of its pod, all nodes complete initialization, the etcd database cluster formed by the three nodes also completes initialization, and business processes can normally access the etcd service through the etcd database cluster (the member-joining step is illustrated below).
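For illustration only, the member-joining step that the master node performs for each remaining pod can be pictured with standard etcdctl commands; the pod IPs and member name below are assumptions, and the patent does not prescribe this exact invocation:

# executed by the master node's sidecar (illustrative IPs; etcd v3 API assumed)
export ETCDCTL_API=3
etcdctl --endpoints=http://10.244.0.10:2379 member add etcd-1 --peer-urls=http://10.244.0.11:2380
# the joining pod then starts etcd with ETCD_INITIAL_CLUSTER_STATE="existing"
etcdctl --endpoints=http://10.244.0.10:2379 member list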
In an alternative embodiment, in order to implement the automatic capacity expansion function of the database cluster, after one computing unit is randomly selected from the plurality of computing units as the master node, Kubernetes may detect, through the sidecar container in the master node, whether a newly added computing unit exists in the database cluster. When a newly added computing unit is detected in the database cluster, the sidecar container in the master node obtains the network address corresponding to the newly added computing unit, and determines the newly added computing unit as a new slave node in the database cluster according to that network address.
As shown in fig. 4, starting from the moment the master node is elected, the sidecar container on the master node continuously monitors the cluster for pod changes. Specifically, when there is a newly added etcd pod, the master node adds that etcd pod to the database cluster as a slave node.
Specifically, when the number of nodes of the database cluster needs to be expanded, the user only needs to modify the cluster replica-count parameter in Kubernetes (for example, adding 1 to the original replica-count parameter); Kubernetes then creates a new pod, which is detected by the sidecar container on the master node and added to the database cluster as a new slave node, thereby realizing automatic expansion of the cluster. It should be noted that the cluster replica-count parameter characterizes the number of nodes in the database cluster; for example, when the replica-count parameter is 3, the database cluster contains three nodes, and when it is 5, the database cluster contains 5 nodes.
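A minimal sketch of the replica-count change described above, assuming the cluster is managed by a StatefulSet named etcd (the resource name is an assumption):

# grow the cluster from 3 to 4 nodes; the new pod is detected by the
# master node's sidecar and joined to the cluster as a slave automatically
kubectl scale statefulset etcd --replicas=4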
In an alternative embodiment, after one computing unit is randomly selected from the plurality of computing units as the master node, Kubernetes also monitors, through the target container on each slave node, whether the database process on that slave node is abnormal. When the database process on any slave node is abnormal, Kubernetes updates the process identifier corresponding to that slave node to a first identifier, determines the slave node carrying the first identifier as the first slave node, and re-creates the first slave node.
Optionally, as shown in fig. 4, starting from the moment the slave nodes are selected, the sidecar container on each slave node also continuously performs health checks on the database process of its slave node. Specifically, if the database process on a slave node becomes abnormal, the sidecar container on that slave node sets the process identifier liveness corresponding to the slave node to the first identifier false, and Kubernetes then kills and rebuilds the slave node according to the first identifier.
It should be noted that, when the first slave node is re-created, Kubernetes deletes the network address corresponding to the first slave node from the apiserver component, and, after the first slave node is successfully rebuilt, obtains the first network address through the sidecar container in the master node, where the first network address is the network address newly allocated to the first slave node after the rebuild succeeds. Finally, through the sidecar container in the master node, Kubernetes adds the successfully rebuilt first slave node to the database cluster according to the first network address.
In an alternative embodiment, in order to monitor the health state of the network communication between the slave nodes and the master node, after one computing unit is randomly selected from the plurality of computing units as the master node, Kubernetes monitors, through the target container on each slave node, whether the communication between that slave node and the master node is abnormal. When the communication between any slave node and the master node is abnormal, Kubernetes updates the communication state identifier corresponding to that slave node to a second identifier, determines the slave node carrying the second identifier as the second slave node, and prohibits the second slave node from continuing to provide the data reading service. Finally, Kubernetes removes the second slave node from the database cluster.
As shown in fig. 4, starting from the moment the master node is elected, the sidecar container on each slave node continuously monitors whether the communication between that slave node and the master node is abnormal. Specifically, if the communication between a certain slave node and the master node is abnormal, the sidecar container on that slave node updates the slave node's communication state identifier readiness to the second identifier false. A slave node marked with the second identifier will not be assigned data reading requests, that is, the slave node carrying the second identifier is prohibited from continuing to provide the data reading service. Finally, the slave node marked with the second identifier is also removed from the database cluster.
It should be noted that, if the communication service of the master node is abnormal, the communication state identifier readiness of the master node is also updated to the second identifier false. Since there is only one master node responsible for the data writing service, in order to ensure that the data writing service is not interrupted, Kubernetes relies on the master-selection rule of the database cluster itself to select one slave node from the at least one slave node as a new master node; meanwhile, the master node marked with the second identifier will not be assigned data writing requests, and Kubernetes also removes the master node marked with the second identifier from the database cluster.
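One plausible way to surface the liveness and readiness identifiers described above is through standard Kubernetes probes on the etcd container; this mapping is an assumption for illustration, since the patent only states that the sidecar sets the identifiers to false:

# illustrative probe configuration (container spec fragment)
livenessProbe:                  # failure causes Kubernetes to kill and rebuild the pod
  exec:
    command: ["sh", "-c", "etcdctl endpoint health"]
  periodSeconds: 10
readinessProbe:                 # failure stops requests from being routed to the node
  exec:
    command: ["sh", "-c", "etcdctl endpoint health"]
  periodSeconds: 5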
In an alternative embodiment, in order to implement complete data backup, the master node and each slave node all store the full data. When the master node becomes abnormal, for example goes down, Kubernetes selects one slave node from the at least one slave node as the new master node; since each slave node also stores the full data, the data in the new master node is the complete full data with nothing missing. In addition, when the original master node is rebuilt and then joins the database cluster as a new slave node, the new master node synchronizes the full data to this new slave node, thereby realizing data backup for the cluster.
Specifically, Kubernetes obtains the target network address through the target container on the new master node, determines the successfully rebuilt node to be processed as a newly added target slave node in the database cluster according to the target network address, and finally synchronizes all data stored in the new master node to the target slave node.
In an alternative embodiment, fig. 5 is a structural diagram of an alternative database cluster according to an embodiment of the present application. As shown in fig. 5, one pod is created on each of three physical hosts by Kubernetes. The inside of each etcd pod consists of two containers: one is the etcd container and the other is the sidecar container. The etcd container is responsible for starting the etcd service, and the sidecar container is responsible for the dynamic configuration management, health status monitoring, node expansion, failover and the like of the whole cluster.
Based on the database cluster structure shown in fig. 5, the database cluster can provide a Service externally, behind which is a cluster deployed using a StatefulSet, generally with 3 pods. There are two containers in each pod, one being the etcd container and the other being the sidecar container (a minimal manifest of this layout is sketched below).
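The following is a minimal sketch of the two-container StatefulSet layout described above; the resource names, labels, images (in particular the sidecar image) and ports are illustrative assumptions, not the patent's actual manifest:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: etcd
spec:
  serviceName: etcd
  replicas: 3                    # one pod per node of the database cluster
  selector:
    matchLabels:
      app: etcd
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
      - name: etcd               # manages the etcd service process
        image: quay.io/coreos/etcd:v3.5.0        # illustrative image
        ports:
        - containerPort: 2379    # client port
        - containerPort: 2380    # peer port
      - name: sidecar            # dynamic configuration, health checks, scaling, failover
        image: example.com/etcd-sidecar:latest   # hypothetical sidecar image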
Specifically, the business logic in the sidecar container runs in a loop body and monitors the database pods in the whole Kubernetes environment in real time. When all 3 pods have started, Kubernetes selects a master node from the 3 pods according to a certain master-selection rule through the sidecar container on the pods and pulls up the database process of the master node.
Through the sidecar container on the master node, Kubernetes monitors pod changes in the Kubernetes cluster, compares the pod IP addresses queried from the apiserver component with the IP addresses of the member nodes in the database cluster, and adds members that exist as pods but have not yet joined the database cluster. Nodes in the database cluster that are not healthy (i.e., nodes whose communication state identifier is the second identifier) are removed from the cluster by Kubernetes.
In addition, for a node that has joined the cluster, its corresponding database process needs to be pulled up.
Finally, the sidecar container in each node monitors the state of the database process in that node in real time; once the database process of a certain node becomes abnormal, the node's sidecar container sets the node's process identifier liveness to the first identifier false, and Kubernetes then kills and rebuilds the node. Meanwhile, the sidecar container in each node also monitors the node's own communication state in real time; for a node with abnormal communication, the node's sidecar container sets the node's communication state identifier readiness to the second identifier false, and data read/write requests for that node will then no longer be distributed to it.
Optionally, in one application scenario, in order to improve deployment efficiency, when a high-availability cluster of the etcd service is deployed according to the technical solution of the present application, a StatefulSet may be used for deployment with three replicas, each pod containing two containers: one container holds the etcd start process and the other holds the sidecar process. The business logic with which the sidecar monitors the nodes can be implemented in golang code and compiled into an executable binary file, the binary file is then packaged into a sidecar container image, and the sidecar binary is executed automatically once the pod is created successfully, thereby realizing management of the whole etcd cluster and functions such as automatic failover, health status monitoring, automatic capacity expansion and data backup of the etcd cluster (a simplified sketch of such a loop is given below).
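Purely as an illustration of the kind of reconciliation loop described above, the following is a highly simplified golang sketch; the namespace, label selector, endpoints, client libraries used and the overall control flow are assumptions, not the patent's actual sidecar code:

// Simplified sidecar reconciliation loop (illustrative only): it compares the
// etcd pods known to the apiserver with the etcd member list and adds missing members.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // the sidecar runs inside the pod
	if err != nil {
		panic(err)
	}
	k8sClient, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	etcdClient, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // local etcd container in the same pod
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer etcdClient.Close()

	for { // the sidecar's business logic runs in a loop body
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)

		// 1. query the current database pod IPs from the apiserver
		pods, perr := k8sClient.CoreV1().Pods("default").List(ctx,
			metav1.ListOptions{LabelSelector: "app=etcd"})

		// 2. query the current etcd cluster member list
		members, merr := etcdClient.MemberList(ctx)

		if perr == nil && merr == nil {
			memberPeers := map[string]bool{}
			for _, m := range members.Members {
				for _, u := range m.PeerURLs {
					memberPeers[u] = true
				}
			}
			// 3. any pod that exists but is not yet a cluster member is added
			for _, p := range pods.Items {
				if p.Status.PodIP == "" {
					continue
				}
				peerURL := fmt.Sprintf("http://%s:2380", p.Status.PodIP)
				if !memberPeers[peerURL] {
					etcdClient.MemberAdd(ctx, []string{peerURL})
				}
			}
		}
		cancel()
		time.Sleep(5 * time.Second) // next round of membership/health checks
	}
}

In practice the same loop would also remove etcd members whose pods have disappeared and pull up the local database process, as described above; those branches are omitted here for brevity.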
As can be seen from the above, the technical solution of the present application can at least provide the following technical effects:
(1) The deployment mode is simple and reusable: when the database cluster is deployed, only one yaml file is needed to complete the deployment, which avoids various cumbersome configurations, and the same file can be reused on other Kubernetes clusters.
(2) Automatic cluster expansion: the whole database cluster is managed through the sidecar container, so automatic expansion of the database cluster can be achieved. When nodes need to be added to meet data storage requirements, only the cluster replica-count parameter needs to be modified, and the sidecar program automatically adds the corresponding nodes to the cluster.
(3) Real-time health status monitoring: the sidecar container monitors the health status of each node in the database cluster in real time. When a node cannot provide service normally because of the network or similar causes, its readiness is set to false, and Kubernetes will then not distribute requests to that node. When the node's own etcd process fails, Kubernetes rebuilds the node.
(4) Automatic failover and node recovery: when the master node fails, the cluster fails over by automatically reselecting a master node, and the failed master node is rebuilt and added to the database cluster as a new slave node, thereby realizing automatic recovery of the node.
(5) High resource usage efficiency: because the database is deployed on a Kubernetes cluster and Kubernetes optimizes resource scheduling within the cluster, capacity can be expanded or shrunk according to the CPU and memory usage on the nodes, so resource usage efficiency is high.
Example 2
According to an embodiment of the present application, an embodiment of a node management apparatus for a database cluster is further provided, and as shown in fig. 6, the apparatus includes: a node creating module 601, configured to create a plurality of nodes in a database cluster, where the plurality of nodes include a master node and at least one slave node, each node corresponds to a network address, the master node is configured to provide a data writing service, and the slave node is configured to provide a data reading service; the monitoring module 602 is configured to monitor whether an exception occurs in an operation process of the host node through a target container in the host node; a first determining module 603, configured to determine, when a master node is abnormal, that the master node is a node to be processed, and select a slave node from at least one slave node as a new master node; a node rebuilding module 604, configured to rebuild a node to be processed, and delete a network address corresponding to the node to be processed; a second determining module 605, configured to obtain a target network address after the to-be-processed node is successfully reconstructed, and determine, according to the target network address, that the to-be-processed node that is successfully reconstructed is a new slave node, where the target network address is a network address that is newly allocated to the to-be-processed node that is successfully reconstructed.
Specifically, the database cluster may be various types of database clusters, such as an etcd database cluster, a mongo database cluster, a clickhouse database cluster, and the like, which are determined according to the type of database provided. For convenience of description, the following description is given by taking an etcd database cluster as an example.
In the present application, Kubernetes is used as the execution subject of the node management method of the database cluster according to the embodiments of the present application. First, a plurality of computing-unit pods are created in the etcd database cluster through Kubernetes, where each pod is located on a physical host, each pod is a node, and each pod is assigned an independent IP address. In addition, it should be noted that each pod internally consists of two containers. One is the etcd container, used for managing the etcd service process, for example for starting the etcd service on the node. The other container is the sidecar container, which is responsible for dynamic configuration management, health status monitoring, node expansion and failover of the whole etcd database cluster.
In an alternative embodiment, through the sidecar container on the pods, Kubernetes may select one pod from the created plurality of pods as the master node according to a certain master-selection rule and use the remaining pods as slave nodes. It should be noted that the target container is the sidecar container in the node. Through the sidecar container in the master node, Kubernetes monitors whether the master node becomes abnormal during operation, for example whether the etcd service on the master node becomes abnormal.
When the sidecar container in the master node detects that the master node is abnormal during operation, it reports the abnormality information of the master node to Kubernetes, and Kubernetes then determines the abnormal master node as the node to be processed and kills and rebuilds this node. Meanwhile, in order to ensure that the database cluster can continue normal service processing without being affected, Kubernetes relies on the master-selection rule of the database cluster itself to randomly select one slave node from the at least one slave node as the new master node, thereby achieving the purpose of automatic failover.
It should be noted that when Kubernetes kills and rebuilds the node to be processed, the network address corresponding to the node to be processed is also deleted in the apiserver component, where the apiserver component is used to store the network address corresponding to each node. The network address in this application is an IP address.
After the node to be processed is successfully rebuilt, a new IP address is generally allocated to it by Kubernetes. In order to ensure that the successfully rebuilt node to be processed can communicate with the new master node, Kubernetes obtains the target network address through detection by the sidecar container of the new master node, and then, according to the target network address, uses the successfully rebuilt node to be processed as a new slave node in the database cluster, thereby restoring the number of slave nodes to the same number as before.
It is easily understood that, under normal conditions, the master node of the etcd cluster provides the write service and the slave nodes provide the read service; if the master node goes down and cannot provide service, the database cluster becomes unavailable. In the present application, when the master node fails, the database cluster reselects a new master node, and the pod where the original master node was located is killed and rebuilt, so that the reselected master node can detect the change of the pod's IP address, remove the IP address of the old pod and add the IP address of the new pod, thereby realizing automatic failover of the master node.
Optionally, the node creating module further includes: the device comprises a first creating unit and a first determining unit. The first creating unit is used for creating a plurality of computing units in the database cluster, wherein each computing unit is allocated with a network address; the first determining unit is used for randomly selecting one computing unit from the plurality of computing units as a master node and determining other computing units as slave nodes, wherein the other computing units are all the computing units except the master node.
In an alternative embodiment, the computing unit is a pod. Fig. 4 is a schematic diagram of a construction method of an etcd database cluster according to an embodiment of the present application. As shown in fig. 4, when the etcd database cluster is deployed, Kubernetes first creates three etcd pods in the cluster environment, where the IP addresses of the three pods are all stored in the apiserver component.
Further, the sidecar container inside each pod queries the IP address of its own pod from the apiserver component. Then, through the sidecar container on the pods, Kubernetes selects one pod from the three pods as the master node according to a certain master-selection rule, and the sidecar container on the master node pulls up the etcd process in the master node, thereby completing the initialization of the master node.
Subsequently, the master node adds one pod of the remaining two pods to the database cluster as a slave node, and the slave node added to the database cluster pulls up the etcd process in its pod, thereby completing the initialization of that slave node. Finally, the master node adds the last remaining pod to the database cluster, that node pulls up the etcd process of its pod, all nodes complete initialization, the etcd database cluster formed by the three nodes also completes initialization, and business processes can normally access the etcd service through the etcd database cluster.
Optionally, the node management apparatus of the database cluster further includes: the device comprises a detection module, a first acquisition module and a third determination module. The detection module is used for detecting whether a newly added computing unit exists in the database cluster; the first acquisition module is used for acquiring a network address corresponding to a newly added computing unit under the condition that the newly added computing unit is detected to exist in the database cluster; and the third determining module is used for determining the newly added computing unit as a new slave node in the database cluster according to the network address corresponding to the newly added computing unit.
As shown in fig. 4, once the master node has been selected, the sidecar container on the master node continuously monitors pod changes in the cluster, and when a new etcd pod appears, the master node adds it to the database cluster as a slave node.
Specifically, when the number of nodes in the database cluster needs to be expanded, the user only needs to modify the cluster replica count parameter in the Kubernetes application (for example, increasing it by 1). The Kubernetes application then creates a new pod, which is detected by the sidecar container on the master node and added to the database cluster as a new slave node, thereby realizing automatic expansion of the cluster. It should be noted that the cluster replica count parameter characterizes the number of nodes in the database cluster; for example, when the parameter is 3, the database cluster contains three nodes, and when the parameter is 5, the database cluster contains five nodes.
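A hedged sketch of this expansion trigger is shown below: using client-go, the replica count of the etcd StatefulSet is increased by one, after which Kubernetes creates the new pod and the master's sidecar joins it as described above. The StatefulSet name and the helper name scaleOut are illustrative assumptions.

```go
package sidecar

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleOut bumps the replica count of the etcd StatefulSet by one; the new
// pod created by Kubernetes is then detected by the master's sidecar and
// joined to the database cluster as a new slave node.
func scaleOut(cs kubernetes.Interface, namespace, name string) error {
	ctx := context.TODO()
	scale, err := cs.AppsV1().StatefulSets(namespace).GetScale(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas++ // e.g. from 3 to 4 nodes
	_, err = cs.AppsV1().StatefulSets(namespace).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
	return err
}
```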
Optionally, the node management apparatus of the database cluster further includes a first monitoring module, a first updating module and a slave node rebuilding module. The first monitoring module is used for monitoring, through the target container on each slave node, whether the database process on that slave node is abnormal; the first updating module is used for updating the process identifier corresponding to a slave node to the first identifier when the database process on that slave node is abnormal; and the slave node rebuilding module is used for determining the slave node carrying the first identifier as a first slave node and re-creating the first slave node.
As shown in fig. 4, once the slave nodes have started, the sidecar container on each slave node also continuously performs health checks on the database process of its corresponding slave node. Specifically, if the database process on a slave node becomes abnormal, the sidecar container on that slave node sets the process identifier liveness of the slave node to the first identifier false, and the Kubernetes application then kills and rebuilds the slave node according to the first identifier.
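As one possible realization of this health check, the sketch below assumes the sidecar exposes an HTTP handler that the pod's liveness probe calls; a failed status query against the local etcd process makes the probe fail, which corresponds to setting liveness to false. The local endpoint and the two-second timeout are assumptions.

```go
package sidecar

import (
	"context"
	"net/http"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// livenessHandler backs the pod's liveness probe: it queries the status of
// the etcd member running in this pod, and any error makes the probe fail so
// that Kubernetes kills and rebuilds the pod.
func livenessHandler(cli *clientv3.Client) http.HandlerFunc {
	localEndpoint := "http://127.0.0.1:2379" // assumed local etcd client port
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if _, err := cli.Status(ctx, localEndpoint); err != nil {
			// Equivalent to setting the process identifier liveness to false.
			http.Error(w, "etcd process unhealthy: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK) // liveness stays true
	}
}
```

In the pod specification, the livenessProbe of the sidecar container would then point at the path served by this handler.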
Optionally, the slave node rebuilding module further includes a first deleting unit, a first acquiring unit and an adding unit. The first deleting unit is used for deleting the network address corresponding to the first slave node; the first acquiring unit is used for acquiring a first network address after the first slave node is successfully rebuilt, wherein the first network address is the network address reallocated to the first slave node after it is successfully rebuilt; and the adding unit is used for adding the successfully rebuilt first slave node to the database cluster according to the first network address.
Optionally, the node management apparatus of the database cluster further includes a second monitoring module, a second updating module, a forbidding module and a removing module. The second monitoring module is used for monitoring, through the target container on each slave node, whether the communication between that slave node and the master node is abnormal; the second updating module is used for updating the communication state identifier corresponding to a slave node to the second identifier when the communication between that slave node and the master node is abnormal; the forbidding module is used for determining the slave node carrying the second identifier as a second slave node and forbidding the second slave node from continuing to provide the data reading service; and the removing module is used for removing the second slave node from the database cluster.
As shown in fig. 4, once the master node has been elected, the sidecar container on each slave node continuously monitors whether the communication between that slave node and the master node is abnormal. Specifically, if the communication between a certain slave node and the master node is abnormal, the sidecar container on that slave node updates the communication state identifier readiness of the slave node to the second identifier false. A slave node marked with the second identifier will no longer be assigned data reading requests; that is, the slave node carrying the second identifier is prohibited from continuing to provide the data reading service. Finally, the slave node marked with the second identifier is also removed from the database cluster.
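Analogously, the readiness check could be backed by an HTTP handler such as the sketch below, which verifies that the slave can still reach the master's client endpoint; a failure corresponds to setting readiness to false, so the pod stops receiving read requests. The masterEndpoint parameter and the timeout are assumptions.

```go
package sidecar

import (
	"context"
	"net/http"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// readinessHandler backs the pod's readiness probe: it checks that this slave
// can still reach the master's client endpoint. On failure the pod is marked
// NotReady, so data-reading requests are no longer routed to it.
func readinessHandler(cli *clientv3.Client, masterEndpoint string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if _, err := cli.Status(ctx, masterEndpoint); err != nil {
			// Equivalent to setting the communication state identifier readiness to false.
			http.Error(w, "cannot reach master: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```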
It should be noted that, if the communication service of the master node is abnormal, the communication state identifier readiness of the master node is likewise updated to the second identifier false. Since there is only one master node responsible for the data writing service, in order to ensure that the data writing service is not interrupted, the Kubernetes application selects one slave node from the at least one slave node as the new master node according to the master-selection rule of the database cluster itself. Meanwhile, the master node marked with the second identifier will no longer be assigned data writing requests, and the Kubernetes application also removes the master node marked with the second identifier from the database cluster.
Optionally, the second determining module further includes a second acquiring unit, a second determining unit and a data synchronization unit. The second acquiring unit is used for acquiring the target network address through the target container on the new master node; the second determining unit is used for determining the successfully rebuilt node to be processed as a newly added target slave node in the database cluster according to the target network address; and the data synchronization unit is used for synchronizing all the data stored in the new master node to the target slave node.
Specifically, the Kubernetes application acquires the target network address through the target container on the new master node, determines the successfully rebuilt node to be processed as a newly added target slave node in the database cluster according to the target network address, and finally synchronizes all the data stored in the new master node to the target slave node.
In an alternative embodiment, fig. 5 is a structural diagram of an alternative database cluster according to an embodiment of the present application. As shown in fig. 5, the Kubernetes application creates one pod on each of three physical hosts. Each etcd pod consists of two containers: an etcd container and a sidecar container. The etcd container is responsible for starting the etcd service, while the sidecar container is responsible for dynamic configuration management, health state monitoring, node expansion, failover and the like for the whole cluster.
Based on the structure shown in fig. 5, the database cluster exposes a Service to the outside; behind the Service is a cluster deployed using a StatefulSet, typically with 3 pods. Each pod contains two containers, one being the etcd container and the other the sidecar container.
Specifically, the business logic in the sidecar container runs in a loop that monitors the database pods in the entire Kubernetes environment in real time. Once all 3 pods have started, the Kubernetes application selects a master node from the 3 pods and pulls up the database process of the master node.
Through the sidecar container on the master node, the Kubernetes application monitors pod changes on the Kubernetes cluster, compares the pod IP addresses queried from the apiserver component with the IP addresses of the member nodes in the database cluster, and adds any member that exists as a pod but has not yet been added to the database cluster. For unhealthy nodes in the database cluster (i.e., nodes whose communication state identifier is the second identifier), the Kubernetes application removes them from the cluster.
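The comparison between pod IPs and cluster members could be coded along the lines of the following Go sketch, which adds every pod IP that is not yet an etcd member; removal of unhealthy members is omitted for brevity, and the peer port and function name are assumptions of this sketch.

```go
package sidecar

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// reconcileMembers adds every pod IP reported by the apiserver that is not
// yet an etcd member, so newly created pods join the database cluster.
func reconcileMembers(cli *clientv3.Client, podIPs []string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	resp, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	known := make(map[string]bool)
	for _, m := range resp.Members {
		for _, u := range m.PeerURLs {
			known[u] = true
		}
	}
	for _, ip := range podIPs {
		peer := fmt.Sprintf("http://%s:2380", ip)
		if !known[peer] {
			if _, err := cli.MemberAdd(ctx, []string{peer}); err != nil {
				return err // pod exists but could not be joined yet; retry on the next loop
			}
		}
	}
	return nil
}
```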
In addition, for a node that has joined the cluster, its corresponding database process needs to be pulled up.
Finally, the sidecar container in each node monitors the state of the database process in that node in real time. Once the database process of a node becomes abnormal, the sidecar container of that node sets the process identifier liveness of the node to the first identifier false, and the Kubernetes application then kills and rebuilds the node. Meanwhile, the sidecar container in each node also monitors the communication state of the node in real time; for a node with abnormal communication, the sidecar container in that node sets the communication state identifier readiness of the node to the second identifier false, and from that moment data read/write requests for that node are no longer distributed to it.
Optionally, in an application scenario, in order to improve deployment efficiency, when a highly available cluster of the etcd service is deployed according to the technical scheme of the present application, a StatefulSet may be used for deployment with three replicas, each pod containing two containers: one container runs the etcd startup process and the other runs the sidecar process. The business logic by which the sidecar monitors the nodes can be implemented in golang code and compiled into an executable binary file, and the binary file is then packaged into a sidecar container image. When a pod is successfully created, the sidecar binary is executed automatically, thereby managing the whole etcd cluster and realizing functions such as automatic failover, health state monitoring, automatic capacity expansion and data backup for the etcd cluster.
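To give a rough picture of how such a binary might be structured, the skeleton below shows an outer loop that periodically repeats master election and member reconciliation, reusing the electMaster and reconcileMembers sketches above. The function name runSidecar, the app=etcd label selector, the ten-second period and the wiring of the two clients are hypothetical; the actual sidecar logic of this application is described by the embodiments, not by this sketch.

```go
package sidecar

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// runSidecar sketches the outer loop of the sidecar binary: it periodically
// re-runs master election and, if this pod is the master, reconciles the etcd
// member list against the pod IPs reported by the apiserver.
func runSidecar(cs kubernetes.Interface, cli *clientv3.Client, namespace, podName string) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		master, err := electMaster(cs, namespace)
		if err != nil {
			log.Printf("master election not possible yet: %v", err)
			continue
		}
		if master != podName {
			continue // only the master's sidecar reconciles cluster membership
		}
		pods, err := cs.CoreV1().Pods(namespace).List(context.TODO(),
			metav1.ListOptions{LabelSelector: "app=etcd"})
		if err != nil {
			log.Printf("listing etcd pods failed: %v", err)
			continue
		}
		var ips []string
		for _, p := range pods.Items {
			if p.Status.PodIP != "" {
				ips = append(ips, p.Status.PodIP)
			}
		}
		if err := reconcileMembers(cli, ips); err != nil {
			log.Printf("member reconciliation failed: %v", err)
		}
	}
}
```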
As can be seen from the above, the technical solution of the present application can bring at least the following technical effects:
(1) The deployment mode is simple and reusable: deploying the database cluster requires only a single yaml file, which avoids various complicated configurations, and the same deployment can be reused on other Kubernetes clusters.
(2) Automatic cluster expansion: the whole database cluster is managed through the sidecar container, so automatic expansion of the database cluster can be achieved. When nodes need to be added to meet data storage requirements, the user only needs to modify the cluster replica count parameter, and the sidecar program automatically adds the corresponding nodes to the cluster.
(3) Real-time health monitoring: the sidecar container monitors the health status of each node in the database cluster in real time. When a node cannot provide service to the outside normally, for example because of a network problem, the readiness of the node is set to false, and Kubernetes no longer distributes requests to that node. When the etcd process of the node itself fails, Kubernetes rebuilds the node.
(4) Automatic failover and node recovery: when the master node fails, the cluster fails over automatically by reselecting a master node, and the failed master node is rebuilt and added to the database cluster as a new slave node, thereby realizing automatic node recovery.
(5) High resource efficiency: because the database is deployed on the Kubernetes cluster and Kubernetes optimizes resource scheduling within the cluster, capacity can be expanded or contracted according to the CPU and memory usage of each node, so resources are used efficiently.
Example 3
According to another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium. Wherein a computer program is stored in a computer-readable storage medium, the computer program being arranged to execute the node management method of the database cluster in embodiment 1 described above when executed.
Example 4
According to another aspect of embodiments of the present application, there is also provided an electronic device, including one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method for running a program, wherein the program is arranged to perform the method for node management of a database cluster in embodiment 1 described above when run.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technical content can be implemented in other manners. The above-described apparatus embodiments are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or may not be executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, or the part thereof contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A node management method of a database cluster is characterized by comprising the following steps:
creating a plurality of nodes in a database cluster, wherein the plurality of nodes comprise a master node and at least one slave node, each node corresponds to a network address, the master node is used for providing data writing service, and the slave node is used for providing data reading service;
monitoring whether the master node is abnormal in the operation process through a target container in the master node;
when the master node is abnormal, determining the master node as a node to be processed, and selecting one slave node from at least one slave node as a new master node;
reestablishing the node to be processed, and deleting the network address corresponding to the node to be processed;
and after the nodes to be processed are successfully reconstructed, acquiring a target network address, and determining the nodes to be processed after successful reconstruction as a new slave node according to the target network address, wherein the target network address is a network address redistributed to the nodes to be processed after successful reconstruction.
2. The method of claim 1, wherein creating a plurality of nodes in a database cluster comprises:
creating a plurality of computing units in the database cluster, wherein each computing unit is assigned a network address;
and randomly selecting one computing unit from the plurality of computing units as the master node, and determining other computing units as the slave nodes, wherein the other computing units are all computing units except the master node in the plurality of computing units.
3. The method of claim 2, wherein after randomly selecting one of the plurality of computing units as the master node, the method further comprises:
detecting whether a newly added computing unit exists in the database cluster;
under the condition that the newly added computing unit is detected to exist in the database cluster, acquiring a network address corresponding to the newly added computing unit;
and determining the newly added computing unit as a new slave node in the database cluster according to the network address corresponding to the newly added computing unit.
4. The method of claim 2, wherein after randomly selecting one of the plurality of computing units as the master node, the method further comprises:
monitoring whether the database process on each slave node is abnormal or not through a target container on each slave node;
when the database process on any slave node is abnormal, updating the process identifier corresponding to the slave node to be a first identifier;
and determining the slave node carrying the first identifier as a first slave node, and re-creating the first slave node.
5. The method of claim 4, wherein the recreating the first slave node comprises:
deleting the network address corresponding to the first slave node;
after the first slave node is successfully rebuilt, acquiring a first network address, wherein the first network address is a network address which is reallocated to the first slave node after the first slave node is successfully rebuilt;
and adding the successfully reconstructed first slave node into the database cluster according to the first network address.
6. The method of claim 2, wherein after randomly selecting one of the plurality of computing units as the master node, the method further comprises:
monitoring whether communication between each slave node and the master node is abnormal or not through a target container on each slave node;
when the communication between any slave node and the master node is abnormal, updating the communication state identifier corresponding to the slave node as a second identifier;
determining the slave node carrying the second identifier as a second slave node, and prohibiting the second slave node from continuously providing data reading service;
removing the second slave node from the database cluster.
7. The method of claim 1, wherein the master node and each slave node store the full amount of data, and wherein acquiring a target network address and determining the node to be processed after successful reconstruction as a new slave node according to the target network address comprises:
acquiring the target network address through a target container on the new main node;
determining the nodes to be processed after the reconstruction is successful as newly-added target slave nodes in the database cluster according to the target network address;
synchronizing all data stored in the new master node to the target slave node.
8. A node management apparatus for a database cluster, comprising:
the system comprises a node creating module, a data reading module and a data processing module, wherein the node creating module is used for creating a plurality of nodes in a database cluster, the plurality of nodes comprise a main node and at least one slave node, each node corresponds to a network address, the main node is used for providing data writing service, and the slave node is used for providing data reading service;
the monitoring module is used for monitoring whether the main node is abnormal in the operation process through a target container in the main node;
the first determining module is used for determining the master node as a node to be processed when the master node is abnormal, and selecting one slave node from at least one slave node as a new master node;
the node reconstruction module is used for recreating the node to be processed and deleting the network address corresponding to the node to be processed;
and the second determining module is used for acquiring a target network address after the to-be-processed node is successfully reconstructed, and determining the to-be-processed node which is successfully reconstructed to be a new slave node according to the target network address, wherein the target network address is a network address which is newly allocated to the to-be-processed node which is successfully reconstructed.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to execute, when running, the method for node management of a database cluster according to any one of claims 1 to 7.
10. An electronic device, wherein the electronic device comprises one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method for running a program, wherein the program is arranged to perform the method of node management of a database cluster of any of claims 1 to 7 when run.
CN202211711963.3A 2022-12-29 2022-12-29 Node management method and device for database cluster and electronic equipment Pending CN115878361A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211711963.3A CN115878361A (en) 2022-12-29 2022-12-29 Node management method and device for database cluster and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211711963.3A CN115878361A (en) 2022-12-29 2022-12-29 Node management method and device for database cluster and electronic equipment

Publications (1)

Publication Number Publication Date
CN115878361A true CN115878361A (en) 2023-03-31

Family

ID=85757245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211711963.3A Pending CN115878361A (en) 2022-12-29 2022-12-29 Node management method and device for database cluster and electronic equipment

Country Status (1)

Country Link
CN (1) CN115878361A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117389684A (en) * 2023-10-13 2024-01-12 河北云在信息技术服务有限公司 SaaS multi-tenant data isolation method and system

Similar Documents

Publication Publication Date Title
CN109729129B (en) Configuration modification method of storage cluster system, storage cluster and computer system
US6003075A (en) Enqueuing a configuration change in a network cluster and restore a prior configuration in a back up storage in reverse sequence ordered
CN106331098B (en) Server cluster system
US8984330B2 (en) Fault-tolerant replication architecture
US5822531A (en) Method and system for dynamically reconfiguring a cluster of computer systems
WO2021136422A1 (en) State management method, master and backup application server switching method, and electronic device
CN103176831B (en) A kind of dummy machine system and management method thereof
JP2008059583A (en) Cluster system, method for backing up replica in cluster system, and program product
GB2484086A (en) Reliability and performance modes in a distributed storage system
CN102394914A (en) Cluster brain-split processing method and device
JP2017504880A (en) System and method for supporting persistent partition recovery in a distributed data grid
CN114466027B (en) Cloud primary database service providing method, system, equipment and medium
CN108600284B (en) Ceph-based virtual machine high-availability implementation method and system
US7373542B2 (en) Automatic startup of a cluster system after occurrence of a recoverable error
CN115878361A (en) Node management method and device for database cluster and electronic equipment
CN113986450A (en) Virtual machine backup method and device
CN105959145A (en) Method and system for parallel management server of high availability cluster
CN105323271B (en) Cloud computing system and processing method and device thereof
CN116389233B (en) Container cloud management platform active-standby switching system, method and device and computer equipment
CN110716828B (en) Database real-time backup method
WO2002001347A2 (en) Method and system for automatic re-assignment of software components of a failed host
CN107181608B (en) Method for recovering service and improving performance and operation and maintenance management system
CN110554933A (en) Cloud management platform, and cross-cloud high-availability method and system for cloud platform service
JP2009265973A (en) Data synchronization system, failure recovery method, and program
CN112202601B (en) Application method of two physical node mongo clusters operated in duplicate set mode

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination