CN114584459A

CN114584459A - Method for realizing high availability of main and standby container cloud platforms

Info

Publication number: CN114584459A
Application number: CN202210221854.7A
Authority: CN
Inventors: 石光银; 蔡卫卫; 高传集; 孙思清; 肖雪
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2022-03-07
Filing date: 2022-03-07
Publication date: 2022-06-03

Abstract

The invention relates to the field of DRBD and main/standby high availability of container cloud platforms, and in particular to a method for realizing main/standby high availability of a container cloud platform. When the main node is unavailable or suffers split-brain, the main/standby switching module of the container cloud platform's main/standby mode switches the container cloud platform control plane components to the standby node, and the backed-up metadata is transmitted to the standby node through the backup module, so that the service of the container cloud platform is rapidly recovered. The advantages are as follows: by providing the container cloud with a main/standby mode, one main node can provide the container cloud platform control plane service while one standby node backs up the container cloud platform metadata in real time; when the main node is unavailable or suffers split-brain, the container cloud platform main/standby switching model quickly switches the cloud platform control plane components to the standby node, the management capability of the cloud platform can be recovered within 1 minute, and loss of the container cloud platform metadata is prevented.

Description

Method for realizing high availability of main and standby container cloud platforms

Technical Field

The invention relates to the field of DRBD and main/standby high availability of container cloud platforms, and in particular to a method for realizing main/standby high availability of a container cloud platform.

Background

With the development of cloud computing services, cloud vendors have successively released public cloud, private cloud, edge cloud and similar services. To use a private cloud or an edge cloud, users often need to purchase resources such as physical machines and switches to build it. After purchase, the private cloud and edge cloud products can be used well only by users with cloud computing expertise. However, most users have no such expertise, and many have weak operation and maintenance capability as well. Users therefore want to purchase PaaS (Platform as a Service) products offering capabilities such as software development, microservices and API gateways, and to use these PaaS services directly to meet their business requirements.

When a software development service, an API (application programming interface) gateway service and a microservice service are delivered as PaaS products on top of a private cloud or an edge cloud, the private cloud or edge cloud must occupy as few resources as possible so that resources remain available to the PaaS services, and only local disks can be used for storing data. The locally stored data must therefore be highly reliable, and the container cloud platform must be recoverable quickly when the cloud platform fails.

DRBD is a technology that provides highly reliable local storage by synchronizing data to other nodes as a backup. However, there is currently no method for using DRBD to support main/standby disaster recovery of a container cloud.

Disclosure of Invention

The invention aims to provide a method for realizing high availability of main and standby container cloud platforms, so as to solve the problems described in the Background.

To achieve this purpose, the invention provides the following technical solution:

A method for realizing high availability of main and standby container cloud platforms, wherein the container cloud supports a main/standby mode, the main/standby mode comprises a main node and a standby node, and the main node and the standby node comprise an NFS shared file storage service, a control plane module, a main/standby switching module and a backup module;

the container cloud supports the main node providing the services of the container cloud platform control plane module;

the container cloud supports node switching by the main/standby switching module;

the container cloud supports the standby node backing up the metadata of the container cloud platform in real time through the backup module;

after the container cloud platform completes the deployment of the main node and standby node components, the main node provides the management function of the container cloud platform control plane module, and the backup module of the standby node backs up the metadata of the container cloud platform in real time;

when the main node is unavailable or suffers split-brain, the main/standby switching module of the main/standby mode of the container cloud platform switches the control plane components of the container cloud platform to the standby node, and the backed-up metadata is transmitted to the standby node through the backup module, so that the service of the container cloud platform is recovered quickly.

Preferably, the backup module is provided with an Etcd service; the Etcd service uses a single replica; the Etcd service and the stateless control plane modules run on the main node by means of the Label of the designated main node; the Etcd service manages the metadata of the container cloud platform and supports the availability of the container cloud platform control plane.

Preferably, the data produced by the Etcd service is backed up to the backup module in real time; when the main node is unavailable or suffers split-brain, the Etcd data is located on the backup module and the Etcd service is started on the standby node to provide usable metadata for the container cloud platform.

Preferably, the components assigned by the Label of the designated main node include Keepalived, CKE-advertisement, Kube-ApiServer and the like, which can run together in the main/standby mode.

Preferably, the backup module further includes a DRBD component deployment model, which supports DRBD asynchronous mode configuration, DRBD managed-disk configuration, DRBD device name configuration, and DRBD component running-node configuration.

Preferably, the DRBD component is developed using the DRBD component deployment model, and disks such as DRBD1 are created on the main node and the standby node of the container cloud platform respectively; DRBD1 is used for synchronizing the Etcd data, and synchronizing data in DRBD asynchronous mode ensures that the standby node backs up the Etcd data of the main node.

Preferably, the main/standby switching module supports shutting down the Etcd service on the main node, starting the Etcd service on the standby node, switching the main node Label to the standby node, and deleting the main node from the container cloud platform cluster, whereupon the Kubelet automatically switches the non-Etcd control plane services to the standby node.

Preferably, when the main node is unavailable, the main/standby switching module needs to execute an executable script that switches the main node to the standby node: the switching model is instantiated as an executable script, and executing this main/standby switching script completes the automatic switching of the container cloud platform control plane components to the standby node.

Preferably, the conditions under which the main node is unavailable include shutdown, damage or split-brain of the main node.

Preferably, the split-brain condition of the main node comprises: the main node cannot connect to the standby node or to any worker node; the standby node is not in split-brain, that is, the standby node cannot connect to the main node but can connect to all worker nodes; and the Etcd service of the main node is shut down.

Compared with the prior art, the invention has the following beneficial effects:

1. By providing the container cloud with a main/standby mode, the invention allows one main node to provide the container cloud platform control plane service and one standby node to back up the container cloud platform metadata in real time. When the main node is unavailable or suffers split-brain, the container cloud platform main/standby switching model quickly switches the cloud platform control plane components to the standby node, the management capability of the cloud platform can be recovered within 1 minute, and the container cloud platform metadata is not lost;

2. The control plane of the container cloud platform runs only on the main node, which reduces the resource overhead of the control plane and supports scenarios with few physical resources, such as private clouds and edge clouds. Backing up the container cloud platform metadata in real time on one standby node improves the reliability of the control plane and provides fast recovery when the container cloud platform control plane fails.

Drawings

FIG. 1 is a system architecture diagram of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, the present invention provides the following technical solution:

A method for realizing high availability of main and standby container cloud platforms comprises the following steps:

S1: the container cloud supports a main/standby mode, the main/standby mode comprises a main node and a standby node, and the main node and the standby node comprise an NFS shared file storage service, a control plane module, a main/standby switching module and a backup module;

S2: the container cloud supports the main node providing the services of the container cloud platform control plane module, supports node switching by the main/standby switching module, and supports the standby node backing up the metadata of the container cloud platform in real time through the backup module;

S3: after the container cloud platform completes the deployment of the main node and standby node components, the main node provides the management function of the container cloud platform control plane module, and the backup module of the standby node backs up the metadata of the container cloud platform in real time;

S4: when the main node is unavailable or suffers split-brain, the main/standby switching module of the main/standby mode of the container cloud platform switches the container cloud platform control plane components to the standby node, and the backed-up metadata is transmitted to the standby node through the backup module, so that the service of the container cloud platform is quickly recovered.

The embodiment of the invention provides a method for realizing high availability of main and standby container cloud platforms. The main node provides the management function of the container cloud platform control plane module, and the backup module of the standby node backs up the metadata of the container cloud platform in real time. When the main node is unavailable or suffers split-brain, the main/standby switching module of the main/standby mode of the container cloud platform switches the control plane components to the standby node, the backed-up metadata is transmitted to the standby node through the backup module, and the service of the container cloud platform is quickly recovered. Adding a standby node alongside the main node to back up the container cloud platform metadata in real time improves the reliability of the control plane and provides fast recovery when the container cloud platform control plane fails.

The backup module is provided with an Etcd service (Etcd is the database that stores the container cloud metadata):

The Etcd service manages the metadata of the container cloud platform and supports the availability of the container cloud platform control plane.

The data produced by the Etcd service is backed up to the backup module in real time, providing real-time replication and monitoring of the data.

When the main node is unavailable or suffers split-brain, the Etcd data can be found on the standby node; the Etcd service is started on the standby node and provides usable metadata for the container cloud platform. In this way the container cloud platform supports using one main node to provide the control plane module services and one standby node to back up the container cloud platform metadata in real time.

The Etcd service uses a single replica. The Etcd service and the stateless control plane modules run on the main node by means of the Label (container resource label) of the designated main node. The components assigned by this Label include Keepalived (virtual IP management service), CKE-advertisement (container resource request Webhook service), Kube-ApiServer (API server of K8S) and the like, which can run together in the main/standby mode.
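As a rough illustration of pinning a component to the main node by Label, the sketch below builds a Kubernetes Pod manifest as a plain Python dictionary whose `nodeSelector` matches a label carried only by the main node. The label key `cke.example/role`, its value `main`, the namespace and the image name are all hypothetical; the patent does not name the Label it uses, and the real platform may pin its components by a different mechanism.

```python
import json

# Hypothetical label carried only by the current main node (not taken from the patent).
MAIN_NODE_LABEL = {"cke.example/role": "main"}


def pinned_pod_manifest(name: str, image: str) -> dict:
    """Return a Pod manifest that can only be scheduled onto nodes carrying MAIN_NODE_LABEL."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": "kube-system"},
        "spec": {
            # nodeSelector restricts scheduling to nodes whose labels match every entry.
            "nodeSelector": dict(MAIN_NODE_LABEL),
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # The image reference is a placeholder.
    print(json.dumps(pinned_pod_manifest("etcd", "registry.example/etcd:latest"), indent=2))
```

With this kind of pinning, moving the label from the main node to the standby node is what redirects a recreated Pod to the standby node.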

Through the Etcd service, the main node is monitored in real time and its metadata is replicated, so that the metadata can be used after switching to the standby node.

The backup module is also provided with DRBD (Distributed Replicated Block Device):

The DRBD part includes the DRBD component deployment model, which supports DRBD asynchronous mode configuration, DRBD managed-disk configuration, DRBD device name configuration and DRBD component running-node configuration.

The DRBD component is developed using the DRBD component deployment model, and disks such as DRBD1 are created on the main and standby nodes of the cloud platform; DRBD1 is used for synchronizing the Etcd data.

Data is synchronized in DRBD asynchronous mode, which ensures that the standby node backs up the Etcd data of the main node.
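To make the four configuration items concrete, the following sketch renders a DRBD resource definition from a small set of parameters. It assumes DRBD 8.4-style configuration syntax, in which `protocol A` selects asynchronous replication; the resource name, host names, backing disk and addresses are hypothetical placeholders, and the output should be checked against the documentation of the installed DRBD version before use.

```python
def render_drbd_resource(name, minor, backing_disk, nodes, port=7789):
    """Render a DRBD resource definition that replicates asynchronously (protocol A).

    name         -- resource name (hypothetical, e.g. "etcd")
    minor        -- device minor number; 1 yields /dev/drbd1
    backing_disk -- local block device that DRBD manages on each node
    nodes        -- mapping of host name to replication IP address
    """
    lines = [
        f"resource {name} {{",
        "  net {",
        "    protocol A;  # asynchronous replication",
        "  }",
    ]
    for host, ip in nodes.items():
        lines += [
            f"  on {host} {{",
            f"    device    /dev/drbd{minor};",
            f"    disk      {backing_disk};",
            f"    address   {ip}:{port};",
            "    meta-disk internal;",
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Host names, disk and addresses are placeholders for illustration only.
    print(render_drbd_resource(
        "etcd", 1, "/dev/sdb1",
        {"main-node": "10.0.0.1", "standby-node": "10.0.0.2"},
    ))
```

Each parameter corresponds to one item in the deployment model: the protocol line to the asynchronous mode, `disk` to the managed disk, `device` to the device name, and the `on` sections to the running nodes.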

Because the standby node is part of the container cloud platform, its state can be monitored in real time, which ensures that its data synchronization keeps working.

The main/standby switching module supports shutting down the Etcd service on the main node, starting the Etcd service on the standby node, and switching the main node Label to the standby node.

The main node is then deleted from the container cloud platform cluster, and the Kubelet automatically switches the non-Etcd control plane services to the standby node.

When the main node is unavailable, the main/standby switching module needs to execute an executable script that switches the main node to the standby node: the switching model is instantiated as an executable script, and executing this main/standby switching script automatically switches the container cloud platform control plane components to the standby node. This supports recovery of the cloud platform's management capability within 1 minute and ensures that the container cloud platform metadata is not lost.
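The idea of instantiating a switching model as an executable script can be sketched as follows. This is an illustration only, not the patented script: the label, the DRBD resource name, the mount point, the use of `ssh`, and a systemd unit named `etcd` are all assumptions, and the `kubectl` steps presuppose that an API server is reachable at the time they run. The DRBD promotion and mount steps are also assumed; the description states only that the Etcd data is found on the standby node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchoverModel:
    """Parameters of one main-to-standby switchover (all defaults are hypothetical)."""
    main_node: str
    standby_node: str
    label: str = "cke.example/role=main"
    drbd_resource: str = "etcd"
    drbd_device: str = "/dev/drbd1"
    etcd_data_dir: str = "/var/lib/etcd"


def render_switchover_script(m: SwitchoverModel) -> str:
    """Instantiate the model as shell script text intended to run on the standby node."""
    steps = [
        "#!/bin/sh",
        "set -eu",
        "# 1. Best effort: stop Etcd on the old main node (it may be unreachable).",
        f"ssh {m.main_node} systemctl stop etcd || true",
        "# 2. Promote the local DRBD replica and mount the replicated Etcd data.",
        f"drbdadm primary {m.drbd_resource}",
        f"mount {m.drbd_device} {m.etcd_data_dir}",
        "# 3. Start Etcd on the standby node.",
        "systemctl start etcd",
        "# 4. Move the main-node Label to the standby node.",
        f"kubectl label node {m.standby_node} {m.label} --overwrite",
        "# 5. Remove the old main node from the cluster.",
        f"kubectl delete node {m.main_node}",
    ]
    return "\n".join(steps) + "\n"


if __name__ == "__main__":
    # Node names are placeholders; the script is printed, not executed.
    print(render_switchover_script(SwitchoverModel("node-a", "node-b")))
```

Keeping the model as data and rendering the script from it means the same sequence can be regenerated for any pair of nodes, and the generated text can be reviewed before it is ever run.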

The conditions under which the main node is unavailable include shutdown, damage or split-brain of the main node, wherein the split-brain condition of the main node comprises the following:

the main node cannot connect to the standby node or to any worker node;

the standby node is not in split-brain: it cannot connect to the main node, but it can connect to the worker nodes;

and the Etcd service of the main node is shut down.
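The connectivity part of the split-brain condition can be expressed as a small predicate. The sketch below is one possible reading of the two reachability conditions; the `Reachability` structure and how its fields are measured (for example by ping or health checks) are assumptions not specified in the description.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Reachability:
    """Observed connectivity; how it is probed is outside this sketch."""
    main_to_standby: bool
    main_to_workers: Sequence[bool]     # one entry per worker node
    standby_to_main: bool
    standby_to_workers: Sequence[bool]  # one entry per worker node


def main_is_split_brain(r: Reachability) -> bool:
    """Return True when the main node is isolated while the standby node is healthy."""
    # The main node reaches neither the standby node nor any worker node.
    main_isolated = not r.main_to_standby and not any(r.main_to_workers)
    # The standby node cannot reach the main node but reaches every worker node.
    standby_healthy = not r.standby_to_main and all(r.standby_to_workers)
    return main_isolated and standby_healthy


if __name__ == "__main__":
    observed = Reachability(False, [False, False], False, [True, True])
    print(main_is_split_brain(observed))
```

Requiring the standby node to reach every worker node guards against promoting a standby node that is itself the isolated one. Note that with an empty worker list `all()` is true, so a cluster without workers would need an extra check.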

In operation, after the container cloud platform completes the deployment of the main node and standby node components, the main node provides the management function of the container cloud platform control plane module, and the backup module of the standby node backs up the metadata of the container cloud platform in real time.

Through the Etcd service, the main node is monitored in real time and its metadata is replicated for use after switching to the standby node. The DRBD component is developed using the DRBD component deployment model, disks such as DRBD1 are created on the main and standby nodes of the cloud platform, and DRBD1 is used for synchronizing the Etcd data; synchronizing data in DRBD asynchronous mode ensures that the standby node backs up the Etcd data of the main node. Because the standby node is part of the container cloud platform, its state can be monitored in real time, which ensures that its data synchronization keeps working.

The main/standby switching module executes the executable script that switches the main node to the standby node: the switching model is instantiated as an executable script, and executing this script automatically switches the container cloud platform control plane components to the standby node. This supports recovery of the cloud platform's management capability within 1 minute and ensures that the container cloud platform metadata is not lost.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A method for realizing high availability of main and standby container cloud platforms, characterized in that: the container cloud supports a main/standby mode, the main/standby mode comprises a main node and a standby node, and the main node and the standby node comprise an NFS shared file storage service, a control plane module, a main/standby switching module and a backup module;

when the main node is unavailable or suffers split-brain, the main/standby switching module of the main/standby mode of the container cloud platform switches the container cloud platform control plane components to the standby node, and the backed-up metadata is transmitted to the standby node through the backup module, so that the service of the container cloud platform is quickly recovered.

2. The method for realizing high availability of main and standby container cloud platforms according to claim 1, wherein: the backup module is provided with an Etcd service; the Etcd service uses a single replica; the Etcd service and the stateless control plane modules run on the main node by means of the Label of the designated main node; the Etcd service manages the metadata of the container cloud platform and supports the availability of the container cloud platform control plane.

3. The method for realizing high availability of main and standby container cloud platforms according to claim 2, wherein: the data produced by the Etcd service is backed up to the backup module in real time; when the main node is unavailable or suffers split-brain, the Etcd data is located on the backup module, and the Etcd service is started on the standby node to provide usable metadata for the container cloud platform.

4. The method for realizing high availability of main and standby container cloud platforms according to claim 2, wherein: the components assigned by the Label of the designated main node include Keepalived, CKE-advertisement, Kube-ApiServer and the like, which can run together in the main/standby mode.

5. The method for realizing high availability of main and standby container cloud platforms according to claim 3, wherein: the backup module further comprises a DRBD component deployment model, which supports DRBD asynchronous mode configuration, DRBD managed-disk configuration, DRBD device name configuration and DRBD component running-node configuration.

6. The method for realizing high availability of main and standby container cloud platforms according to claim 5, wherein: the DRBD component is developed using the DRBD component deployment model; disks such as DRBD1 are created on the main node and the standby node of the container cloud platform respectively; DRBD1 is used for synchronizing the Etcd data; synchronizing data in DRBD asynchronous mode ensures that the standby node backs up the Etcd data of the main node; and because the standby node is part of the container cloud platform, its state can be monitored in real time, which ensures that its data synchronization keeps working.

7. The method for realizing high availability of main and standby container cloud platforms according to claim 2, wherein: the main/standby switching module supports shutting down the Etcd service on the main node, starting the Etcd service on the standby node, switching the main node Label to the standby node, and deleting the main node from the container cloud platform cluster, whereupon the Kubelet automatically switches the non-Etcd control plane services to the standby node.

8. The method for realizing high availability of main and standby container cloud platforms according to claim 7, wherein: when the main node is unavailable, the main/standby switching module needs to execute an executable script that switches the main node to the standby node; the switching model is instantiated as an executable script, and executing this main/standby switching script completes the automatic switching of the container cloud platform control plane components to the standby node.

9. The method for realizing high availability of main and standby container cloud platforms according to claim 8, wherein: the conditions under which the main node is unavailable include shutdown, damage or split-brain of the main node.

10. The method for realizing high availability of main and standby container cloud platforms according to claim 9, wherein: the split-brain condition of the main node comprises: the main node cannot connect to the standby node or to any worker node; the standby node is not in split-brain, that is, the standby node cannot connect to the main node but can connect to all worker nodes; and the Etcd service of the main node is shut down.