CN114584459A - Method for realizing high availability of main and standby container cloud platforms - Google Patents


Publication number
CN114584459A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210221854.7A
Other languages
Chinese (zh)
Inventor
石光银
蔡卫卫
高传集
孙思清
肖雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-03-07
Filing date
2022-03-07
Publication date
2022-06-03


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of DRBD and main/standby high availability of container cloud platforms, and in particular to a method for realizing main/standby high availability of a container cloud platform. When the main node is unavailable or in a split-brain state, the main/standby switching module of the main/standby mode of the container cloud platform switches the container cloud platform control plane components to the standby node, and the backed-up metadata is transmitted to the standby node through the backup module, so that the service of the container cloud platform is recovered quickly. The advantages are as follows: by giving the container cloud a main/standby mode, one main node provides the container cloud platform control plane service and one standby node backs up the container cloud platform metadata in real time; when the main node is unavailable or in a split-brain state, the main/standby switching model of the container cloud platform quickly switches the control plane components to the standby node, so that the management capability of the cloud platform can be recovered within 1 minute and the container cloud platform metadata is not lost.

Description

Method for realizing high availability of main and standby container cloud platforms
Technical Field
The invention relates to the field of DRBD and main/standby high availability of container cloud platforms, and in particular to a method for realizing main/standby high availability of a container cloud platform.
Background
With the development of cloud computing services, cloud vendors have successively released public cloud, private cloud, edge cloud and similar services. To use a private cloud or an edge cloud, users often need to purchase resources such as physical machines and switches to build it. After purchase, these private cloud and edge cloud products can only be used well by users who have cloud computing expertise. Most users, however, have no such expertise, and many have weak operation and maintenance capability. Users therefore want to purchase PaaS products with capabilities such as software development, microservices and API gateways, and to use PaaS (Platform as a Service) services directly to meet their business requirements.
When software development services, API (application programming interface) gateway services and microservices are offered as PaaS products on top of a private cloud or an edge cloud, the private cloud or edge cloud must itself occupy as few resources as possible so that resources are left for the PaaS services, and only local disks can be used to store data. Locally stored data must therefore be highly reliable, and when the cloud platform fails, the container cloud platform must be recovered quickly.
DRBD is a technology that provides highly reliable local storage by synchronizing data to other nodes as a backup; however, there is no existing method for using DRBD to support main/standby disaster recovery of a container cloud.
Disclosure of Invention
The invention aims to provide a method for realizing high availability of main and standby container cloud platforms, so as to solve the problems described in the Background.
To achieve this purpose, the invention provides the following technical solution:
a method for realizing high availability of main and standby container cloud platforms, characterized in that the container cloud supports a main/standby mode, the main/standby mode comprises a main node and a standby node, and the main node and the standby node comprise an NFS shared file storage service, a control plane module, a main/standby switching module and a backup module;
the container cloud supports the main node providing the services of the container cloud platform control plane module;
the container cloud supports node switching by the main/standby switching module;
the container cloud supports the standby node backing up the metadata of the container cloud platform in real time through the backup module;
after the container cloud platform completes the deployment of the main node and standby node components, the main node provides the management function of the control plane module of the container cloud platform, and the backup module of the standby node backs up the metadata of the container cloud platform in real time;
when the main node is unavailable or in a split-brain state, the main/standby switching module of the main/standby mode of the container cloud platform is used to switch the control plane components of the container cloud platform to the standby node, the backed-up metadata is transmitted to the standby node through the backup module, and the service of the container cloud platform is recovered quickly.
Preferably, the backup module is provided with an Etcd service; the Etcd service uses a single replica; the Etcd service and the stateless control plane modules run on the main node by means of a label that designates the main node; the Etcd service manages the metadata of the container cloud platform and supports the availability of the container cloud platform control plane.
Preferably, the data produced by the Etcd service is backed up to the backup module in real time; when the main node is unavailable or in a split-brain state, the Etcd data is located on the backup module, and the Etcd service is started on the standby node to provide usable metadata for the container cloud platform.
Preferably, the modules designated by the main node label comprise Keepalived, CKE-advertisement, Kube-ApiServer and the like, which may run in the main/standby mode at the same time.
Preferably, the backup module further comprises a DRBD component deployment model, support for DRBD asynchronous mode configuration, DRBD managed disk configuration, DRBD drive letter configuration, and DRBD component running node configuration.
Preferably, the DRBD component is developed using the DRBD component deployment model, and disks such as DRBD1 are generated on the main node and the standby node of the container cloud platform respectively; DRBD1 is used to synchronize the Etcd data, and data synchronization in DRBD asynchronous mode ensures that the standby node backs up the Etcd data of the main node.
Preferably, the main/standby switching module supports stopping the Etcd service on the main node, starting the Etcd service on the standby node, switching the main node label to the standby node, deleting the main node from the container cloud platform cluster, and using the Kubelet to automatically switch the non-Etcd control plane services to the standby node.
Preferably, when the main node is unavailable, the main/standby switching module is required to execute an executable script that switches from the main node to the standby node: the switching model is instantiated as an executable script, and the script is executed to complete the main/standby switching, that is, to automatically switch the container cloud platform control components to the standby node.
Preferably, the conditions under which the main node is unavailable comprise shutdown, damage or split brain of the main node.
Preferably, the split-brain condition of the main node comprises: the main node cannot connect to the standby node or to any worker node; the standby node is not in a split-brain state, in that it cannot connect to the main node but can connect to all worker nodes; in this case the Etcd service of the main node is shut down.
Compared with the prior art, the invention has the beneficial effects that:
1. by giving the container cloud a main/standby mode, the invention supports providing the container cloud platform control plane service with one main node and backing up the container cloud platform metadata in real time with one standby node; when the main node is unavailable or in a split-brain state, the main/standby switching model of the container cloud platform quickly switches the control plane components to the standby node, so that the management capability of the cloud platform can be recovered within 1 minute and the container cloud platform metadata is not lost;
2. the control plane of the container cloud platform runs only on the main node, which reduces the resource overhead of the control plane and supports scenarios with few physical resources, such as private clouds and edge clouds; backing up the container cloud platform metadata in real time on one standby node improves the reliability of the control plane and allows the container cloud platform to recover quickly when its control plane fails.
Drawings
FIG. 1 is a system architecture diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution:
a method for realizing high availability of main and standby container cloud platforms comprises the following steps:
S1: the container cloud supports a main/standby mode, the main/standby mode comprises a main node and a standby node, and the main node and the standby node comprise an NFS shared file storage service, a control plane module, a main/standby switching module and a backup module;
S2: the container cloud supports the main node providing the services of the container cloud platform control plane module, supports node switching by the main/standby switching module, and supports the standby node backing up the metadata of the container cloud platform in real time through the backup module;
S3: after the container cloud platform completes the deployment of the main node and standby node components, the main node provides the management function of the control plane module of the container cloud platform, and the backup module of the standby node backs up the metadata of the container cloud platform in real time;
S4: when the main node is unavailable or in a split-brain state, the main/standby switching module of the main/standby mode of the container cloud platform is used to switch the container cloud platform control plane components to the standby node, and the backed-up metadata is transmitted to the standby node through the backup module, so that the service of the container cloud platform is recovered quickly.
In this embodiment, the main node provides the management function of the control plane module of the container cloud platform, and the backup module of the standby node backs up the metadata of the container cloud platform in real time. When the main node is unavailable or in a split-brain state, the main/standby switching module switches the control plane components of the container cloud platform to the standby node, the backed-up metadata is transmitted to the standby node through the backup module, and the service of the container cloud platform is recovered quickly. Adding a standby node alongside the main node to back up the container cloud platform metadata in real time improves the reliability of the control plane and allows the container cloud platform to recover quickly when its control plane fails.
The backup module is provided with an Etcd service (Etcd is the database that stores the container cloud metadata):
the Etcd service manages the metadata of the container cloud platform and supports the availability of the container cloud platform control plane.
The data produced by the Etcd service is backed up to the backup module in real time, forming a continuously monitored real-time copy of the data.
When the main node is unavailable or in a split-brain state, the Etcd data can be found on the standby node and the Etcd service is started on the standby node to provide usable metadata for the container cloud platform. In this way the container cloud platform supports providing the control plane module service with one main node and backing up the container cloud platform metadata in real time with one standby node.
The Etcd service uses a single replica. The Etcd service and the stateless control plane modules run on the main node by means of a label (a container resource label) that designates the main node. The modules designated by the main node label include Keepalived (virtual IP management service), CKE-advertisement (container resource request Webhook service), Kube-ApiServer (the API server of K8S) and the like, which may run in the main/standby mode at the same time.
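The label-based placement described above can be pictured with a minimal Python sketch. It is only an illustration under assumptions: the patent does not give the label key or value, so `cke.example/role=master` and the node names are hypothetical, and the matching rule simply mirrors how a Kubernetes node selector compares key/value pairs.

```python
from typing import Dict, List

# Hypothetical label key and value; the patent does not name the actual label.
MASTER_LABEL = {"cke.example/role": "master"}


def matches(node_labels: Dict[str, str], selector: Dict[str, str]) -> bool:
    """Return True when every selector key/value pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def eligible_nodes(nodes: Dict[str, Dict[str, str]], selector: Dict[str, str]) -> List[str]:
    """Sorted names of the nodes on which a component with this selector may run."""
    return sorted(name for name, labels in nodes.items() if matches(labels, selector))


def move_label(nodes: Dict[str, Dict[str, str]], selector: Dict[str, str],
               src: str, dst: str) -> Dict[str, Dict[str, str]]:
    """Return a copy of the node table with the selector labels moved from src to dst."""
    updated = {name: dict(labels) for name, labels in nodes.items()}
    for key, value in selector.items():
        updated[src].pop(key, None)
        updated[dst][key] = value
    return updated
```

With a node table in which only `node-a` carries the label, `eligible_nodes` returns `["node-a"]`; after `move_label(..., "node-a", "node-b")` it returns `["node-b"]`, which is the effect the switching module relies on when it moves the main node label.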
Through the Etcd service, the main node is monitored in real time and its metadata is replicated, so that the metadata can be used after switching to the standby node.
The backup module is also provided with DRBD (Distributed Replicated Block Device):
the DRBD part includes a DRBD component deployment model, support for DRBD asynchronous mode configuration, DRBD managed disk configuration, DRBD drive letter configuration, and DRBD component running node configuration.
The DRBD component is developed using the DRBD component deployment model, and disks such as DRBD1 are generated on the main and standby nodes of the cloud platform; DRBD1 is used to synchronize the Etcd data.
Data synchronization in DRBD asynchronous mode ensures that the standby node backs up the Etcd data of the main node.
Because the standby node is part of the container cloud platform, its state can be monitored in real time, which ensures that the data synchronization function of the standby node keeps working.
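As an illustration of what the DRBD configuration items above might produce, the following Python sketch renders a DRBD resource definition for a two-node pair. It is a sketch under assumptions, not the patent's deployment model: the resource name, host names, addresses and backing disks are hypothetical, and DRBD replication protocol A is used because it is DRBD's asynchronous protocol.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrbdPeer:
    hostname: str      # node name as DRBD sees it (hypothetical values in the usage below)
    address: str       # replication endpoint, "ip:port"
    backing_disk: str  # local block device that backs the DRBD device


def render_drbd_resource(name: str, device: str, peers: List[DrbdPeer],
                         protocol: str = "A") -> str:
    """Render a DRBD resource stanza; protocol A is DRBD's asynchronous mode."""
    if protocol not in ("A", "B", "C"):
        raise ValueError(f"unknown DRBD protocol: {protocol}")
    if len(peers) != 2:
        raise ValueError("this sketch expects exactly one main and one standby peer")
    lines = [f"resource {name} {{", "  net {", f"    protocol {protocol};", "  }"]
    for peer in peers:
        lines += [
            f"  on {peer.hostname} {{",
            f"    device    {device};",
            f"    disk      {peer.backing_disk};",
            f"    address   {peer.address};",
            "    meta-disk internal;",
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

For example, `render_drbd_resource("etcd", "/dev/drbd1", [DrbdPeer("node-a", "10.0.0.1:7789", "/dev/sdb1"), DrbdPeer("node-b", "10.0.0.2:7789", "/dev/sdb1")])` yields one `resource etcd { ... }` block with a `net` section and one `on` section per node. Asynchronous replication keeps write latency low on the main node, at the cost that the most recent writes may not yet have reached the standby node when the main node fails.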
The main/standby switching module supports stopping the Etcd service on the main node, starting the Etcd service on the standby node, and switching the main node label to the standby node.
The main node is then deleted from the container cloud platform cluster, and the Kubelet automatically switches the non-Etcd control plane services to the standby node.
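The order of the switching steps can be sketched as state transitions on a simulated cluster. This Python model is only an illustration: `ClusterState` and its fields are hypothetical names, and the Kubelet's behaviour is reduced to "the remaining control plane services follow the label".

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ClusterState:
    members: Set[str]          # nodes currently registered in the cluster
    etcd_node: str             # node on which the single Etcd replica runs
    label_node: str            # node that carries the main node label
    control_plane_node: str    # node that runs the non-Etcd control plane services
    log: List[str] = field(default_factory=list)


def switch_to_standby(state: ClusterState, main: str, standby: str) -> ClusterState:
    """Apply the switching steps in the order given in the description."""
    if standby not in state.members:
        raise ValueError(f"{standby} is not a cluster member")
    state.log.append(f"stop etcd on {main}")
    state.etcd_node = standby
    state.log.append(f"start etcd on {standby}")
    state.label_node = standby
    state.log.append(f"move main node label to {standby}")
    state.members.discard(main)
    state.log.append(f"delete {main} from cluster")
    # The Kubelet restarts the non-Etcd control plane services where the label now is.
    state.control_plane_node = state.label_node
    state.log.append(f"control plane services running on {standby}")
    return state
```

After `switch_to_standby(state, "node-a", "node-b")`, the Etcd service, the label and the control plane services all point at `node-b`, and `node-a` is no longer a member.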
When the main node is unavailable, the main/standby switching module is required to execute an executable script that switches from the main node to the standby node: the switching model is instantiated as an executable script, and the script is executed to complete the main/standby switching, that is, to automatically switch the container cloud platform control components to the standby node. This supports recovering the management capability of the cloud platform within 1 minute and ensures that the container cloud platform metadata is not lost.
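One way to picture "instantiating the model as an executable script" is a small renderer that turns model parameters into shell commands. The sketch below is an assumption-heavy illustration: the patent does not list the commands, so the model fields, the label, the mount point and the idea that Etcd runs as a systemd unit are all hypothetical. It also assumes the script runs on the standby node while the main node is unreachable, so stopping Etcd on the main node is left out.

```python
import shlex
from dataclasses import dataclass


@dataclass
class SwitchModel:
    main_node: str
    standby_node: str
    drbd_resource: str = "etcd"              # hypothetical DRBD resource name
    drbd_device: str = "/dev/drbd1"          # the DRBD1 disk mentioned in the text
    etcd_data_dir: str = "/var/lib/etcd"     # assumed mount point for the Etcd data
    etcd_unit: str = "etcd"                  # assumes Etcd runs as a systemd unit
    label: str = "cke.example/role=master"   # hypothetical main node label


def render_switch_script(model: SwitchModel) -> str:
    """Instantiate the switching model as shell script text."""
    q = shlex.quote
    steps = [
        ("Promote the replicated disk on the standby node",
         f"drbdadm primary {q(model.drbd_resource)}"),
        ("Mount the replicated Etcd data",
         f"mount {q(model.drbd_device)} {q(model.etcd_data_dir)}"),
        ("Start Etcd on the standby node",
         f"systemctl start {q(model.etcd_unit)}"),
        ("Move the main node label to the standby node",
         f"kubectl label node {q(model.standby_node)} {q(model.label)} --overwrite"),
        ("Delete the old main node from the cluster",
         f"kubectl delete node {q(model.main_node)}"),
    ]
    lines = ["#!/bin/sh", "set -e"]
    for comment, command in steps:
        lines.append(f"# {comment}")
        lines.append(command)
    return "\n".join(lines) + "\n"
```

Calling `render_switch_script(SwitchModel("node-a", "node-b"))` returns the script text, which a caller could write to a file and execute. The exact commands and their order in a real deployment depend on how Etcd and the control plane components are installed.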
The conditions under which the main node is unavailable comprise shutdown, damage or split brain of the main node, wherein the split-brain condition of the main node comprises the following:
the main node cannot connect to the standby node or to the worker nodes;
the standby node is not in a split-brain state: it cannot connect to the main node, but it can connect to the worker nodes;
in this case the Etcd service of the main node is shut down.
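The split-brain conditions listed above can be written as a small decision function. This Python sketch is illustrative only: the field and function names are hypothetical, and how connectivity is actually probed is not described in the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reachability:
    main_sees_standby: bool     # main node can connect to the standby node
    main_sees_workers: bool     # main node can connect to the worker nodes
    standby_sees_main: bool     # standby node can connect to the main node
    standby_sees_workers: bool  # standby node can connect to the worker nodes


def main_is_split_brain(r: Reachability) -> bool:
    """True when the main node is isolated while the standby node still sees the workers."""
    main_isolated = not r.main_sees_standby and not r.main_sees_workers
    standby_healthy = not r.standby_sees_main and r.standby_sees_workers
    return main_isolated and standby_healthy


def action_for_main(r: Reachability) -> str:
    """Action for the main node: stop its Etcd service on split brain, else keep running."""
    return "stop-etcd" if main_is_split_brain(r) else "keep-running"
```

If the main node still reaches the worker nodes, or the standby node has itself lost the worker nodes, the function does not report a main node split brain, so the Etcd service on the main node is left running.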
For the situations above, the procedure is as follows: after the container cloud platform completes the deployment of the main node and standby node components, the main node provides the management function of the container cloud platform control plane module, and the backup module of the standby node backs up the metadata of the container cloud platform in real time;
through the Etcd service, the main node is monitored in real time and its metadata is replicated for use after switching to the standby node; the DRBD component is developed using the DRBD component deployment model, disks such as DRBD1 are generated on the main and standby nodes of the cloud platform, DRBD1 is used to synchronize the Etcd data, and data synchronization in DRBD asynchronous mode ensures that the standby node backs up the Etcd data of the main node. Because the standby node is part of the container cloud platform, its state can be monitored in real time, which ensures that the data synchronization function of the standby node keeps working.
The main/standby switching module then executes the executable script that switches from the main node to the standby node: the switching model is instantiated as an executable script, and the script is executed to complete the main/standby switching, that is, to automatically switch the container cloud platform control components to the standby node. This supports recovering the management capability of the cloud platform within 1 minute and ensures that the container cloud platform metadata is not lost.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A method for realizing high availability of main and standby container cloud platforms, characterized in that: the container cloud supports a main/standby mode, the main/standby mode comprises a main node and a standby node, and the main node and the standby node comprise an NFS shared file storage service, a control plane module, a main/standby switching module and a backup module;
the container cloud supports the main node providing the services of the container cloud platform control plane module;
the container cloud supports node switching by the main/standby switching module;
the container cloud supports the standby node backing up the metadata of the container cloud platform in real time through the backup module;
after the container cloud platform completes the deployment of the main node and standby node components, the main node provides the management function of the control plane module of the container cloud platform, and the backup module of the standby node backs up the metadata of the container cloud platform in real time;
when the main node is unavailable or in a split-brain state, the main/standby switching module of the main/standby mode of the container cloud platform is used to switch the container cloud platform control plane components to the standby node, and the backed-up metadata is transmitted to the standby node through the backup module, so that the service of the container cloud platform is recovered quickly.
2. The method for realizing high availability of main and standby container cloud platforms according to claim 1, characterized in that: the backup module is provided with an Etcd service; the Etcd service uses a single replica; the Etcd service and the stateless control plane modules run on the main node by means of a label that designates the main node; the Etcd service manages the metadata of the container cloud platform and supports the availability of the container cloud platform control plane.
3. The method for realizing high availability of main and standby container cloud platforms according to claim 2, characterized in that: the data produced by the Etcd service is backed up to the backup module in real time; when the main node is unavailable or in a split-brain state, the Etcd data is located on the backup module, and the Etcd service is started on the standby node to provide usable metadata for the container cloud platform.
4. The method for realizing high availability of main and standby container cloud platforms according to claim 2, characterized in that: the modules designated by the main node label comprise Keepalived, CKE-advertisement, Kube-ApiServer and the like, which may run in the main/standby mode at the same time.
5. The method for realizing high availability of main and standby container cloud platforms according to claim 3, characterized in that: the backup module further comprises a DRBD component deployment model, support for DRBD asynchronous mode configuration, DRBD managed disk configuration, DRBD drive letter configuration, and DRBD component running node configuration.
6. The method for realizing high availability of main and standby container cloud platforms according to claim 5, characterized in that: the DRBD component is developed using the DRBD component deployment model, and disks such as DRBD1 are generated on the main node and the standby node of the container cloud platform respectively; DRBD1 is used to synchronize the Etcd data; data synchronization in DRBD asynchronous mode ensures that the standby node backs up the Etcd data of the main node; and because the standby node is part of the container cloud platform, its state can be monitored in real time, which ensures that the data synchronization function of the standby node keeps working.
7. The method for realizing high availability of main and standby container cloud platforms according to claim 2, characterized in that: the main/standby switching module supports stopping the Etcd service on the main node, starting the Etcd service on the standby node, switching the main node label to the standby node, deleting the main node from the container cloud platform cluster, and using the Kubelet to automatically switch the non-Etcd control plane services to the standby node.
8. The method for realizing high availability of main and standby container cloud platforms according to claim 7, characterized in that: when the main node is unavailable, the main/standby switching module is required to execute an executable script that switches from the main node to the standby node; the switching model is instantiated as an executable script, and the script is executed to complete the main/standby switching, that is, to automatically switch the container cloud platform control components to the standby node.
9. The method for realizing high availability of main and standby container cloud platforms according to claim 8, characterized in that: the conditions under which the main node is unavailable comprise shutdown, damage or split brain of the main node.
10. The method for realizing high availability of main and standby container cloud platforms according to claim 9, characterized in that: the split-brain condition of the main node comprises: the main node cannot connect to the standby node or to any worker node; the standby node is not in a split-brain state, in that it cannot connect to the main node but can connect to all worker nodes; in this case the Etcd service of the main node is shut down.
CN202210221854.7A 2022-03-07 2022-03-07 Method for realizing high availability of main and standby container cloud platforms Pending CN114584459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210221854.7A CN114584459A (en) 2022-03-07 2022-03-07 Method for realizing high availability of main and standby container cloud platforms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210221854.7A CN114584459A (en) 2022-03-07 2022-03-07 Method for realizing high availability of main and standby container cloud platforms

Publications (1)

Publication Number Publication Date
CN114584459A true CN114584459A (en) 2022-06-03

Family

ID=81778971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210221854.7A Pending CN114584459A (en) 2022-03-07 2022-03-07 Method for realizing high availability of main and standby container cloud platforms

Country Status (1)

Country Link
CN (1) CN114584459A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117201507A (en) * 2023-11-08 2023-12-08 苏州元脑智能科技有限公司 Cloud platform switching method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102810111A (en) * 2012-05-07 2012-12-05 互动在线(北京)科技有限公司 Implementation method and system for keeping high availability of Oracle database service
CN110825495A (en) * 2019-11-08 2020-02-21 北京浪潮数据技术有限公司 Container cloud platform recovery method, device, equipment and readable storage medium
CN112052127A (en) * 2020-10-12 2020-12-08 苏州浪潮智能科技有限公司 Data synchronization method and device for dual-computer hot standby environment
CN112698992A (en) * 2021-03-23 2021-04-23 腾讯科技(深圳)有限公司 Disaster recovery management method and related device for cloud cluster
CN113190378A (en) * 2020-12-31 2021-07-30 华数云科技有限公司 Edge cloud disaster recovery method based on distributed cloud platform



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination