CN114257595B - Cloud platform disaster recovery machine room election system, method, device, medium and electronic equipment - Google Patents

Publication number
CN114257595B
Authority
CN
China
Prior art keywords
machine room
main
server
service
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111590393.2A
Other languages
Chinese (zh)
Other versions
CN114257595A (en)
Inventor
石鸿伟
张阔意
史精文
黄韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Network Communication and Security Zijinshan Laboratory
Original Assignee
Network Communication and Security Zijinshan Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Network Communication and Security Zijinshan Laboratory
Priority to CN202111590393.2A
Publication of CN114257595A
Application granted
Publication of CN114257595B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/2033 Failover techniques switching over of hardware resources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/14 Session management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45562 Creating, deleting, cloning virtual machine instances
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45575 Starting, stopping, suspending or resuming virtual machine instances
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45595 Network integration; Enabling network access in virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention discloses a cloud platform disaster recovery machine room election system, method, device, medium and electronic equipment. The system comprises at least two machine rooms and at least one first server used for route forwarding. One second server selected from each machine room, together with the first server, forms a Zookeeper cluster that spans the machine rooms, and a container cloud platform based on a Kubernetes cluster is built across the servers within each single machine room. The cloud platform schedules a main selection service container onto each server node that starts in the machine room. Through the main selection service, the server node information and machine room information are registered in the Zookeeper cluster and the machine room competes to become the main machine room. In events such as the failure of a second server in a machine room, the Zookeeper cluster notifies the other servers so that the main machine room can be re-elected. The election scheme is not limited by the number of servers, elects the main machine room automatically, and is suited to machine room disaster recovery election under cloud platform technology.

Description

Cloud platform disaster recovery machine room election system, method, device, medium and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a cloud platform disaster recovery machine room election system, a method, a device, a medium and electronic equipment.
Background
A server room (machine room) is a place that houses servers and is designed for their continuous operation. To improve the reliability of application services and guard against unexpected faults, the servers in a machine room are usually given a disaster recovery backup, or a new disaster recovery backup center, that is, a standby machine room, is built. The original machine room serves as the main machine room in normal operation, and the standby machine room is activated when the main machine room fails. To maintain high reliability, the disaster recovery backup center and the main machine room use different power supplies, for example by placing the backup center at a remote site.
In the traditional machine room disaster recovery scheme, a distributed message coordination management program is deployed on every server in the machine room and configured in cluster mode. The cluster relies on a node election mechanism in which more than half of the nodes must agree before the cluster can work. Under this election principle the cluster needs an odd number of servers: in a cluster of 2N+1 servers, an election takes effect only once N+1 of them agree. In a dual machine room deployment in particular, an even number of servers is deployed in one machine room and an odd number in the other. The traditional scheme has the following problems:
(1) If the machine room holding the odd number of servers suffers a fault such as a network disconnection or power failure, the machine room holding the even number of servers cannot keep working either, since the surviving nodes no longer reach the required majority.
(2) The traditional scheme does not cover the cloud platform disaster recovery scenarios that are now common.
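Problem (1) follows from simple majority arithmetic. The sketch below is illustrative only: the split of 2 servers in room A and 3 in room B is a hypothetical example, and the function names are not taken from the patent.

```python
def quorum(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def survives(ensemble_size: int, failed: int) -> bool:
    """True if the nodes that remain can still reach a majority."""
    return ensemble_size - failed >= quorum(ensemble_size)


# Hypothetical dual-room layout: 2 servers in room A, 3 in room B (2N+1 = 5).
ROOM_A, ROOM_B = 2, 3
TOTAL = ROOM_A + ROOM_B

if __name__ == "__main__":
    print("quorum:", quorum(TOTAL))
    print("room A lost, cluster still works:", survives(TOTAL, ROOM_A))
    print("room B lost, cluster still works:", survives(TOTAL, ROOM_B))
```

With this layout the cluster tolerates losing the smaller room but not the larger one, which is the limitation the traditional scheme runs into.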
Disclosure of Invention
Technical purpose: to solve the above technical problems, the invention discloses a cloud platform disaster recovery machine room election system, method, device, medium and electronic equipment, which allow the machine rooms to elect a main machine room automatically without being limited by the number of servers.
Technical scheme: to achieve this technical purpose, the invention adopts the following technical scheme. A cloud platform disaster recovery machine room system comprises:
at least two machine rooms and at least one first server used for route forwarding, wherein each machine room comprises at least one second server;
a container cloud platform based on a Kubernetes cluster, built on all the second servers in each machine room, which schedules a main selection service container onto a second server when that server starts, the main selection service container running the main selection service;
a Zookeeper cluster spanning the machine rooms, built on the first server and on one second server selected arbitrarily from each machine room;
the main selection service, which registers the node information of the started second server and the information of its machine room in the Zookeeper cluster, competes for its machine room to become the main machine room, and monitors the starting and stopping of the other second servers.
Further, the Zookeeper cluster includes a data structure directory accessed by the main selection service, and the data structure directory includes:
a selector directory node, representing the namespace of the main selection service;
an active directory node, storing the main machine room information;
a standby directory node, containing a plurality of temporary (ephemeral) directory nodes that store the second server node information and machine room information registered by each main selection service;
and a lock directory node, used to generate the distributed exclusive lock during main machine room election.
Further, the main selection service monitoring the starting and stopping of the other second servers includes:
the main selection service watches the standby directory node, and when a node is added or deleted under the standby directory node, the main selection service receives a notification describing the added or deleted node.
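The directory layout and the add/delete notifications can be pictured with a small in-memory model. This is a sketch under stated assumptions, not the patented implementation: the paths in the comments assume that active, standby and lock sit under the selector namespace, and the class and method names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class Directory:
    """In-memory stand-in for the directory tree (no real Zookeeper involved)."""

    def __init__(self) -> None:
        self.active: str = ""              # assumed path: /selector/active
        self.standby: Dict[str, str] = {}  # assumed path: /selector/standby/<node> -> room
        self._watchers: List[Callable[[str, str], None]] = []

    def watch_standby(self, callback: Callable[[str, str], None]) -> None:
        """Register a callback invoked on every add or delete under standby."""
        self._watchers.append(callback)

    def register(self, node: str, room: str) -> None:
        self.standby[node] = room
        self._notify("added", node)

    def unregister(self, node: str) -> None:
        if self.standby.pop(node, None) is not None:
            self._notify("deleted", node)

    def _notify(self, event: str, node: str) -> None:
        for callback in list(self._watchers):
            callback(event, node)


events: List[Tuple[str, str]] = []
directory = Directory()
directory.watch_standby(lambda event, node: events.append((event, node)))
directory.register("a-1", "A")   # second server a-1 in room A starts
directory.register("b-1", "B")   # second server b-1 in room B starts
directory.unregister("a-1")      # a-1 stops or fails
print(events)
```

Each watcher sees the same ordered stream of events, which is what lets every main selection service react to the start or stop of any other second server.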
A cloud platform disaster recovery machine room election method comprises the following steps:
when a second server starts, the container cloud platform schedules a main selection service container onto it; the main selection service in that container registers the node information of the second server and the information of its machine room in the Zookeeper cluster, and competes for its machine room to become the main machine room;
when any second server fails, the main selection service container on the failed second server is shut down and the corresponding main selection service is disconnected from the Zookeeper cluster; the node information of the failed server is deleted from the Zookeeper cluster, the main selection services on the other second servers are notified of the deleted node information, and after receiving that notification they re-elect the main machine room;
when the failed second server returns to normal, the container cloud platform reschedules a main selection service container onto it; the main selection service in that container registers the node information of the server and the information of its machine room in the Zookeeper cluster, and competes again for the main machine room;
and the result of the main machine room election is updated in the Zookeeper cluster.
Further, the main selection service competing for its machine room to become the main machine room comprises the following steps:
the main selection services of the second servers contend in turn for the distributed exclusive lock in the Zookeeper cluster, in the order in which the second servers started;
the main selection service that obtains the distributed exclusive lock checks whether the main machine room information in the Zookeeper cluster is empty. If it is empty, the service writes the information of its own machine room into the main machine room information of the Zookeeper cluster, and its machine room is elected as the main machine room. If it is not empty, the service compares the priority of the main machine room recorded in the Zookeeper cluster with the priority of its own machine room and writes the information of the higher-priority machine room into the main machine room information of the Zookeeper cluster; the higher-priority machine room is elected as the main machine room.
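The startup election steps can be sketched as follows. This is a minimal illustration under assumptions: a local `threading.Lock` stands in for the distributed exclusive lock, a larger number means a higher priority, and on equal priority the current main machine room is kept. None of these details are specified by the text above.

```python
import threading
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ClusterState:
    """Stand-in for the active and lock directory nodes."""
    active: Optional[str] = None  # main machine room information, empty at first
    lock: threading.Lock = field(default_factory=threading.Lock)


def compete(state: ClusterState, my_room: str, priority: Dict[str, int]) -> str:
    """Run one election attempt and return the resulting main machine room."""
    with state.lock:                       # obtain the exclusive lock
        if state.active is None:           # main machine room information is empty
            state.active = my_room
        elif priority[my_room] > priority[state.active]:
            state.active = my_room         # higher-priority room takes over
        return state.active


PRIORITY = {"A": 2, "B": 1}  # hypothetical priorities: room A outranks room B
state = ClusterState()
print(compete(state, "B", PRIORITY))  # B starts first and finds the entry empty
print(compete(state, "A", PRIORITY))  # A starts later with higher priority
print(compete(state, "B", PRIORITY))  # another B server cannot displace A
```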
Further, the main selection services on the other second servers re-electing the main machine room after receiving the deleted-node notification comprises the following steps:
the main selection services on the other second servers contend for the distributed exclusive lock;
the main selection service that obtains the distributed exclusive lock checks whether the deleted node belonged to the main machine room recorded in the Zookeeper cluster; if not, there is no need to compete for the main machine room;
if so, the main selection service queries whether any second server node in the Zookeeper cluster still belongs to the main machine room. If other second server nodes remain in the main machine room, there is no need to compete for the main machine room; if none remain, the main selection service replaces the main machine room information in the Zookeeper cluster with the information of its own machine room, and its machine room is elected as the main machine room.
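The decision made after a deleted-node notification reduces to two checks. The function below is a hedged sketch: its name and parameters are hypothetical, and it is assumed to be called by a main selection service that already holds the distributed exclusive lock.

```python
from typing import Dict


def on_node_deleted(active: str, standby: Dict[str, str],
                    deleted_room: str, my_room: str) -> str:
    """Return the main machine room after one registered node has disappeared.

    `standby` maps the remaining registered nodes to their machine rooms,
    i.e. it no longer contains the deleted node.
    """
    if deleted_room != active:
        return active      # the failed node was not in the main machine room
    if any(room == active for room in standby.values()):
        return active      # other nodes keep the main machine room alive
    return my_room         # main machine room is empty: take over


# Room A is the main machine room in all three hypothetical cases.
print(on_node_deleted("A", {"a-1": "A"}, "B", "A"))              # a room B node failed
print(on_node_deleted("A", {"a-2": "A", "b-1": "B"}, "A", "B"))  # A still has a-2
print(on_node_deleted("A", {"b-1": "B"}, "A", "B"))              # last A node failed
```

Only the third case changes the main machine room, which reflects the principle of switching as rarely as possible.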
Further, the main selection service on the recovered second server competing again for the main machine room comprises the following steps:
the main selection service on the recovered second server obtains the distributed exclusive lock and compares the priority of the main machine room recorded in the Zookeeper cluster with the priority of its own machine room; if its priority is not higher than that of the existing main machine room, there is no need to compete for the main machine room;
if its priority is higher than that of the existing main machine room, the main machine room information in the Zookeeper cluster is replaced with the information of the machine room of the recovered second server, and that machine room is elected as the main machine room.
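Taken together, the three phases of the method (start, failure, recovery) can be traced in a small in-memory simulation. It is a sketch under assumptions rather than the patented implementation: locking is omitted because the simulation is single-threaded, a larger number means a higher priority, and when the main machine room empties the highest-priority surviving room takes over (the text above instead lets whichever service obtains the lock write its own room).

```python
from typing import Dict, List, Optional


class Election:
    """Single-threaded model of the start / fail / recover sequence."""

    def __init__(self, priority: Dict[str, int]) -> None:
        self.priority = priority
        self.active: Optional[str] = None   # main machine room
        self.standby: Dict[str, str] = {}   # registered node -> machine room

    def start(self, node: str, room: str) -> None:
        """Server start or recovery: register, then compete by priority."""
        self.standby[node] = room
        if self.active is None or self.priority[room] > self.priority[self.active]:
            self.active = room

    def fail(self, node: str) -> None:
        """Server failure: drop the entry and re-elect only if necessary."""
        room = self.standby.pop(node)
        if room != self.active or room in self.standby.values():
            return
        survivors = set(self.standby.values())
        # Assumption: with no survivors at all, the entry is cleared.
        self.active = max(survivors, key=self.priority.__getitem__) if survivors else None


election = Election({"A": 2, "B": 1})  # hypothetical priorities
history: List[Optional[str]] = []
election.start("b-1", "B"); history.append(election.active)  # B is first
election.start("a-1", "A"); history.append(election.active)  # A outranks B
election.fail("a-1");       history.append(election.active)  # A is empty, B takes over
election.start("a-1", "A"); history.append(election.active)  # A recovers
print(history)
```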
Further, the main selection service provides an interface for switching the main and standby machine rooms on demand.
A cloud platform disaster recovery machine room election device comprises:
a first connection module, used to build a container cloud platform based on a Kubernetes cluster for each machine room participating in the election, the container cloud platform being connected to all the second servers in that machine room; the container cloud platform schedules a main selection service container onto a second server when that server starts, the main selection service container running the main selection service;
a second connection module, used to select one second server from each machine room and build a Zookeeper cluster spanning the machine rooms between the selected second servers and a first server, located outside the machine rooms, used for route forwarding;
a monitoring module, used to monitor the current running state of all the second servers and their connection state with the Zookeeper cluster;
an automatic election module, used to elect the main machine room automatically among the participating machine rooms according to the current running state of the second servers, the connection state between the Zookeeper cluster and the main selection service in each second server's main selection service container, and the second server node information and machine room information registered in the Zookeeper cluster by the main selection services;
and an election result determining module, used to synchronously update the election result, including the main and standby machine room information, in the Zookeeper cluster.
Further, the device also comprises a setting module for switching the main and standby machine rooms on demand.
A medium storing computer executable instructions which, when executed by a processing unit, are adapted to carry out the election method of any of the preceding claims.
An electronic device comprising a processing unit and a storage unit storing computer executable instructions that when executed by the processing unit are adapted to implement the election method of any of the preceding claims.
Beneficial effects: by adopting the above technical scheme, the invention achieves the following technical effects:
1. In the disaster recovery scheme provided by the invention, a container cloud platform is built in each machine room and deeply integrates the Kubernetes container engine, enabling containerized deployment of application services, with containers isolated from one another and not affecting one another. A Zookeeper cluster spanning the machine rooms is built between one server in each machine room and the routing load server, making full use of Zookeeper's file system management and watch notification mechanisms: when the information of a directory node in the Zookeeper cluster is updated, all main selection services can be notified, which simplifies the management and coordination of the system.
2. The disaster recovery scheme starts from the now-common cloud platform scenario and is free of the constraint on the number of failed underlying servers found in the traditional machine room disaster recovery scheme. As long as the service containers of the main machine room can be rescheduled and resume normal operation, the main machine room does not need to be re-elected frequently, which improves the stability of the disaster recovery machine rooms.
3. The disaster recovery scheme elects the main machine room automatically and supports setting machine room priorities, so that a high-priority machine room automatically becomes the main machine room after it starts. When several servers in one machine room fail and become unavailable, the cloud platform's automatic container scheduling policy reschedules the container services from the unavailable servers onto other available servers in the same machine room, so the machine room can keep working normally.
4. The disaster recovery scheme also supports manual switching of the main and standby roles of the machine rooms, making it more flexible and controllable.
Drawings
FIG. 1 is a schematic structural diagram of a cloud platform based machine room disaster recovery system in an embodiment of the present invention;
FIG. 2 is a node diagram of the main selection service containers in an embodiment of the present invention;
FIG. 3 is a view of the directory node information in the Zookeeper cluster in an embodiment of the present invention;
FIG. 4 is a timing chart of machine room disaster recovery operation in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a cloud platform disaster recovery machine room election device in an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
The invention addresses machine room disaster recovery election in the now-common cloud platform scenario: when a server in the main machine room fails, automatic container scheduling restores the operation of the current main machine room, and if the main machine room cannot be restored, operation switches to the standby machine room. A remote disaster recovery system is formed by at least two machine rooms and a first server. The servers inside the machine rooms are denoted second servers; the first server is located outside the machine rooms and is mainly used for route forwarding (routing), that is, routing rules are set so that client traffic flows to the main machine room.
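The routing role of the first server can be illustrated with a minimal sketch. The `Router` class, its method names and the addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical; the text above does not specify how the routing rule is implemented.

```python
from typing import Dict, Optional


class Router:
    """Hypothetical first-server router: forwards to the current main machine room."""

    def __init__(self, room_endpoints: Dict[str, str]) -> None:
        self.room_endpoints = room_endpoints   # machine room name -> entry address
        self.main_room: Optional[str] = None

    def on_main_room_changed(self, room: Optional[str]) -> None:
        """Update the routing rule whenever the election result changes."""
        self.main_room = room

    def route(self) -> str:
        """Return the address that client traffic is forwarded to."""
        if self.main_room is None:
            raise RuntimeError("no main machine room elected yet")
        return self.room_endpoints[self.main_room]


router = Router({"A": "192.0.2.10", "B": "192.0.2.20"})
router.on_main_room_changed("A")
first = router.route()
router.on_main_room_changed("B")   # failover to the standby machine room
second = router.route()
print(first, second)
```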
As shown in FIG. 1, a machine room A cluster formed by a plurality of second servers is deployed in machine room A, and a machine room B cluster formed by a plurality of second servers is deployed in machine room B. One second server selected from each machine room and the first server form a Zookeeper (distributed message coordination management program) cluster spanning the machine rooms. The second server selected in machine room A may be any second server in the machine room A cluster, and the one selected in machine room B may be any second server in the machine room B cluster; they serve as the zk-1 and zk-2 nodes of the Zookeeper cluster respectively, and the first server serves as the zk-0 node. Zookeeper is a high-performance distributed consistency system. Building on its file system management and watch notification mechanisms, Zookeeper guarantees strong data consistency: a user sees the same data for any directory node in the Zookeeper cluster at any time. Different services can watch the same directory node, and once the content of that node is updated all of them are notified, which simplifies the management and coordination of distributed applications.
A container cloud platform based on a Kubernetes cluster is built across the second servers within a single machine room. On this platform, specific business programs serve external clients or web front ends as containers. The cloud platform deeply integrates the Kubernetes container engine, enabling containerized deployment of application services with containers isolated from one another, which provides high availability, high scalability and high portability for the services. High availability means that once a service is containerized, a crash caused by a fault such as a memory leak does not affect other services, and the service can be quickly rebuilt and recovered under the cloud platform's scheduling. High scalability means that the number of service containers can be scaled up or down quickly according to actual need, responding to workload changes in real time. High portability means that once a container image is created, the service container can easily be moved to and rapidly deployed in different environments while remaining consistent across them.
The client accesses the remote disaster recovery system through a VIP (virtual IP), and the routing load service directs traffic to the main machine room according to the IP of the elected main machine room. The containers of the main selection service (selector) on the cloud platform are scheduled with anti-affinity, which places the main selection service containers on different second servers. When a second server in the machine room starts, the cloud platform schedules a main selection service container onto that second server node according to the anti-affinity rule; the main selection service container is a running instance of the main selection service.
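In Kubernetes, this kind of placement is commonly expressed as a pod anti-affinity rule keyed on the node hostname. The sketch below builds such a manifest as a Python dictionary. The choice of a Deployment, the `app: selector` label, the replica count and the image name are all assumptions made for illustration; the text above does not name the workload type or its settings.

```python
import json
from typing import Any, Dict


def selector_deployment(replicas: int, image: str) -> Dict[str, Any]:
    """Build a Deployment manifest whose pods refuse to share a node."""
    labels = {"app": "selector"}  # hypothetical label
    anti_affinity = {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [{
                # Do not co-locate with another pod carrying the same label...
                "labelSelector": {"matchLabels": labels},
                # ...where "co-located" means "on the same node".
                "topologyKey": "kubernetes.io/hostname",
            }]
        }
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "selector"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": anti_affinity,
                    "containers": [{"name": "selector", "image": image}],
                },
            },
        },
    }


manifest = selector_deployment(3, "registry.example/selector:latest")
print(json.dumps(manifest, indent=2))
```

With a required anti-affinity rule, a replica that cannot find a node without an existing selector pod stays pending, so at most one main selection service container runs per second server.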
The zk cluster is accessed by the main selection service inside the main selection service container that the cloud platform schedules onto each second server in the machine room; with the IPs of the three zk nodes configured in the main selection service, any zk node can be accessed. The main selection service connects to the Zookeeper cluster, registers machine room information and second server node information there, attempts to compete for the main machine room, maintains its connection with zk, and monitors the starting and stopping of the other second servers. When a second server fails, the main selection service container on it also disappears and its connection to the Zookeeper cluster is lost; its registration information in the Zookeeper cluster is deleted, the main selection services on the other second servers are notified of the deletion, and those services compete for the main machine room on the principle of avoiding a switch wherever possible.
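The registration lifecycle relies on session-bound (ephemeral) entries: when a service's session ends, everything it registered disappears. The in-memory model below sketches that behavior without a real Zookeeper. The addresses, paths and class names are hypothetical; only the comma-separated `host:port` connection string format and the default client port 2181 are standard Zookeeper conventions.

```python
from typing import Dict, List, Tuple

ZK_HOSTS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]  # hypothetical zk-0, zk-1, zk-2


def connect_string(hosts: List[str], port: int = 2181) -> str:
    """Join the three zk node addresses so a client can reach any of them."""
    return ",".join(f"{host}:{port}" for host in hosts)


class FakeEnsemble:
    """In-memory model of session-owned registrations."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Tuple[int, str]] = {}  # path -> (session id, data)
        self._next_session = 0

    def open_session(self) -> int:
        self._next_session += 1
        return self._next_session

    def create_ephemeral(self, session: int, path: str, data: str) -> None:
        self.nodes[path] = (session, data)

    def close_session(self, session: int) -> List[str]:
        """Drop every node owned by the session and return the removed paths."""
        removed = sorted(p for p, (owner, _) in self.nodes.items() if owner == session)
        for path in removed:
            del self.nodes[path]
        return removed


zk = FakeEnsemble()
session_a = zk.open_session()
session_b = zk.open_session()
zk.create_ephemeral(session_a, "/selector/standby/a-1", "A")  # assumed path layout
zk.create_ephemeral(session_b, "/selector/standby/b-1", "B")
removed = zk.close_session(session_a)   # server a-1 fails and its session ends
print(connect_string(ZK_HOSTS))
print(removed, sorted(zk.nodes))
```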
Example 1:
A cloud platform disaster recovery machine room system, comprising:
at least two machine rooms and at least one first server used for route forwarding, wherein each machine room comprises at least one second server;
a container cloud platform based on a Kubernetes cluster, built on all the second servers in each machine room, which schedules a main selection service container onto a second server when that server starts, the main selection service container running the main selection service;
a Zookeeper cluster spanning the machine rooms, built on the first server and on one second server selected arbitrarily from each machine room;
the main selection service, which registers the node information of the started second server and the information of its machine room in the Zookeeper cluster, competes for its machine room to become the main machine room, and monitors the starting and stopping of the other second servers.
Preferably, the Zookeeper cluster includes a data structure directory accessed by the main selection service, and the data structure directory includes:
a selector directory node, representing the namespace of the main selection service;
an active directory node, storing the main machine room information;
a standby directory node, containing a plurality of temporary (ephemeral) directory nodes that store the second server node information and machine room information registered by each main selection service;
and a lock directory node, used to generate the distributed exclusive lock during main machine room election.
Preferably, the main selection service monitoring the starting and stopping of the other second servers includes:
the main selection service watches the standby directory node, and when a node is added or deleted under the standby directory node, the main selection service receives a notification describing the added or deleted node.
Example 2:
A cloud platform disaster recovery machine room election method comprises the following steps:
when a second server starts, the container cloud platform schedules a main selection service container onto it; the main selection service in that container registers the node information of the second server and the information of its machine room in the Zookeeper cluster, and competes for its machine room to become the main machine room;
when any second server fails, the main selection service container on the failed second server is shut down and the corresponding main selection service is disconnected from the Zookeeper cluster; the node information of the failed server is deleted from the Zookeeper cluster, the main selection services on the other second servers are notified of the deleted node information, and after receiving that notification they re-elect the main machine room;
when the failed second server returns to normal, the container cloud platform reschedules a main selection service container onto it; the main selection service in that container registers the node information of the server and the information of its machine room in the Zookeeper cluster, and competes again for the main machine room;
and the result of the main machine room election is updated in the Zookeeper cluster.
Preferably, the main selection service competing for its machine room to become the main machine room comprises the following steps:
the main selection services of the second servers contend in turn for the distributed exclusive lock in the Zookeeper cluster, in the order in which the second servers started;
the main selection service that obtains the distributed exclusive lock checks whether the main machine room information in the Zookeeper cluster is empty. If it is empty, the service writes the information of its own machine room into the main machine room information of the Zookeeper cluster, and its machine room is elected as the main machine room. If it is not empty, the service compares the priority of the main machine room recorded in the Zookeeper cluster with the priority of its own machine room and writes the information of the higher-priority machine room into the main machine room information of the Zookeeper cluster; the higher-priority machine room is elected as the main machine room.
Preferably, the re-election of the main machine room by the main selection services on the other second servers after they receive the node deletion notification comprises the following steps:
The main selection services on the other second servers compete for the distributed exclusive lock;
The main selection service that obtains the distributed exclusive lock checks whether the deleted node belongs to the main machine room recorded in the Zookeeper cluster; if not, no election is needed;
If so, the main selection service queries whether any other second server node in the Zookeeper cluster still belongs to the main machine room. If other second server nodes remain in the main machine room, no election is needed; if none remain, the main selection service replaces the main machine room information in the Zookeeper cluster with the information of its own machine room, and the election succeeds.
Preferably, the re-election of the main machine room by the main selection service on the recovered second server comprises the following steps:
The main selection service on the recovered second server obtains the distributed exclusive lock and compares the priority of the main machine room recorded in the Zookeeper cluster with the priority of its own machine room; if its priority is not higher than that of the existing main machine room, no election is needed;
if its priority is higher than that of the existing main machine room, the main machine room information in the Zookeeper cluster is replaced with the information of the machine room of the recovered second server, and the election succeeds.
Preferably, the main selection service provides an interface for switching the main machine room and the standby machine room on demand.
Example 3:
This embodiment provides a cloud platform machine room disaster recovery election method, comprising the following steps:
Step 1: In each of at least two disaster recovery machine rooms, one second server is selected; together with the first server, the selected second servers build a Zookeeper cluster spanning the machine rooms, and a Kubernetes-based container cloud platform is built across the second servers within each machine room. Specifically:
The overall machine room disaster recovery system is built as shown in FIG. 1. The disaster recovery system is initialized and started, and a Zookeeper cluster spanning the machine rooms is built between one second server in each machine room and the first server;
The cloud platform in each machine room is started, and main selection service containers are then scheduled across the second servers according to an anti-affinity policy;
After the main selection service starts running, it connects to the Zookeeper cluster and registers the information of the second server node on which it runs and the information of its machine room.
Step 2: Electing the main machine room at startup.
After the main selection service container on each second server starts, the main selection service immediately competes for the distributed exclusive lock; the service that obtains the lock competes, on behalf of the machine room where its second server is located, for the main machine room according to election rules such as the startup order and the priority of the machine rooms.
Step 3: Electing the main machine room when a fault occurs.
When a second server fails, its registration information in the Zookeeper cluster disappears. The main selection services on the other second servers receive a notification that the second server has gone offline, determine whether the current main machine room is still available, and compete for the main machine room while following the principle of switching as little as possible.
Step 4: Electing the main machine room upon fault recovery.
When a failed second server recovers, a main selection service container is scheduled onto it; the main selection service checks the priority of its machine room and attempts to elect it as the main machine room. The main selection services on the other second servers are notified that the second server is online.
Step 5: Actively switching the main and standby machine rooms during operation.
The main selection service of the disaster recovery system provides an interface for switching the main and standby machine rooms at the client's request: while the disaster recovery system is running, the client can obtain the current main/standby machine room information and manually switch the main and standby machine rooms as needed.
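The anti-affinity scheduling mentioned in Step 1 can be illustrated with a Kubernetes pod template fragment. This is a minimal sketch and not the configuration used by the invention: the label `app: selector`, the container name and the image name are hypothetical, and it assumes that anti-affinity is expressed as a required pod anti-affinity rule keyed on the node hostname, so that at most one main selection service container runs on each second server.

```python
import json

# Hypothetical label carried by every main selection service pod.
SELECTOR_LABELS = {"app": "selector"}

# Required pod anti-affinity: a pod with these labels is not scheduled onto a
# node (topologyKey = hostname) that already runs a pod with the same labels.
anti_affinity = {
    "podAntiAffinity": {
        "requiredDuringSchedulingIgnoredDuringExecution": [
            {
                "labelSelector": {"matchLabels": SELECTOR_LABELS},
                "topologyKey": "kubernetes.io/hostname",
            }
        ]
    }
}

# Pod template fragment as it could appear inside a workload manifest.
pod_template = {
    "metadata": {"labels": SELECTOR_LABELS},
    "spec": {
        "affinity": anti_affinity,
        "containers": [
            # Hypothetical container name and image.
            {"name": "selector", "image": "example/selector:latest"}
        ],
    },
}

manifest_json = json.dumps(pod_template, indent=2)
print(manifest_json)
```

Under such a rule, a main selection service container whose server has failed cannot be placed on another second server that already runs one, which matches the behaviour the method relies on when a fault occurs.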
Step 1 comprises the following steps:
Step 1.1: As shown in FIG. 1, the disaster recovery system is initialized and started, and a Zookeeper cluster spanning the machine rooms is built among one second server of machine room A, one second server of machine room B, and the first server.
Step 1.2: As shown in FIG. 2, the cloud platform in each machine room is started and, owing to anti-affinity, one main selection service (selector) container is scheduled onto each second server of the cloud platform; the main selection service runs inside this container. Zookeeper is a distributed service framework that, as shown in FIG. 3, maintains a data structure similar to a file system. The data structure contains several directory nodes, such as the persistent directory nodes selector, active, standby and lock. A persistent directory node continues to exist after the client disconnects from Zookeeper, and several temporary directory nodes are placed under a persistent directory node.
Step 1.3: After starting, the main selection service connects to the Zookeeper cluster and initializes the persistent directory nodes selector, active, standby and lock, as shown in FIG. 3 and FIG. 4. The selector directory node represents the namespace of the main selection service and restricts every node of the main selection service to operating within that directory space; the active directory node stores the main machine room information; the standby directory node stores the second server node information; and the lock directory node is used to generate the distributed exclusive lock during main machine room election. Take the entry "standby/ipA-2-001" under the standby directory node in FIG. 3 as an example: "ipA" is the IP of machine room A, where the second server is located; the middle field is the priority of that machine room; and the last three digits, "001", are an incrementing sequence number that records the order in which the main selection services on the second servers registered in the Zookeeper cluster. Each time a second server starts, the main selection service scheduled onto it registers in Zookeeper and the sequence number is incremented.
Step 1.4: The main selection service stays connected to the Zookeeper cluster, creates an ordered temporary directory node under the standby directory node, and registers the second server node information and the machine room information.
Step 1.5: The main selection service sets a watch on the standby directory node. When a node is added or deleted under the standby directory node, the main selection service receives a notification carrying the information of the added or deleted node.
Step 2, electing the main machine room at startup, comprises the following steps:
Step 2.1: The main selection services on the second server nodes compete in turn for the distributed exclusive lock, in startup order.
Step 2.2: The main selection service that obtains the lock checks whether the main machine room information in the active directory node is empty; if it is empty, the service writes the information of its own machine room, and the election succeeds.
Step 2.3: If the main machine room information is not empty, the service compares the priority of the main machine room recorded in the active directory node with the priority of its own machine room, and writes the information of the higher-priority machine room into the active directory node.
Step 3, electing the main machine room when a fault occurs, comprises the following steps:
Step 3.1: When a second server fails, the main selection service running on it stops. Because main selection service containers are scheduled with anti-affinity, the cloud platform cannot reschedule the container onto another second server, so the main selection service container, and with it the second server, is disconnected from the Zookeeper cluster.
Step 3.2: The temporary node that the main selection service created under the standby directory node of the Zookeeper cluster is deleted. Zookeeper then notifies all other main selection services watching the standby directory node of the deleted node information.
Step 3.3: On receiving the node deletion notification, the other main selection services immediately compete for the distributed exclusive lock.
Step 3.4: The main selection service that obtains the lock checks whether the deleted node belongs to the main machine room recorded in the active directory node; if not, no election is needed.
Step 3.5: If so, the main selection service queries whether any server node under the standby directory node still belongs to the main machine room. If other nodes remain in the main machine room, no election is needed.
Step 3.6: If no other second server node remains in the main machine room, the main selection service replaces the main machine room information in the active directory node with the information of its own machine room, and the election succeeds.
With these steps, when several second servers in a machine room become unavailable due to faults, the cloud platform's automatic container scheduling policy reschedules the container services from the unavailable second servers onto other available servers in the same machine room. The service containers thus resume normal operation through cloud platform scheduling, the machine room can continue to serve as the main machine room, and no main/standby switch is needed. A main/standby switch involves additional operations, including starting the service cluster of the standby machine room, backing up data, and setting up route forwarding, and these operations can make the platform system briefly unavailable. Containerizing the services avoids main/standby switches to a greater extent and thereby improves the stability of the whole platform system.
Step 4, electing the main machine room upon fault recovery, comprises the following steps:
Step 4.1: When a failed second server recovers, the cloud platform schedules a main selection service container onto that server.
Step 4.2: The main selection service connects to the Zookeeper cluster, creates a new ordered temporary directory node under the standby directory node, and registers the second server node information and the machine room information.
Step 4.3: The main selection service obtains the distributed exclusive lock and compares the priority of the main machine room recorded in the active directory node with the priority of its own machine room. If its priority is not higher than that of the existing main machine room, no election is needed.
Step 4.4: If its priority is higher than that of the existing main machine room, the main selection service replaces the main machine room information in the active directory node with the information of its own machine room, and the election succeeds.
Step 4.5: The other main selection services watching the standby directory node receive a notification of the node addition.
Step 5, actively switching the main and standby machine rooms during operation, comprises the following steps:
Step 5.1: The client calls the main selection service's interface for obtaining the current main/standby machine room information.
Step 5.2: The main selection service queries the active directory node in the Zookeeper cluster to obtain the main machine room information, queries the standby directory node to compile the standby machine room information, and returns both to the client.
Step 5.3: The client chooses a standby machine room and calls the set-main-machine-room interface of the main selection service.
Step 5.4: The main selection service replaces the main machine room information in the active directory node with the information of the standby machine room chosen by the client, and the active main machine room switch succeeds.
In the initialization step, whichever machine room first finishes starting its containers and bringing up its platform registers first in the Zookeeper cluster and is elected as the main machine room. Once the main machine room has been elected, the second servers in the standby machine room may be started or stopped without restriction. When a second server in the standby machine room starts, the container cloud platform built on it starts and the orchestrated main selection service container is launched; after startup completes, the current main and standby machine room information can be obtained by accessing the gateway address of the standby machine room and the main selection service interface.
Example 4:
As shown in FIG. 5, the present invention further provides a cloud platform disaster recovery machine room election device, including:
The first connection module is used for building, for each machine room participating in the election, a Kubernetes-based container cloud platform connected to all second servers in that machine room; when a second server starts, the container cloud platform schedules a main selection service container onto it, and the main selection service runs inside the main selection service container;
the second connection module is used for selecting one second server from each machine room and building a Zookeeper cluster spanning the machine rooms between the selected second servers and a first server, located outside the machine rooms, that is used for route forwarding;
The monitoring module is used for monitoring the current running state of all second servers and the connection state between the second servers and the Zookeeper cluster;
the automatic election module is used for automatically electing a main machine room among the machine rooms participating in the election according to the current running state of the second servers, the connection state between the Zookeeper cluster and the main selection service in each second server's main selection service container, and the second server node information and machine room information that the main selection services have registered in the Zookeeper cluster;
And the election result determining module is used for synchronously updating the election result, which comprises the main and standby machine room information, into the Zookeeper cluster.
The device further comprises a setting module for switching the main and standby machine rooms on demand.
A medium storing computer executable instructions which, when executed by a processing unit, are adapted to carry out the election method of any of the preceding claims.
An electronic device comprising a processing unit and a storage unit storing computer executable instructions that when executed by the processing unit are adapted to implement the election method of any of the preceding claims.
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (12)

1. A cloud platform disaster recovery machine room system, characterized by comprising:
At least two machine rooms and at least one first server used for route forwarding, wherein each machine room comprises at least one second server;
A Kubernetes-based container cloud platform is built on all second servers in each machine room and is used for scheduling a main selection service container onto a second server when that second server starts, the main selection service running inside the main selection service container;
A Zookeeper cluster spanning the machine rooms is built on the first server and on one second server selected from each machine room;
The main selection service is used for registering the node information of the started second server and the information of its machine room in the Zookeeper cluster, for competing for its machine room to become the main machine room, and for monitoring the start and stop of the other second servers; when any second server fails, the main selection service container on the failed second server is closed, the corresponding main selection service is disconnected from the Zookeeper cluster, the node information of the failed second server is deleted from the Zookeeper cluster, the main selection services on the other second servers are notified of the deleted node information, and the main selection services on the other second servers re-elect the main machine room after receiving the deletion notification.
2. The cloud platform disaster recovery machine room system of claim 1, wherein: the Zookeeper cluster comprises a data structure directory accessed by the main selection service, and the data structure directory comprises:
a selector directory node representing the namespace of the main selection service;
an active directory node for storing the main machine room information;
a standby directory node comprising a plurality of temporary directory nodes, wherein the temporary directory nodes store the second server node information and machine room information registered by each main selection service;
and a lock directory node used to generate the distributed exclusive lock during main machine room election.
3. The cloud platform disaster recovery machine room system of claim 2, wherein: the main selection service monitoring the start and stop of the other second servers comprises:
the main selection service watches the standby directory node, and when a node is added or deleted under the standby directory node, the main selection service receives a notification carrying the information of the added or deleted node.
4. A cloud platform disaster recovery machine room election method, characterized by comprising the following steps:
when a second server starts, the container cloud platform schedules a main selection service container onto that second server; the main selection service in the container registers the node information of the second server and the information of its machine room in a Zookeeper cluster, and the main selection service competes for its machine room to become the main machine room;
when any second server fails, the main selection service container on the failed second server is closed, the corresponding main selection service is disconnected from the Zookeeper cluster, the node information of the failed second server is deleted from the Zookeeper cluster, the main selection services on the other second servers are notified of the deleted node information, and the main selection services on the other second servers re-elect the main machine room after receiving the deletion notification;
When the failed second server returns to normal, the container cloud platform reschedules a main selection service container onto it; the main selection service in the container registers the node information of the server and the information of its machine room in the Zookeeper cluster, and the main selection service on the recovered second server competes again for the main machine room;
and the result of the main machine room election is updated into the Zookeeper cluster.
5. The cloud platform disaster recovery machine room election method according to claim 4, wherein the main selection service competing for its machine room to become the main machine room comprises the following steps:
the main selection services of the second servers compete in turn for the distributed exclusive lock in the Zookeeper cluster, in the order in which the second servers were started;
The main selection service that obtains the distributed exclusive lock checks whether the main machine room information in the Zookeeper cluster is empty. If it is empty, the service writes the information of its own machine room as the main machine room information of the Zookeeper cluster, and the election succeeds. If it is not empty, the service compares the priority of the main machine room recorded in the Zookeeper cluster with the priority of its own machine room and writes the information of the higher-priority machine room as the main machine room information of the Zookeeper cluster; the higher-priority machine room wins the election.
6. The cloud platform disaster recovery machine room election method according to claim 4, wherein the re-election of the main machine room by the main selection services on the other second servers after they receive the node deletion notification comprises the following steps:
The main selection services on the other second servers compete for the distributed exclusive lock;
The main selection service that obtains the distributed exclusive lock checks whether the deleted node belongs to the main machine room recorded in the Zookeeper cluster; if not, no election is needed;
If so, the main selection service queries whether any other second server node in the Zookeeper cluster still belongs to the main machine room. If other second server nodes remain in the main machine room, no election is needed; if none remain, the main selection service replaces the main machine room information in the Zookeeper cluster with the information of its own machine room, and the election succeeds.
7. The cloud platform disaster recovery machine room election method according to claim 4, wherein the re-election of the main machine room by the main selection service on the recovered second server comprises the following steps:
The main selection service on the recovered second server obtains the distributed exclusive lock and compares the priority of the main machine room recorded in the Zookeeper cluster with the priority of its own machine room; if its priority is not higher than that of the existing main machine room, no election is needed;
if its priority is higher than that of the existing main machine room, the main machine room information in the Zookeeper cluster is replaced with the information of the machine room of the recovered second server, and the election succeeds.
8. The cloud platform disaster recovery machine room election method according to claim 4, wherein the main selection service provides an interface for switching the main machine room and the standby machine room on demand.
9. A cloud platform disaster recovery machine room election device, characterized in that the device comprises:
The first connection module is used for building, for each machine room participating in the election, a Kubernetes-based container cloud platform connected to all second servers in that machine room; when a second server starts, the container cloud platform schedules a main selection service container onto it, and the main selection service runs inside the main selection service container;
the second connection module is used for selecting one second server from each machine room and building a Zookeeper cluster spanning the machine rooms between the selected second servers and a first server, located outside the machine rooms, that is used for route forwarding;
The monitoring module is used for monitoring the current running state of all second servers and the connection state between the second servers and the Zookeeper cluster;
the automatic election module is used for automatically electing a main machine room among the machine rooms participating in the election according to the current running state of the second servers, the connection state between the Zookeeper cluster and the main selection service in each second server's main selection service container, and the second server node information and machine room information that the main selection services have registered in the Zookeeper cluster;
And the election result determining module is used for synchronously updating the election result, which comprises the main and standby machine room information, into the Zookeeper cluster.
10. The cloud platform disaster recovery machine room election device according to claim 9, further comprising a setting module for switching between the main machine room and the standby machine room according to need.
11. A computer-readable storage medium storing computer-executable instructions, characterized in that the instructions, when executed by a processing unit, implement the election method of any one of claims 4 to 8.
12. An electronic device, characterized by comprising a processing unit and a storage unit, the storage unit storing computer-executable instructions that, when executed by the processing unit, implement the election method of any one of claims 4 to 8.
CN202111590393.2A 2021-12-23 2021-12-23 Cloud platform disaster recovery machine room election system, method, device, medium and electronic equipment Active CN114257595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111590393.2A CN114257595B (en) 2021-12-23 2021-12-23 Cloud platform disaster recovery machine room election system, method, device, medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN114257595A CN114257595A (en) 2022-03-29
CN114257595B true CN114257595B (en) 2024-05-17

Family

ID=80797175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111590393.2A Active CN114257595B (en) 2021-12-23 2021-12-23 Cloud platform disaster recovery machine room election system, method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114257595B (en)


Also Published As

Publication number Publication date
CN114257595A (en) 2022-03-29

Similar Documents

Publication Publication Date Title
CN114257595B (en) Cloud platform disaster recovery machine room election system, method, device, medium and electronic equipment
US11360854B2 (en) Storage cluster configuration change method, storage cluster, and computer system
CN109729111B (en) Method, apparatus and computer program product for managing distributed systems
CN109639794A (en) Stateful cluster recovery method, apparatus, device and readable storage medium
US20110178985A1 (en) Master monitoring mechanism for a geographical distributed database
US20030149735A1 (en) Network and method for coordinating high availability system services
WO2012071920A1 (en) Method, system, token controller and memory database for implementing distribute-type main memory database system
KR102192442B1 (en) Balanced leader distribution method and system in kubernetes cluster
CN103207867A (en) Method for processing data blocks, method for initiating recovery operation and nodes
CN112463366A (en) Cloud-native-oriented micro-service automatic expansion and contraction capacity and automatic fusing method and system
CN115102839B (en) Master-slave node election method, device, equipment and medium
JP4459999B2 (en) Non-stop service system using voting and information updating and providing method in the system
CN109560903B (en) Vehicle-mounted command communication system for complete disaster recovery
CN116561096A (en) Database management method and system based on container platform
CN111966466A (en) Container management method, device and medium
Lynch et al. Atomic Data access in Content Addressable Networks, A Position Paper
CN112087506B (en) Cluster node management method and device and computer storage medium
CN116781711A (en) Node deployment method and device and electronic equipment
CN104052799A (en) Method for achieving high availability storage through resource rings
CN114598593B (en) Message processing method, system, computing device and computer storage medium
CN209134427U (en) Vehicle-mounted command communication system for complete disaster recovery
CN110351122A (en) Disaster recovery method, device, system and electronic equipment
CN116346582A (en) Method, device, equipment and storage medium for realizing redundancy of main network and standby network
CN111400110B (en) Database access management system
CN114448995A (en) Distributed computing method based on Raft leader election strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant