CN114257595A - Cloud platform disaster tolerance machine room election system, method, device, medium and electronic equipment

Publication number: CN114257595A (granted as CN114257595B)
Application number: CN202111590393.2A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Active (granted)
Inventors: 石鸿伟, 张阔意, 史精文, 黄韬
Assignee: Network Communication and Security Zijinshan Laboratory
Application filed by Network Communication and Security Zijinshan Laboratory
Prior art keywords: server, main, service, room, computer room

Classifications

    • G - PHYSICS > G06 - COMPUTING; CALCULATING OR COUNTING > G06F - ELECTRIC DIGITAL DATA PROCESSING
        • G06F 11/00 - Error detection; Error correction; Monitoring
        • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
        • G06F 11/16 - Error detection or correction of the data by redundancy in hardware
        • G06F 11/20 - Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
        • G06F 11/202 - Error detection or correction by redundancy in hardware where processing functionality is redundant
        • G06F 11/2023 - Failover techniques
        • G06F 11/2033 - Failover techniques switching over of hardware resources
        • G06F 9/00 - Arrangements for program control, e.g. control units
        • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
        • G06F 9/44 - Arrangements for executing specific programs
        • G06F 9/455 - Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
        • G06F 9/45533 - Hypervisors; Virtual machine monitors
        • G06F 9/45558 - Hypervisor-specific management and integration aspects
        • G06F 2009/45562 - Creating, deleting, cloning virtual machine instances
        • G06F 2009/45575 - Starting, stopping, suspending or resuming virtual machine instances
        • G06F 2009/45595 - Network integration; Enabling network access in virtual machine instances
    • H - ELECTRICITY > H04 - ELECTRIC COMMUNICATION TECHNIQUE > H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
        • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
        • H04L 67/01 - Protocols
        • H04L 67/10 - Protocols in which an application is distributed across nodes in the network
        • H04L 67/1097 - Protocols for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
        • H04L 67/14 - Session management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention discloses a cloud platform disaster-recovery machine room election system, method, device, medium and electronic device. The system comprises at least two machine rooms and at least one first server used for route forwarding. Each machine room selects one second server, and these servers, together with the first server, form a Zookeeper cluster spanning the machine rooms. Within each machine room, a container cloud platform based on a Kubernetes cluster is built across the servers. When a server node in a machine room starts, the cloud platform schedules a master-election service container onto it; the master-election service registers the server node information in the Zookeeper cluster and contends, through the cluster, for its machine room to become the master machine room. When a second server in a machine room fails, the Zookeeper cluster notifies the other servers to re-contend for the master machine room. The election scheme is not limited by the number of servers, achieves automatic election of the master machine room, and suits machine-room disaster-recovery election scenarios under cloud platform technology.

Description

Cloud platform disaster tolerance machine room election system, method, device, medium and electronic equipment
Technical Field
The invention relates to the technical field of computers, and in particular to a cloud platform disaster-recovery machine room election system, method, device, medium, and electronic device.
Background
A server room is a space designed for the continuous operation of computer servers. To improve the reliability of application services and guard against unexpected failures, the servers in a machine room are usually given disaster-recovery backups, or a disaster-recovery backup center, i.e. a standby machine room, is newly built. The original machine room serves as the normally used master machine room, and the standby machine room is activated when the master machine room fails. To maintain high reliability, the disaster-recovery backup center and the master machine room use different power supplies, for example by placing the backup center at a remote site.
In the traditional machine-room disaster-recovery scheme, a distributed message coordination program is deployed on every server in the machine rooms and configured as one cluster. The cluster uses a node-election mechanism in which more than half of the nodes must agree for an election to take effect. This majority principle requires an odd number of servers: a cluster of 2N+1 servers needs the agreement of N+1 servers before an election is valid. In a typical dual-machine-room deployment, an even number of servers is therefore placed in one machine room and an odd number in the other. This traditional scheme has the following problems:
(1) If the machine room holding the odd number of servers suffers a fault such as a network outage or power failure, the machine room with the even number of servers cannot reach a majority and still cannot work.
(2) The traditional scheme cannot cover the currently popular cloud platform disaster-recovery scenarios.
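The quorum constraint described above can be sketched numerically (an illustration only, not part of the patent; the per-room node counts are hypothetical):

```python
# Sketch: quorum arithmetic for the traditional 2N+1 coordination cluster.

def quorum(total_nodes: int) -> int:
    """Minimum number of agreeing nodes for a majority-based election."""
    return total_nodes // 2 + 1

# A 5-node cluster split across two rooms: room A holds 3, room B holds 2.
total = 5
room_a, room_b = 3, 2

assert quorum(total) == 3          # N+1 with N=2
# If room A (the odd-count room) goes down, room B's 2 survivors
# cannot reach the quorum of 3, so the whole cluster stops electing.
assert room_b < quorum(total)
```

This is exactly problem (1): the room holding the majority of nodes is a single point of failure for the election mechanism.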
Disclosure of Invention
Technical purpose: to solve the above technical problems, the invention discloses a cloud platform disaster-recovery machine room election system, method, device, medium and electronic device that achieve automatic election of the master machine room and are not limited by the number of servers.
Technical scheme: to achieve this purpose, the invention adopts the following technical scheme. A cloud platform disaster-recovery machine room system comprises:
the system comprises at least two machine rooms and at least one first server for routing forwarding, wherein each machine room comprises at least one second server;
a container cloud platform based on a Kubernetes cluster, built on all second servers in each machine room; when a second server starts, the container cloud platform schedules a master-election service container onto it, and the master-election service container carries the master-election service;
a Zookeeper cluster spanning the machine rooms, built on the first server together with one second server selected arbitrarily from each machine room;
wherein the master-election service registers, in the Zookeeper cluster, the node information of the started second server and the information of the machine room where it is located, contends for its machine room to become the master machine room, and monitors the start and stop of the other second servers.
Further, the Zookeeper cluster includes a data-structure directory accessed by the master-election service, the directory comprising:
a selector directory node representing the namespace of the master-election service;
an active directory node storing the information of the master machine room;
a standby directory node containing a number of ephemeral directory nodes, which store the second-server node information registered by each master-election service and the information of the machine room it belongs to;
a lock directory node used to generate a distributed exclusive lock during master-machine-room election.
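The directory layout above can be modeled in miniature (an in-memory sketch; the path spellings and child names such as "server-1" are illustrative assumptions, not taken from the patent):

```python
# Sketch of the Zookeeper data-structure directory used by the election
# service: a namespace node with active, standby and lock children.
znode_tree = {
    "/selector": None,                      # namespace of the election service
    "/selector/active": "roomA",            # current master machine room
    "/selector/standby": {                  # ephemeral registrations
        "server-1": {"room": "roomA"},
        "server-2": {"room": "roomB"},
    },
    "/selector/lock": [],                   # distributed exclusive lock znodes
}

def registered_rooms(tree):
    """Rooms that still have at least one live (ephemeral) registration."""
    return {info["room"] for info in tree["/selector/standby"].values()}

assert registered_rooms(znode_tree) == {"roomA", "roomB"}
```

Because the standby children are ephemeral, a server failure removes its entry automatically once its session drops, which is what drives the re-election notifications described later.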
Further, the master-election service monitors the start and stop of the other second servers as follows:
the master-election service watches the standby directory node, and when a child node is added or deleted under the standby directory node, the master-election service receives a notification carrying the added or deleted node information.
A cloud platform disaster-recovery machine room election method comprises the following steps:
after a second server starts, the container cloud platform schedules a master-election service container onto it; the master-election service in the container registers the node information of the second server and the information of its machine room in the Zookeeper cluster, and contends for its machine room to become the master machine room;
when any second server fails, the master-election service container on it is closed and the corresponding master-election service is disconnected from the Zookeeper cluster; the node information of the failed server is deleted from the Zookeeper cluster, the deletion is notified to the master-election services on the other second servers, and those services re-contend for the master machine room after receiving the notification;
when the failed second server recovers, the container cloud platform reschedules a master-election service container onto it; the master-election service in the container registers the server node information and the machine-room information in the Zookeeper cluster, and the master-election service on the recovered second server contends for the master machine room again;
the election result for the master machine room is updated into the Zookeeper cluster.
Further, a master-election service contends for its machine room to become the master machine room through the steps:
the master-election services of the second servers contend for the distributed exclusive lock in the Zookeeper cluster in the order in which the second servers started;
the master-election service that acquires the distributed exclusive lock checks whether the master-machine-room information in the Zookeeper cluster is empty; if it is empty, the service writes the information of its own machine room as the master-machine-room information of the Zookeeper cluster, and its machine room wins the election; if it is not empty, the service compares the priority of the recorded master machine room with that of its own machine room, writes the higher-priority machine room into the master-machine-room information of the Zookeeper cluster, and the higher-priority machine room wins the election.
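The contend-for-master step performed while holding the lock can be sketched as a pure function (a simplified model; the priority table and room names are hypothetical, with a larger number meaning higher priority):

```python
# Sketch of the contend step executed while holding the distributed
# exclusive lock: empty master slot -> take it; otherwise the
# higher-priority room is written back as master.
PRIORITY = {"roomA": 2, "roomB": 1}   # hypothetical room priorities

def contend(active_room, my_room, priority=PRIORITY):
    """Return the master room recorded after one contend attempt."""
    if active_room is None:
        return my_room                         # empty: first contender wins
    if priority[my_room] > priority[active_room]:
        return my_room                         # higher priority takes over
    return active_room                         # otherwise keep current master

assert contend(None, "roomB") == "roomB"
assert contend("roomB", "roomA") == "roomA"
assert contend("roomA", "roomB") == "roomA"
```

The lock matters because the check-then-write on the active node is not atomic by itself; serializing contenders through the lock directory keeps the comparison race-free.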
Further, the master-election services on the other second servers re-elect the master machine room after receiving a node-deletion notification through the steps:
the master-election services on the other second servers contend for the distributed exclusive lock;
the master-election service that acquires the lock checks whether the deleted node belonged to the master machine room recorded in the Zookeeper cluster; if not, no re-election is needed;
if it did belong to the master machine room, the master-election service queries whether any second-server node of the master machine room remains in the Zookeeper cluster; if other second-server nodes remain, no re-election is needed; if none remain, the service replaces the master-machine-room information in the Zookeeper cluster with the information of its own machine room, which thereby wins the election.
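This deletion-handling rule can likewise be sketched as a pure function (a simplified model under the assumption that registrations are a server-to-room map; names are illustrative):

```python
# Sketch of the re-election decision after a node-deleted notification:
# switch masters only if the deleted node belonged to the master room
# AND no other server of that room remains registered.
def reelect_on_delete(active_room, deleted_room, registrations, my_room):
    """Return the master room after an ephemeral registration disappears."""
    if deleted_room != active_room:
        return active_room                 # deleted node was standby-side
    if active_room in registrations.values():
        return active_room                 # master room still has servers
    return my_room                         # master room empty: take over

regs = {"s2": "roomB"}                     # only roomB servers remain
assert reelect_on_delete("roomA", "roomB", regs, "roomB") == "roomA"
assert reelect_on_delete("roomA", "roomA", regs, "roomB") == "roomB"
```

This encodes the "avoid switching whenever possible" principle: a master room that still has any live server keeps its role.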
Further, the master-election service on the recovered second server re-contends for the master machine room through the steps:
the master-election service on the recovered second server acquires the distributed exclusive lock and compares the priority of the master machine room recorded in the Zookeeper cluster with that of its own machine room; if its machine room's priority is not higher than that of the existing master machine room, no re-election is needed;
if its priority is higher than that of the existing master machine room, the master-election service on the recovered second server replaces the master-machine-room information in the Zookeeper cluster with the information of its own machine room, which thereby wins the election.
Further, the master-election service provides an interface for switching the master and standby machine rooms on demand.
A cloud platform disaster-recovery machine room election device comprises:
a first connection module, which builds a container cloud platform based on a Kubernetes cluster for each machine room participating in the election and connects the platform to all second servers in the machine room; when a second server starts, the container cloud platform schedules a master-election service container carrying the master-election service onto it;
a second connection module, which selects one second server from each machine room and builds a Zookeeper cluster spanning the machine rooms between the selected second servers and a first server, located outside the machine rooms, that is used for route forwarding;
a monitoring module, which monitors the current running state of every second server and its connection state with the Zookeeper cluster;
an automatic election module, which automatically elects a master machine room among the participating machine rooms according to the current running state of each second server, the connection state between the master-election service in each selector container and the Zookeeper cluster, and the second-server node information and machine-room information that each master-election service has registered in the Zookeeper cluster;
an election result determination module, which synchronously updates the election result, including the master and standby machine-room information, into the Zookeeper cluster.
Further, the device also comprises a setting module for switching the master and standby machine rooms on demand.
A medium stores computer-executable instructions which, when executed by a processing unit, implement any of the aforementioned election methods.
An electronic device comprises a processing unit and a storage unit storing computer-executable instructions which, when executed by the processing unit, implement any of the aforementioned election methods.
Beneficial effects: by adopting the above technical scheme, the invention achieves the following technical effects:
1. The disaster-recovery scheme builds a container cloud platform in every machine room and deeply integrates the Kubernetes container engine, realizing containerized deployment of application services in which the containers are isolated and do not affect one another. A Zookeeper cluster spanning the machine rooms is built between one server in each machine room and the routing-load server, making full use of the Zookeeper file-system management and watch-notification mechanism: when directory-node information in the Zookeeper cluster is updated, all master-election services receive the notification, which simplifies the management and coordination of the system.
2. The scheme starts from the currently popular cloud platform scenario and escapes the constraint on the number of failed underlying servers found in the traditional machine-room disaster-recovery scheme: as long as the master machine room's service containers can be restored to normal operation through scheduling, the master machine room does not need to be re-elected frequently, which improves the stability of the disaster-recovery machine rooms.
3. The scheme realizes automatic machine-room election and supports machine-room priorities: after a high-priority machine room starts, it automatically becomes the master machine room; and when several servers in a machine room fail, the cloud platform's automatic container-scheduling strategy moves the container services on the unavailable servers to other available servers in the room, so the machine room keeps working normally.
4. The scheme also supports manual switching of the master and standby roles of the machine rooms, making it more flexible and controllable.
Drawings
Fig. 1 is a schematic structural diagram of a cloud platform-based computer room disaster recovery system in an embodiment of the present invention;
FIG. 2 is a node diagram of a election master service container in an embodiment of the present invention;
FIG. 3 is a table of directory node information within a Zookeeper cluster in an embodiment of the present invention;
fig. 4 is a timing diagram of disaster recovery operation of the machine room in the embodiment of the present invention;
fig. 5 is a schematic diagram of the cloud platform disaster-recovery machine room election device in the embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to the accompanying drawings.
Aiming at machine-room disaster-recovery election in the currently popular cloud platform scenario, when a server in the master machine room fails the invention first tries to restore the current master machine room through automatic container scheduling, and switches to the standby machine room only when the master machine room cannot be restored. A remote disaster-recovery system is formed by at least two machine rooms and a first server: the servers inside the machine rooms are denoted second servers, while the first server sits outside the machine rooms and is mainly used for route forwarding, i.e. routing rules are set so that client traffic flows to the master machine room.
as shown in fig. 1, a machine room a cluster composed of a plurality of second servers is deployed in the machine room a, a machine room B cluster composed of a plurality of second servers is deployed in the machine room B, one second server and a first server are selected from each machine room to form a Zookeeper (distributed message coordination management program) cluster across the machine room, one second server selected from the machine room a can be any one second server in the machine room a cluster, one server selected from the machine room B can be any one second server in the machine room B cluster and respectively serve as a zk-1 node and a zk-2 node of the Zookeeper cluster, and the first server serves as a zk-0 node of the Zookeeper cluster. The Zookeeper is a high-performance distributed consistent system, based on the advantages of a Zookeeper file system management and monitoring notification mechanism, the Zookeeper can ensure strong consistency of data, and data of a user accessing any directory node in a Zookeeper cluster at any time are the same; different services monitor the same directory node, and once the contents of the directory node are updated, all the services can receive the notification, so that the management and coordination capacity of the distributed application is simplified.
A container cloud platform based on a Kubernetes cluster is built across the second servers within a single machine room, and concrete business programs provide services to external clients or web front ends as containers on this platform. The platform deeply integrates the Kubernetes container engine and realizes containerized deployment of application services, with containers isolated from and unaffected by one another, which guarantees high availability, scalability and portability of the services. High availability: after a service is containerized, a crash caused by faults such as memory leaks does not affect other services, and the service is quickly rebuilt and recovered under cloud platform scheduling. High scalability: the number of service containers can be rapidly scaled up or down as needed, responding to workload changes in real time. High portability: once a container image is created, the service container can easily be moved to different environments and deployed rapidly while staying highly consistent across environments.
A client accesses the remote disaster-recovery system through a VIP (virtual IP), and the routing-load service directs the traffic to the master machine room according to the elected master machine room's IP. Container scheduling of the master-election service (selector) on the cloud platform is configured with anti-affinity, which places the selector containers on different second servers. When a second server in the machine room starts, the cloud platform schedules a selector container, an instance of the running master-election service, onto that server node according to the anti-affinity rule.
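The anti-affinity placement described above can be expressed in Kubernetes as a pod anti-affinity rule (a minimal sketch; the label `app: selector` and the choice of topology key are assumptions for illustration, not taken from the patent):

```yaml
# Sketch: allow at most one selector pod per node via pod anti-affinity.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: selector                  # hypothetical pod label
        topologyKey: kubernetes.io/hostname
```

With `topologyKey: kubernetes.io/hostname`, the scheduler refuses to co-locate two selector pods on the same node, which is what spreads the election service across the second servers.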
The master-election service inside the selector container scheduled onto a second server accesses the zk cluster; by configuring the IPs of the three zk nodes in the service, any zk node can be reached. Once connected to the Zookeeper cluster, the master-election service registers the machine-room information and the node information of its second server, tries to contend for the master machine room, keeps its connection state with zk, and watches the start and stop of the other second servers. When a second server fails, its selector container disappears with it, the server is disconnected from the Zookeeper cluster, its registration information in the cluster is deleted, the deletion is notified to the master-election services on the other second servers, and those services contend for the master machine room under the principle of avoiding switchover whenever possible.
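The register-watch-notify loop above can be simulated in miniature (an in-memory stand-in for the Zookeeper watch mechanism; the class and method names are assumptions for illustration):

```python
# Minimal in-memory stand-in for the Zookeeper watch mechanism: election
# services watch the standby directory and are notified on add/delete.
class StandbyDirectory:
    def __init__(self):
        self.children = {}       # server name -> machine room
        self.watchers = []       # callbacks fired on add/delete events

    def register(self, server, room):
        self.children[server] = room
        for cb in self.watchers:
            cb("added", server)

    def disconnect(self, server):
        room = self.children.pop(server)   # ephemeral node vanishes
        for cb in self.watchers:
            cb("deleted", server)
        return room

events = []
directory = StandbyDirectory()
directory.watchers.append(lambda kind, server: events.append((kind, server)))

directory.register("server-1", "roomA")
directory.disconnect("server-1")           # simulated server failure
assert events == [("added", "server-1"), ("deleted", "server-1")]
```

In the real system the "deleted" callback is where the surviving master-election services would contend for the lock and run the re-election check.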
Example 1:
a cloud platform disaster recovery computer lab system, its characterized in that includes:
the system comprises at least two machine rooms and at least one first server for routing forwarding, wherein each machine room comprises at least one second server;
constructing a container cloud platform based on a Kubernetes cluster on all second servers in each machine room, wherein the container cloud platform is used for scheduling and selecting a main service container for the started second servers when the second servers are started, and the main service container is loaded with main service selection;
constructing a Zookeeper cluster crossing the machine rooms on any second server and the first server selected from each machine room;
the master selecting service is used for registering the node information of the started second server and the information of the machine room where the second server is located in the Zookeeper cluster, and is used for selecting the machine room where the second server is located to be a master machine room and monitoring the starting and stopping of other second servers.
Preferably, the Zookeeper cluster includes a data-structure directory accessed by the master-election service, the directory comprising:
a selector directory node representing the namespace of the master-election service;
an active directory node storing the information of the master machine room;
a standby directory node containing a number of ephemeral directory nodes, which store the second-server node information registered by each master-election service and the information of its machine room;
a lock directory node used to generate a distributed exclusive lock during master-machine-room election.
Preferably, the master-election service monitors the start and stop of the other second servers as follows:
the master-election service watches the standby directory node, and when a child node is added or deleted under the standby directory node, the master-election service receives a notification carrying the added or deleted node information.
Example 2:
a cloud platform disaster recovery computer room election method comprises the following steps:
after the second server is started, the container cloud platform selects a main service container for scheduling, a main service in the main service container is selected to register node information of the second server and information of the machine room in the Zookeeper cluster, and the main service is selected to race for the machine room in which the main service is located;
when any one second server fails, closing a master selection service container on the failed second server, disconnecting the corresponding master selection service from the Zookeeper cluster, deleting the node information of the failed server in the Zookeeper cluster, notifying the deleted node information to the master selection services on other second servers, and re-competing for a master computer room after the master selection services on other second servers receive the notification of the deleted node;
when the second server which fails recovers to be normal, the container cloud platform reschedules the second server which recovers to be normal to select a main service container, the main service in the main service container is selected to register the information of the server node and the information of the machine room in the Zookeeper cluster, and the main service on the second server which recovers to be normal is selected to race for the main machine room again;
and updating the election result of the main computer room into the Zookeeper cluster.
Preferably, the master-election service elects the master room for the machine room where it is located through the following steps:
the master-election services of the second servers contend for the distributed exclusive lock in the Zookeeper cluster in the order in which the second servers started;
the master-election service that acquires the distributed exclusive lock checks whether the master-room information in the Zookeeper cluster is empty; if it is empty, the service writes the information of its own machine room as the master-room information of the Zookeeper cluster, and the master-room election succeeds; if it is not empty, the priorities of the master room recorded in the Zookeeper cluster and the machine room where the service is located are compared, the information of the higher-priority machine room is written as the master-room information of the Zookeeper cluster, and the higher-priority machine room wins the master-room election.
Preferably, the master-election services on the other second servers re-electing the master room after receiving the deleted-node notification comprises the steps of:
the master-election services on the other second servers contend for the distributed exclusive lock;
the master-election service that acquires the distributed exclusive lock determines whether the deleted node belongs to the master room recorded in the Zookeeper cluster; if not, no master-room election is needed;
if the deleted node belongs to the master room, the master-election service queries whether any second-server node belonging to the master room remains in the Zookeeper cluster; if other second-server nodes remain under the master room, no election is needed; if no other second-server node remains under the master room, the master-election service replaces the master-room information in the Zookeeper cluster with its own machine room's information, and the master-room election succeeds.
Preferably, the master-election service on the recovered second server re-electing the master room comprises the steps of:
the master-election service on the recovered second server acquires the distributed exclusive lock and compares the priorities of the master room in the Zookeeper cluster and the machine room where it is located; if its priority is not higher than that of the existing master room, no master-room election is needed;
if its priority is higher than that of the existing master room, the master-election service on the recovered second server replaces the master-room information in the Zookeeper cluster with its own machine room's information, and the master-room election succeeds.
Preferably, the master-election service provides an interface for switching the master and standby machine rooms on demand.
Example 3:
This embodiment provides a cloud platform machine-room disaster-recovery election method, comprising the following steps:
Step 1: select one second server from each of at least two disaster-recovery machine rooms, build a Zookeeper cluster spanning the machine rooms from those servers together with the first server, and build a Kubernetes-based container cloud platform among the second servers within each machine room. Specifically:
as shown in fig. 1, the machine-room disaster-recovery system is initialized and started, and a Zookeeper cluster spanning the machine rooms is established among a second server in each of the two machine rooms and the first server;
the cloud platform in each machine room is started, and master-election service containers are then scheduled among the second servers using an anti-affinity policy;
after a master-election service starts running, it connects to the Zookeeper cluster and registers the node information of the second server where it is located and the information of its machine room.
Step 2: elect the master room at startup.
After the master-election service container on each second server starts, it immediately contends for the distributed exclusive lock; the service that acquires the lock elects the master room according to rules such as the machine rooms' startup order and priority, contending for the machine room where its second server is located.
Step 3: re-elect the master room on failure.
When a second server fails, its registration information in the Zookeeper cluster disappears; the master-election services on the other second servers receive a notification that the server has gone offline, determine whether the current master room is still available, and re-elect the master room following the principle of switching as little as possible.
Step 4: re-elect the master room on failure recovery.
When a second server recovers from a failure, a master-election service container is scheduled on it; the service evaluates the priority of the machine room where the server is located and attempts to contend for the master room. The master-election services on the other second servers receive a notification that the server is back online.
Step 5: actively switch the master and standby rooms at runtime.
The master-election service of the disaster-recovery system provides an interface for switching the master and standby machine rooms on client demand; while the system is running, a client can obtain the current master/standby room information and manually switch the master and standby rooms as needed.
Step 1 comprises the following steps:
Step 1.1: as shown in fig. 1, the disaster-recovery system is initialized and started, and a Zookeeper cluster spanning the machine rooms is established among a second server in machine room A, a second server in machine room B, and the first server.
Step 1.2: as shown in fig. 2, when the cloud platform in each machine room starts, anti-affinity scheduling places a master-election service (selector) container on each second server of the cloud platform, and the container is loaded with the master-election service. Zookeeper is a distributed coordination service framework that, as shown in fig. 3, maintains a data structure similar to a file system. It contains a plurality of directory nodes, such as the persistent directory nodes selector, active, standby and lock; persistent directory nodes survive after a client disconnects from Zookeeper, and a plurality of temporary (ephemeral) directory nodes are created under them.
Step 1.3: after the master-election service starts, it connects to the Zookeeper cluster and initializes the selector, active, standby and lock persistent directory nodes, as shown in figs. 3 and 4. The selector directory node represents the namespace of the master-election service, which confines every node operation of the service to the designated directory space; the active directory node stores the master-room information, the standby directory node stores each second-server node's information, and the lock directory node is used to generate the distributed exclusive lock during master-room election. Taking the entry "standby/ipA-2-001" under the standby directory node in fig. 3 as an example: "ipA" is the room IP of machine room A where the second server is located, and the middle field ("2" here) is the priority of machine room A; the last 3 digits "001" form an increasing sequence number representing the registration order of the master-election services in the Zookeeper cluster — whenever a second server starts, the master-election service scheduled on it registers in the Zookeeper cluster, and the sequence number is incremented.
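The `<roomIp>-<priority>-<seq>` node-name encoding described above can be illustrated with a small encoder/parser. This is a minimal sketch assuming every standby entry has exactly that three-field form; the function and field names are illustrative, not from the patent.

```python
from typing import NamedTuple

class StandbyNode(NamedTuple):
    room_ip: str   # IP identifying the machine room (e.g. "ipA")
    priority: int  # priority of that machine room
    seq: int       # registration order of the election service in the cluster

def parse_standby_node(name: str) -> StandbyNode:
    """Parse an entry such as 'ipA-2-001' found under the standby directory node."""
    room_ip, priority, seq = name.rsplit("-", 2)
    return StandbyNode(room_ip, int(priority), int(seq))

def make_standby_node(room_ip: str, priority: int, seq: int) -> str:
    """Build the node name registered when a second server starts (seq increases)."""
    return f"{room_ip}-{priority}-{seq:03d}"
```

For example, `parse_standby_node("ipA-2-001")` yields room IP `"ipA"`, priority `2`, sequence number `1`.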
Step 1.4: the master-election service keeps its connection to the Zookeeper cluster, creates an ordered temporary directory node under the standby directory node, and registers the second server's node information and the machine room's information.
Step 1.5: the master-election service sets a watch on the standby directory node. When a node is added or deleted under the standby directory node, the master-election service receives a notification carrying the added or deleted node's information.
Step 2 of electing the master room at startup comprises the following steps:
Step 2.1: the master-election service on each second server node contends for the distributed exclusive lock in startup order.
Step 2.2: the service that acquires the lock checks whether the master-room information in the active directory node is empty; if it is empty, the service writes the information of its own machine room there, and the master-room election succeeds.
Step 2.3: if it is not empty, the priorities of the master room in the active directory node and the machine room where the service is located are compared, and the higher-priority machine room's information is written into the active directory node.
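Steps 2.2–2.3 can be sketched as a pure function over the content of the active node. The distributed lock is taken as given (callers run one at a time), `active` holds `(room_ip, priority)` or `None`, and a higher number is assumed to mean higher priority — the patent does not fix the ordering. Names are illustrative.

```python
from typing import Optional, Tuple

Room = Tuple[str, int]  # (room_ip, priority); higher number = higher priority (assumed)

def elect_at_startup(active: Optional[Room], my_room: Room) -> Room:
    """Run by the holder of the distributed exclusive lock against the active node."""
    if active is None:
        return my_room                       # step 2.2: empty -> own room wins
    # step 2.3: keep whichever room has the higher priority
    return my_room if my_room[1] > active[1] else active
```

Because the ties go to `active`, the current master room is kept when priorities are equal, consistent with the "switch as little as possible" principle stated later.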
Step 3 of re-electing the master room on failure comprises the following steps:
Step 3.1: when a second server fails, the master-election service running on it goes down with it. Because master-election service containers are scheduled with anti-affinity, the cloud platform does not reschedule the container onto another second server, so the master-election service container — and thus the second server — is disconnected from the Zookeeper cluster.
Step 3.2: the temporary node created by that master-election service under the standby directory node of the Zookeeper cluster is deleted. Zookeeper then notifies all other master-election services watching the standby directory node of the deleted node's information.
Step 3.3: the other master-election services, once notified of the node deletion, immediately contend for the distributed exclusive lock.
Step 3.4: the service that acquires the lock determines whether the deleted node belongs to the master room recorded in the active directory node; if not, no master-room election is needed.
Step 3.5: if it does, the service queries whether any server node of the master room remains under the standby directory node. If other nodes of the master room remain, no election is needed.
Step 3.6: if no other second-server node remains under the master room, the service replaces the master-room information in the active directory node with its own machine room's information, and the master-room election succeeds.
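Steps 3.4–3.6 reduce to the following check, again executed only by the lock holder. `standby` is the list of surviving node names under the standby directory node; the `<roomIp>-<priority>-<seq>` layout is carried over from fig. 3, and the helper names are assumptions.

```python
from typing import List, Tuple

Room = Tuple[str, int]  # (room_ip, priority)

def room_of(node_name: str) -> Room:
    """Extract (room_ip, priority) from a standby entry like 'ipA-2-001'."""
    room_ip, priority, _seq = node_name.rsplit("-", 2)
    return (room_ip, int(priority))

def elect_on_failure(active: Room, deleted: str,
                     standby: List[str], my_room: Room) -> Room:
    """Re-elect after a node deletion, switching rooms only when unavoidable."""
    if room_of(deleted)[0] != active[0]:
        return active            # step 3.4: deleted node was not in the master room
    if any(room_of(n)[0] == active[0] for n in standby):
        return active            # step 3.5: master room still has live nodes
    return my_room               # step 3.6: master room is empty -> lock holder's room
```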
Following the above steps, when several second servers in one machine room become unavailable due to failures, the cloud platform's automatic container-scheduling policy dispatches the container services on the unavailable servers to other available servers in the same room, so the machine room keeps working and the master room's service containers are restored to normal operation through cloud-platform scheduling, without switching between the master and standby rooms. A master/standby switch entails additional operations — starting the standby room's service cluster, backing up data, and configuring route forwarding, among others — and these operations can make the platform temporarily unavailable. Avoiding master/standby switches as far as possible therefore improves the stability of the whole platform system.
Step 4 of re-electing the master room on failure recovery comprises the following steps:
Step 4.1: when a failed second server recovers, the cloud platform schedules a master-election service container on it.
Step 4.2: the master-election service connects to the Zookeeper cluster, re-creates an ordered temporary directory node under the standby directory node, and registers the second server's node information and machine-room information.
Step 4.3: the master-election service acquires the distributed exclusive lock and compares the priorities of the master room in the active directory node and the machine room where it is located. If its priority is not higher than that of the existing master room, no election is needed.
Step 4.4: if its priority is higher than that of the existing master room, the service replaces the master-room information in the active directory node with its own machine room's information, and the master-room election succeeds.
Step 4.5: the other master-election services watching the standby directory node receive a notification of the added node.
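Steps 4.3–4.4 are a single priority comparison by the lock holder. A strict `>` keeps the current master room on ties, again matching the "switch as little as possible" principle; as before, the sketch assumes a higher number means higher priority.

```python
from typing import Tuple

Room = Tuple[str, int]  # (room_ip, priority)

def elect_on_recovery(active: Room, my_room: Room) -> Room:
    """Run by the recovered server's election service after it re-registers."""
    if my_room[1] > active[1]:
        return my_room   # step 4.4: recovered room has strictly higher priority
    return active        # step 4.3: otherwise keep the existing master room
```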
Step 5 of actively switching the master and standby rooms at runtime comprises the following steps:
Step 5.1: the client calls the master-election service's interface for obtaining the current master/standby room information.
Step 5.2: the master-election service queries the active directory node in the Zookeeper cluster for the master-room information, queries the standby directory node to assemble the standby-room information, and returns both to the client.
Step 5.3: the client selects a standby room and calls the master-election service's set-master-room interface.
Step 5.4: the master-election service replaces the master-room information in the active directory node with the standby-room information selected by the client, and the active switch of the master room succeeds.
In the initialization step, the method of the invention first has the machine rooms whose platform containers have started register in the Zookeeper cluster, and then elects the master room among them. Once the master room has been elected, the starting and stopping of second servers in the standby room are unrestricted. When a second server in the standby room starts, the container cloud platform built on it starts and the scheduled master-election service container starts; after startup completes, the current master/standby room information can be obtained by accessing the standby room's gateway address and the master-election service interface.
Example 4:
As shown in fig. 5, the present invention further provides a cloud platform disaster-recovery machine room election device, comprising:
a first connection module for building a Kubernetes-based container cloud platform for each machine room participating in the election, the container cloud platform being connected with all second servers in the machine room; the container cloud platform is used to schedule a master-election service container onto a second server when that server starts, the container being loaded with the master-election service;
a second connection module for selecting one second server in each machine room and building a Zookeeper cluster spanning the machine rooms between the selected second servers and a first server, located outside the machine rooms, that is used for route forwarding;
a monitoring module for monitoring the current running state of each second server and its connection state with the Zookeeper cluster;
an automatic election module for automatically electing a master room among the machine rooms participating in the election according to the current running state of each second server, the connection state between the master-election service in the container on that server and the Zookeeper cluster, and the second-server node information and machine-room information registered by the master-election service in the Zookeeper cluster;
and an election result determining module for synchronously updating the election result, including the master and standby room information, into the Zookeeper cluster.
The device further comprises a setting module for switching the master and standby machine rooms on demand.
A medium storing computer-executable instructions which, when executed by a processing unit, implement any of the aforementioned election methods.
An electronic device comprising a processing unit and a storage unit storing computer-executable instructions which, when executed by the processing unit, implement any of the aforementioned election methods.
The above description covers only preferred embodiments of the present invention. It should be noted that various modifications and adaptations apparent to those skilled in the art may be made without departing from the principles of the invention, and such modifications are intended to fall within the scope of the invention.

Claims (12)

1. A cloud platform disaster-recovery machine room system, characterized by comprising:
at least two machine rooms and at least one first server for route forwarding, each machine room comprising at least one second server;
a Kubernetes-based container cloud platform built on all second servers in each machine room, the container cloud platform being used to schedule a master-election service container onto a second server when that server starts, the container being loaded with a master-election service;
a Zookeeper cluster spanning the machine rooms, built on one second server selected from each machine room and the first server;
wherein the master-election service is used to register the started second server's node information and machine-room information in the Zookeeper cluster, to contend for the machine room where the second server is located to become the master room, and to monitor the starting and stopping of the other second servers.
2. The cloud platform disaster-recovery machine room system according to claim 1, characterized in that the Zookeeper cluster maintains a data structure directory accessed by the master-election service, and the data structure directory comprises:
a selector directory node representing the namespace of the master-election service;
an active directory node for storing information about the master machine room;
a standby directory node comprising a plurality of temporary directory nodes, each temporary directory node storing the second-server node information registered by a master-election service and information about the machine room where that service is located;
and a lock directory node for generating a distributed exclusive lock during master-room election.
3. The cloud platform disaster-recovery machine room system according to claim 2, characterized in that the master-election service monitoring the starting and stopping of the other second servers comprises:
the master-election service monitors the standby directory node, and when a node is added to or deleted from the standby directory node, the master-election service receives a notification carrying the added or deleted node information.
4. A cloud platform disaster-recovery machine room election method, characterized by comprising the following steps:
after a second server is started, the container cloud platform schedules a master-election service container on it; the master-election service in the container registers the second server's node information and machine-room information in the Zookeeper cluster, and contends to elect the machine room where it is located as the master room;
when any second server fails, the master-election service container on the failed server is closed, the corresponding master-election service is disconnected from the Zookeeper cluster, the failed server's node information in the Zookeeper cluster is deleted, and the master-election services on the other second servers are notified of the deleted node; after receiving the notification, those services re-elect the master room;
when a failed second server recovers, the container cloud platform reschedules a master-election service container onto it; the master-election service in the container registers the server's node information and machine-room information in the Zookeeper cluster, and the master-election service on the recovered server contends for the master room again;
and the master-room election result is updated into the Zookeeper cluster.
5. The cloud platform disaster-recovery machine room election method according to claim 4, characterized in that the master-election service electing the master room for the machine room where it is located comprises the steps of:
the master-election services of the second servers contend for the distributed exclusive lock in the Zookeeper cluster in the order in which the second servers started;
the master-election service that acquires the distributed exclusive lock checks whether the master-room information in the Zookeeper cluster is empty; if it is empty, the service writes the information of its own machine room as the master-room information of the Zookeeper cluster, and the master-room election succeeds; if it is not empty, the priorities of the master room recorded in the Zookeeper cluster and the machine room where the service is located are compared, the information of the higher-priority machine room is written as the master-room information of the Zookeeper cluster, and the higher-priority machine room wins the master-room election.
6. The cloud platform disaster-recovery machine room election method according to claim 4, characterized in that the master-election services on the other second servers re-electing the master room after receiving the deleted-node notification comprises the steps of:
the master-election services on the other second servers contend for the distributed exclusive lock;
the master-election service that acquires the distributed exclusive lock determines whether the deleted node belongs to the master room recorded in the Zookeeper cluster; if not, no master-room election is needed;
if the deleted node belongs to the master room, the master-election service queries whether any second-server node belonging to the master room remains in the Zookeeper cluster; if other second-server nodes remain under the master room, no election is needed; if no other second-server node remains under the master room, the master-election service replaces the master-room information in the Zookeeper cluster with its own machine room's information, and the master-room election succeeds.
7. The cloud platform disaster-recovery machine room election method according to claim 4, characterized in that the master-election service on the recovered second server re-electing the master room comprises the steps of:
the master-election service on the recovered second server acquires the distributed exclusive lock and compares the priorities of the master room in the Zookeeper cluster and the machine room where it is located; if its priority is not higher than that of the existing master room, no master-room election is needed;
if its priority is higher than that of the existing master room, the master-election service on the recovered second server replaces the master-room information in the Zookeeper cluster with its own machine room's information, and the master-room election succeeds.
8. The cloud platform disaster-recovery machine room election method according to claim 4, characterized in that the master-election service provides an interface for switching the master and standby machine rooms on demand.
9. A cloud platform disaster-recovery machine room election device, characterized in that the device comprises:
a first connection module for building a Kubernetes-based container cloud platform for each machine room participating in the election, the container cloud platform being connected with all second servers in the machine room; the container cloud platform is used to schedule a master-election service container onto a second server when that server starts, the container being loaded with the master-election service;
a second connection module for selecting one second server in each machine room and building a Zookeeper cluster spanning the machine rooms between the selected second servers and a first server, located outside the machine rooms, that is used for route forwarding;
a monitoring module for monitoring the current running state of each second server and its connection state with the Zookeeper cluster;
an automatic election module for automatically electing a master room among the machine rooms participating in the election according to the current running state of each second server, the connection state between the master-election service in the container on that server and the Zookeeper cluster, and the second-server node information and machine-room information registered by the master-election service in the Zookeeper cluster;
and an election result determining module for synchronously updating the election result, including the master and standby room information, into the Zookeeper cluster.
10. The cloud platform disaster-recovery machine room election device according to claim 9, characterized by further comprising a setting module for switching the master and standby machine rooms on demand.
11. A medium storing computer-executable instructions, characterized in that the instructions, when executed by a processing unit, implement the election method of any one of claims 4 to 8.
12. An electronic device, characterized by comprising a processing unit and a storage unit storing computer-executable instructions which, when executed by the processing unit, implement the election method of any one of claims 4 to 8.
CN202111590393.2A 2021-12-23 2021-12-23 Cloud platform disaster recovery machine room election system, method, device, medium and electronic equipment Active CN114257595B (en)

Publications (2)

Publication Number Publication Date
CN114257595A true CN114257595A (en) 2022-03-29
CN114257595B CN114257595B (en) 2024-05-17

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116775382A (en) * 2023-08-21 2023-09-19 江苏拓浦高科技有限公司 Main and standby server switching method and system based on ZooKeeper distributed coordination service
CN116980346A (en) * 2023-09-22 2023-10-31 新华三技术有限公司 Container management method and device based on cloud platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102932210A (en) * 2012-11-23 2013-02-13 北京搜狐新媒体信息技术有限公司 Method and system for monitoring node in PaaS cloud platform
US20190190778A1 (en) * 2017-12-20 2019-06-20 Hewlett Packard Enterprise Development Lp Distributed lifecycle management for cloud platforms
CN111338858A (en) * 2020-02-18 2020-06-26 中国工商银行股份有限公司 Disaster recovery method and device for double machine rooms
CN112181724A (en) * 2020-09-23 2021-01-05 支付宝(杭州)信息技术有限公司 Big data disaster tolerance method and device and electronic equipment

Also Published As

Publication number Publication date
CN114257595B (en) 2024-05-17

Similar Documents

Publication Publication Date Title
EP3694148B1 (en) Configuration modification method for storage cluster, storage cluster and computer system
US10979286B2 (en) Method, device and computer program product for managing distributed system
CN103744809B (en) Vehicle information management system double hot standby method based on VRRP
CN114257595B (en) Cloud platform disaster recovery machine room election system, method, device, medium and electronic equipment
EP2648114B1 (en) Method, system, token conreoller and memory database for implementing distribute-type main memory database system
CN109639794A (en) A kind of stateful cluster recovery method, apparatus, equipment and readable storage medium storing program for executing
JP2017534133A (en) Distributed storage and replication system and method
CN113132159B (en) Storage cluster node fault processing method, equipment and storage system
JPH08212095A (en) Client server control system
CN109560903B (en) Vehicle-mounted command communication system for complete disaster recovery
CN111966466A (en) Container management method, device and medium
CN116561096A (en) Database management method and system based on container platform
CN105959145B (en) A kind of method and system for the concurrent management server being applicable in high availability cluster
CN102045187A (en) Method and equipment for realizing HA (high-availability) system with checkpoints
CN114020279A (en) Application software distributed deployment method, system, terminal and storage medium
CN109495528A (en) Distributed lock ownership dispatching method and device
CN112887367B (en) Method, system and computer readable medium for realizing high availability of distributed cluster
CN109508261A (en) A kind of electric network data node standby method and standby system based on big data
CN114598593B (en) Message processing method, system, computing device and computer storage medium
CN114338670B (en) Edge cloud platform and network-connected traffic three-level cloud control platform with same
CN209134427U (en) A kind of vehicle-mounted command communications system of complete disaster tolerance
JP2000215177A (en) Client server system, server-client device and computer readable recording medium recording management program of server-client software
CN101453354A (en) High availability system based on ATCA architecture
CN111866041B (en) Service equipment selection method, cloud storage cluster updating method, device and storage medium
CN113254159B (en) Migration method and device of stateful service, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant