CN114584458A - Cluster disaster recovery management method, system, equipment and storage medium based on ETCD - Google Patents

Info

Publication number: CN114584458A
Authority: CN (China)
Prior art keywords: cluster, storage system, standby, etcd, switching
Legal status: Granted (status as listed by Google; not a legal conclusion)
Application number: CN202210209647.XA
Other languages: Chinese (zh)
Other versions: CN114584458B (en)
Inventors: 雷特, 白小龙
Current assignee: Ping An Technology Shenzhen Co Ltd
Original assignee: Ping An Technology Shenzhen Co Ltd

Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210209647.XA (patent CN114584458B)
Publication of CN114584458A
Application granted
Publication of CN114584458B
Legal status: Active
Anticipated expiration: date not listed

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H04L1/00 Arrangements for detecting or preventing errors in the information received
    • H04L1/22 Arrangements for detecting or preventing errors in the information received using redundant apparatus to increase reliability
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A10/00 TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
    • Y02A10/40 Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of cloud storage and discloses an ETCD-based cluster disaster recovery management method, system, device and storage medium. The method comprises the following steps. First, establish a standby storage system for a primary storage system, and set a key-value flag for disaster recovery switching on a standby ETCD cluster in the standby storage system. Second, monitor the state of the primary storage system. When the primary storage system is detected to be unavailable, connect to the standby ETCD cluster in the standby storage system and acquire its key value. Third, switch the bottom-layer cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster. Because the embodiments perform disaster recovery switching based on ETCD, system-level disaster recovery can be achieved.

Description

Cluster disaster recovery management method, system, equipment and storage medium based on ETCD
Technical Field
The invention relates to the technical field of cloud storage, and in particular to an ETCD-based cluster disaster recovery management method, system, device and storage medium.
Background
A disaster recovery system consists of two or more IT systems with identical functions, deployed at sites far away from one another. The systems monitor one another's health and can switch functions among themselves. When one system stops working because of an accident (such as a fire or an earthquake), the entire application can be switched to another system, so that it continues to operate normally.
At present, in some cloud architectures, the cloud storage resource management platform does not support disaster recovery. As a result, when an accident such as a fire, flood or earthquake occurs, service is interrupted for a long time and service continuity cannot be guaranteed.
Disclosure of Invention
The invention provides an ETCD-based cluster disaster recovery management method, system, device and storage medium. It aims to solve the technical problem that existing cloud storage resource management platforms do not support disaster recovery, which causes long service interruptions when an accident occurs.
To solve the above technical problem, the invention adopts the following technical solution:
an ETCD-based cluster disaster recovery management method, comprising the following steps:
establishing a standby storage system for a primary storage system, and setting a key-value flag for disaster recovery switching on a standby ETCD cluster in the standby storage system;
monitoring the state of the primary storage system and, when the primary storage system is detected to be unavailable, connecting to the standby ETCD cluster in the standby storage system and acquiring the key value of the standby ETCD cluster;
and switching the bottom-layer cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster.
In a further technical solution of the embodiments of the invention, switching the bottom-layer cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster comprises:
judging, according to the key value of the standby ETCD cluster, whether the bottom-layer cluster traffic can be switched to the standby storage system, and if so,
loading the configuration content of the standby ETCD cluster, reporting heartbeats to the standby ETCD cluster, and switching the bottom-layer cluster traffic to the standby storage system.
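The gated switching step can be sketched in Python as follows. This is a minimal, hypothetical model and not the patent's implementation. `FakeEtcd` is an in-memory stand-in for the standby ETCD cluster. Heartbeats are modeled as entries that expire after a TTL, which loosely mirrors an etcd lease. The key name `can_switch`, the `config/` prefix and the `b"true"` flag value are all assumptions.

```python
import time
from dataclasses import dataclass, field

# Hypothetical in-memory stand-in for the standby ETCD cluster.
@dataclass
class FakeEtcd:
    kv: dict = field(default_factory=dict)          # key -> bytes value
    heartbeats: dict = field(default_factory=dict)  # node -> expiry timestamp

    def get(self, key):
        return self.kv.get(key)

    def put_heartbeat(self, node, ttl_s, now):
        # The node counts as alive until now + ttl_s (TTL-lease-like behaviour).
        self.heartbeats[node] = now + ttl_s

    def alive(self, node, now):
        return self.heartbeats.get(node, 0) > now


SWITCH_FLAG = "can_switch"   # hypothetical key name for the switching flag
CONFIG_PREFIX = "config/"    # hypothetical prefix for mysql/rabbitmq settings


def switch_underlying_traffic(etcd, node, now=None, ttl_s=10):
    """Return the loaded configuration if switching is allowed, else None."""
    now = time.time() if now is None else now
    # 1. Judge from the key value whether bottom-layer traffic may switch.
    if etcd.get(SWITCH_FLAG) != b"true":
        return None
    # 2. Load the configuration content stored in the standby ETCD cluster.
    config = {k[len(CONFIG_PREFIX):]: v.decode()
              for k, v in etcd.kv.items() if k.startswith(CONFIG_PREFIX)}
    # 3. Report a heartbeat so the standby side sees this node as alive.
    etcd.put_heartbeat(node, ttl_s, now)
    # 4. The caller now points its bottom-layer traffic at `config`.
    return config
```

The flag is checked before anything else. If the flag is absent or false, the node leaves both the configuration and the heartbeat table untouched.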
In a further technical solution of the embodiments of the invention, switching the bottom-layer cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster further comprises:
after the switching of the bottom-layer cluster traffic is completed, automatically promoting the standby storage system to the primary storage system and switching the user traffic to the promoted primary storage system.
In a further technical solution of the embodiments of the invention, after the bottom-layer cluster traffic and the user traffic of the primary storage system have been switched to the standby storage system according to the key value of the standby ETCD cluster, the method further comprises:
monitoring the state of the primary storage system and, when it returns to an available state, automatically switching the bottom-layer cluster traffic and the user traffic back to the primary storage system.
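The switch-over and automatic switch-back can be viewed as a two-state decision driven by successive availability observations of the primary. This hypothetical Python sketch moves bottom-layer and user traffic together as one "active site". That is a simplification, since the text switches the two traffic types in sequence.

```python
from enum import Enum


class Site(Enum):
    PRIMARY = "primary"
    STANDBY = "standby"


def next_site(current, primary_available):
    """Decide where bottom-layer and user traffic should point next."""
    if current is Site.PRIMARY and not primary_available:
        return Site.STANDBY   # disaster recovery switch
    if current is Site.STANDBY and primary_available:
        return Site.PRIMARY   # automatic switch-back once the primary recovers
    return current


def replay(observations, start=Site.PRIMARY):
    """Apply next_site to a sequence of availability observations."""
    site, history = start, []
    for ok in observations:
        site = next_site(site, ok)
        history.append(site)
    return history
```

For example, the observations `[True, False, False, True]` yield primary, standby, standby, primary.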
In a further technical solution of the embodiments of the invention, the primary storage system and the standby storage system are each a CSSP system, and the standby CSSP system is independent of the primary CSSP system.
In a further technical solution of the embodiments of the invention, switching the bottom-layer cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster specifically comprises:
when the primary CSSP system is unavailable, executing the disaster recovery switching logic. This means loading the configuration content of the standby ETCD cluster (including the mysql and rabbitmq configurations), switching the bottom-layer cluster traffic of the primary CSSP system to the standby CSSP system, and having the standby CSSP system manage the bottom-layer storage resources. Here, mysql is a relational database management system and rabbitmq is a message queue service;
and monitoring the primary CSSP system with a cluster management module. When it is detected that the service port of the primary CSSP system is down and the bottom-layer cluster traffic has been switched, the standby CSSP system is automatically promoted to the primary CSSP system, and the user traffic is switched to the promoted primary CSSP system.
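The promotion condition combines two facts: the primary's service port is down, and the bottom-layer traffic has already moved. A minimal Python sketch follows. It uses a plain TCP connect probe as the port check, which is an assumption, since the text does not say how the cluster management module tests the port. `should_promote_standby` is a hypothetical helper.

```python
import socket


def service_port_up(host, port, timeout=1.0):
    """Plain TCP connect probe: True if something accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_promote_standby(primary_port_up, underlying_switched):
    # Promote only when BOTH conditions from the text hold, so a port blip
    # before the bottom-layer switch does not move user traffic early.
    return (not primary_port_up) and underlying_switched
```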
Another technical solution adopted by the embodiments of the invention is an ETCD-based cluster disaster recovery management system, comprising:
an ETCD marking module, configured to establish a standby storage system for a primary storage system and to set a key-value flag for disaster recovery switching on a standby ETCD cluster in the standby storage system;
a state monitoring module, configured to monitor the state of the primary storage system and, when the primary storage system is detected to be unavailable, to connect to the standby ETCD cluster in the standby storage system and acquire the key value of the standby ETCD cluster;
and a disaster recovery switching module, configured to switch the bottom-layer cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster.
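The three modules can be wired together as small collaborating classes. This is a hypothetical Python sketch. `StandbyEtcd` is an in-memory stand-in, `can_switch` is an assumed key name, and the availability probe is injected as a callable so that any health check can be plugged in.

```python
from dataclasses import dataclass, field


@dataclass
class StandbyEtcd:
    """Hypothetical in-memory stand-in for the standby ETCD cluster."""
    kv: dict = field(default_factory=dict)


class EtcdMarkingModule:
    def __init__(self, standby_etcd, flag_key="can_switch"):  # assumed key
        self.etcd, self.flag_key = standby_etcd, flag_key

    def mark(self, allowed=True):
        # Set the disaster-recovery switching flag on the standby cluster.
        self.etcd.kv[self.flag_key] = "true" if allowed else "false"


class StateMonitoringModule:
    def __init__(self, probe, standby_etcd, flag_key="can_switch"):
        self.probe, self.etcd, self.flag_key = probe, standby_etcd, flag_key

    def check(self):
        """Return the standby flag if the primary is down, else None."""
        if self.probe():
            return None
        return self.etcd.kv.get(self.flag_key)


class DisasterRecoverySwitchingModule:
    def __init__(self):
        self.active = "primary"

    def switch(self, flag_value):
        # Bottom-layer traffic first, then user traffic; collapsed to one step here.
        if flag_value == "true":
            self.active = "standby"
        return self.active
```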
In a further technical solution of the embodiments of the invention, the disaster recovery switching module switches the bottom-layer cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster, specifically by:
judging, according to the key value of the standby ETCD cluster, whether the bottom-layer cluster traffic can be switched to the standby storage system, and if so,
loading the configuration content of the standby ETCD cluster, reporting heartbeats to the standby ETCD cluster, and switching the bottom-layer cluster traffic to the standby storage system;
and, after the switching of the bottom-layer cluster traffic is completed, automatically promoting the standby storage system to the primary storage system and switching the user traffic to the promoted primary storage system.
Another technical solution adopted by the embodiments of the invention is an ETCD-based cluster disaster recovery management device, comprising a processor and a memory coupled to the processor, wherein
the memory stores program instructions for implementing the above ETCD-based cluster disaster recovery management method;
and the processor is configured to execute the program instructions stored in the memory to perform ETCD-based cluster disaster recovery management.
Another technical solution adopted by the embodiments of the invention is a storage medium storing program instructions executable by a processor to perform the above ETCD-based cluster disaster recovery management method.
The beneficial effects of the invention are as follows. The ETCD-based cluster disaster recovery management method, system, device and storage medium perform disaster recovery switching based on ETCD. When the primary storage system cannot provide services because of force majeure, traffic can be switched quickly from the primary storage system to the standby storage system. This keeps the cluster stable, leaves the services used by users unaffected, and effectively improves the high availability of the cloud service. In addition, because ETCD is strongly consistent middleware, fault misjudgment and erroneous switching caused by data inconsistency are avoided, and system-level disaster recovery is achieved.
Drawings
Fig. 1 is a schematic flowchart of an ETCD-based cluster disaster recovery management method according to a first embodiment of the present invention;
Fig. 2 is a schematic flowchart of an ETCD-based cluster disaster recovery management method according to a second embodiment of the present invention;
Fig. 3 is a schematic flowchart of an ETCD-based cluster disaster recovery management method according to a third embodiment of the present invention;
Fig. 4 is a schematic diagram of a CSSP system disaster recovery architecture according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an ETCD-based cluster disaster recovery management system according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an ETCD-based cluster disaster recovery management device according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first", "second" and "third" in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless explicitly specified otherwise. All directional indicators (such as up, down, left, right, front, rear, and so on) in the embodiments of the present invention are only used to explain the relative positional relationship between the components, the movement, and the like in a specific posture (as shown in the drawings); if the specific posture changes, the directional indicator changes accordingly. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Fig. 1 is a schematic flowchart of an ETCD-based cluster disaster recovery management method according to the first embodiment of the present invention. The method comprises the following steps:
S10: establishing a standby storage system for the primary storage system, and setting a key-value flag for disaster recovery switching on the standby ETCD cluster in the standby storage system;
In this step, a standby storage system is provided for the primary storage system, and no data synchronization is required between the two. Before a disaster recovery switch, the primary storage system provides users with management of the underlying KVM and storage clusters. Data traffic does not pass through the standby storage system, and the standby storage system provides no external services. ETCD is a distributed, highly available, strongly consistent key-value store. It serves as the configuration center and stores the configuration data of the whole system.
S11: monitoring the state of the primary storage system, and when the primary storage system is detected to be unavailable, connecting to the standby ETCD cluster in the standby storage system and acquiring the key value of the standby ETCD cluster;
S12: switching the traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster.
In this step, traffic switching comprises bottom-layer cluster traffic switching and user traffic switching. Bottom-layer cluster traffic switching proceeds as follows. Judge, according to the key value of the standby ETCD cluster, whether the bottom-layer cluster traffic can be switched to the standby storage system. If so, load the configuration content of the standby ETCD cluster, report heartbeats to the standby ETCD cluster, and switch the bottom-layer cluster traffic to the standby storage system, which then manages the bottom-layer storage resources. After the disaster recovery switch is completed, the standby storage system takes over from the primary storage system in providing users with management of the underlying KVM (Kernel-based Virtual Machine) and storage clusters. The standby storage system is completely independent of the primary storage system; only the database (DB) data is synchronized.
Taking the CSSP system as an example, user traffic switching mainly concerns res-mgr and its external service address. The entry point of the CSSP system is res-mgr, which serves external clients through a domain name. User traffic switching proceeds as follows. Before a disaster recovery switch, keepalived monitors the primary CSSP system. When keepalived detects that the service port of the primary CSSP system is down and the bottom-layer cluster traffic has been switched, the standby CSSP system is automatically promoted to the primary CSSP system, and user traffic is switched to the promoted primary CSSP system.
Based on the above, the ETCD-based cluster disaster recovery management method of the first embodiment performs disaster recovery switching based on ETCD. When the primary storage system cannot provide services because of force majeure, traffic can be switched quickly between the primary and standby storage systems. Because ETCD is strongly consistent middleware, fault misjudgment and erroneous switching caused by data inconsistency do not occur, and system-level disaster recovery can be achieved.
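The consistency argument can be illustrated with a compare-and-swap on a per-key modification revision. This loosely mirrors an etcd transaction that compares `mod_revision` before writing. The sketch below is a hypothetical in-memory model and not the etcd client API. It shows that when several detectors race to record a switch decision, exactly one succeeds, so no two parties act on conflicting views. The key `switch_owner` is an assumption.

```python
import threading


class ConsistentKV:
    """Hypothetical linearizable KV stand-in: one lock serialises all operations."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}      # key -> (value, mod_revision)
        self._revision = 0   # global revision counter

    def get(self, key):
        with self._lock:
            return self._data.get(key, (None, 0))

    def compare_and_put(self, key, expected_rev, value):
        """Write only if the key is still at expected_rev (0 means absent)."""
        with self._lock:
            _, rev = self._data.get(key, (None, 0))
            if rev != expected_rev:
                return False
            self._revision += 1
            self._data[key] = (value, self._revision)
            return True


def claim_switch(kv, who):
    """Try to become the single party that performs the switch."""
    value, rev = kv.get("switch_owner")   # hypothetical key
    if value is not None:
        return value == who
    return kv.compare_and_put("switch_owner", rev, who)
```

If two detectors both read an absent key, only the first `compare_and_put` sees revision 0. The second sees a newer revision and backs off.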
Refer to Fig. 2, which is a flowchart of an ETCD-based cluster disaster recovery management method according to the second embodiment of the present application. The method comprises the following steps:
S20: establishing a standby storage system for the primary storage system, and setting a key-value flag that permits disaster recovery switching on the standby ETCD cluster in the standby storage system;
In this step, a standby storage system is provided for the primary storage system, and no data synchronization is required between the two. Before a disaster recovery switch, the primary storage system provides users with management of the underlying KVM and storage clusters. Data traffic does not pass through the standby storage system, and the standby storage system provides no external services.
S21: monitoring the primary storage system in real time and judging whether its ETCD cluster is available; if so, continuing with S21; otherwise, proceeding to S22;
S22: connecting to the standby ETCD cluster in the standby storage system and, once connected, acquiring the key value of the standby ETCD cluster;
S23: judging, according to the key value of the standby ETCD cluster, whether the bottom-layer cluster traffic can be switched to the standby storage system; if so, executing S24; otherwise, proceeding to S27;
S24: loading the configuration content of the standby ETCD cluster, reporting heartbeats to the standby ETCD cluster, and switching the bottom-layer cluster traffic to the standby storage system;
In this step, the primary storage system is unavailable when the disaster recovery switch is performed. The bottom-layer cluster traffic is switched to the standby storage system, which then manages the bottom-layer storage resources. After the switch is completed, the standby storage system takes over from the primary storage system in providing users with management of the underlying KVM (Kernel-based Virtual Machine) and storage clusters. The standby storage system is completely independent of the primary storage system; only the database (DB) data is synchronized.
S25: after the switching of the bottom-layer cluster traffic is completed, automatically promoting the standby storage system to primary and switching the user traffic to it;
S26: judging whether the primary storage system has returned to an available state, and when it has, automatically switching the bottom-layer cluster traffic and the user traffic back to the primary storage system;
S27: End.
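The S21 to S27 flowchart can be traced in Python to make its branches explicit. This is a hypothetical sketch. The inputs (a finite list of S21 polls, the S23 decision, and whether the primary recovers during S26) are assumptions chosen so that the flow terminates. The function returns the visited steps and the site carrying traffic at the end.

```python
def run_flow(primary_checks, can_switch, primary_recovers):
    """Walk S21-S27 and return (steps visited, site carrying traffic at the end).

    primary_checks: booleans, one per S21 poll (True = primary ETCD available).
    """
    trace, active = [], "primary"
    for available in primary_checks:   # S21: poll the primary ETCD cluster
        trace.append("S21")
        if not available:
            break
    else:
        return trace, active           # no failure seen in this window
    trace.append("S22")                # connect to standby ETCD, read key value
    trace.append("S23")                # may bottom-layer traffic switch?
    if can_switch:
        trace.append("S24")            # load config, heartbeat, switch bottom layer
        trace.append("S25")            # promote standby, move user traffic
        active = "standby"
        trace.append("S26")            # wait for the primary to recover
        if primary_recovers:
            active = "primary"         # switch both traffic types back
    trace.append("S27")                # end
    return trace, active
```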
Based on the above, the ETCD-based cluster disaster recovery management method of the second embodiment performs disaster recovery switching based on ETCD. When the primary storage system cannot provide services because of force majeure, traffic can be switched quickly between the primary and standby storage systems. This keeps the cluster stable, leaves the services used by users unaffected, and effectively improves the high availability of the cloud service. In addition, because ETCD is strongly consistent middleware, fault misjudgment and erroneous switching caused by data inconsistency are avoided, and system-level disaster recovery is achieved.
Further, to explain the implementation of the embodiments more clearly, cluster disaster recovery management of a CSSP (Cloud Storage Service Platform) system is described below as a specific example. Refer to Fig. 3, which is a flowchart of an ETCD-based cluster disaster recovery management method according to the third embodiment of the present application. The method comprises the following steps:
S30: establishing a standby CSSP system for the primary CSSP system, and setting a key (CSSP_can_switch) value flag that permits disaster recovery switching on the standby ETCD cluster in the standby CSSP system;
In this step, Fig. 4 shows the CSSP system disaster recovery architecture according to an embodiment of the present application. The CSSP system comprises service components such as res-mgr, mysql, rabbitmq and etcd:
- res-mgr is the resource management service;
- mysql is an open-source relational database management system that stores persistent data on the management side;
- rabbitmq is an open-source message queue service used to communicate with the bottom-layer services;
- etcd serves as the configuration center and stores the configuration data of the whole system.
The PM (physical machine) is the data gateway layer of efs. It runs the efs-server process and the VMs (virtual machines) that provide NAS (Network Attached Storage) services, and a nasagent runs inside each VM. The efs-server provides interfaces, called by res-mgr, for managing the virtual machine resources and network resources on the physical machine. The nasagent provides interfaces for managing resources inside the VM, such as disks, network, compute and software (nfsd, samba, syslog). The efs-server calls the nasagent interfaces through gRPC to manage the resources inside the VM.
In the embodiments of the application, a standby CSSP system is provided for the primary CSSP system. Service components such as res-mgr, rabbitmq and etcd in both the primary and standby systems are deployed in cluster or high-availability mode. The res-mgr service is stateless, and rabbitmq and etcd do not need persistent data at runtime, so no data synchronization is required between the primary and standby CSSP systems. Before a disaster recovery switch, the primary CSSP system provides users with management of the underlying KVM and storage clusters through res-mgr, mysql, rabbitmq, etcd and other components. Data traffic does not pass through the standby CSSP system, and the standby CSSP system provides no external services.
S31: monitoring the primary CSSP system in real time and judging whether its ETCD cluster is available; if so, continuing with S31; otherwise, proceeding to S32;
In this step, when services such as efs-server and nasagent detect that the primary CSSP system is unavailable, the disaster recovery switching logic is executed to switch the bottom-layer cluster traffic of the primary CSSP system to the standby CSSP system. After the switch is completed, the standby CSSP system manages the bottom-layer storage resources.
S32: connecting to the standby ETCD cluster in the standby CSSP system and, once connected, acquiring the key value of the standby ETCD cluster;
S33: judging, according to the key value of the standby ETCD cluster, whether the bottom-layer cluster traffic can be switched to the standby CSSP system; if so, executing S34; otherwise, proceeding to S37;
S34: loading the configuration content of the standby ETCD cluster, reporting heartbeats to the standby ETCD cluster, and switching the bottom-layer cluster traffic to the standby CSSP system;
In this step, the loaded configuration content includes the mysql, rabbitmq and other related configurations. The CSSP disaster recovery architecture in Fig. 4 shows how the bottom layer depends on these services. The bottom-layer efs-server and rpcserver in the CSSP system provide VM management through the rabbitmq service. The heartbeat module provides heartbeat reporting and configuration modification through the etcd service. Disaster recovery switching therefore only needs to handle two things: the bottom-layer cluster traffic of services such as efs-server, rpcserver and heartbeat in the underlying KVM cluster, and the upper-layer user traffic. In the KVM cluster, the nasagent in each VM and the efs-server on each PM physical machine use services such as etcd and rabbitmq of the CSSP system. The etcd and rabbitmq of the primary and standby CSSP systems are independent of each other. Therefore, when the bottom-layer cluster traffic is switched, the related configuration and service access of nasagent and efs-server in the KVM cluster must also be switched to the standby CSSP system. After this switch is completed, the standby CSSP system takes over from the primary CSSP system in providing users with management of the underlying KVM and storage clusters. The standby CSSP system is completely independent of the primary CSSP system and only synchronizes DB data.
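Re-pointing the bottom-layer agents can be sketched as a pure transformation of each agent's configuration. This Python sketch makes several assumptions: the `AgentConfig` fields, the agent names, and the endpoint addresses (taken from documentation IP ranges) are all hypothetical, and the patent does not publish the real configuration format. The agent's own identity is preserved, and only the etcd and rabbitmq endpoints are swapped.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    node: str              # agent identity (PM host or VM); left unchanged
    etcd_endpoints: tuple  # etcd cluster used for heartbeats and config
    rabbitmq_url: str      # message queue used for VM management calls


# Hypothetical endpoints (documentation IP ranges); the patent gives none.
PRIMARY_ETCD = ("192.0.2.11:2379", "192.0.2.12:2379", "192.0.2.13:2379")
PRIMARY_MQ = "amqp://192.0.2.20:5672"
STANDBY_ETCD = ("198.51.100.11:2379", "198.51.100.12:2379", "198.51.100.13:2379")
STANDBY_MQ = "amqp://198.51.100.20:5672"


def repoint(agents, etcd_endpoints, rabbitmq_url):
    """Return a new name -> config map pointing every agent at the given services."""
    return {name: replace(cfg, etcd_endpoints=etcd_endpoints, rabbitmq_url=rabbitmq_url)
            for name, cfg in agents.items()}
```

Because the function returns new frozen objects instead of mutating the old ones, the primary-side configuration stays available. The switch-back in S36 can then reuse it.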
S35: after the switching of the bottom-layer cluster traffic is completed, using the cluster management module to automatically promote the standby CSSP system to the primary CSSP system, and switching the user traffic to the promoted CSSP system;
In this step, user traffic switching mainly concerns res-mgr and its external service address. The entry point of the CSSP system is res-mgr, which serves external clients through a domain name. In the embodiments of the application, the cluster management module uses keepalived to switch user traffic for the CSSP system. keepalived is cluster management software that ensures high availability. It health-checks cluster service nodes and, through VRRP (Virtual Router Redundancy Protocol), exposes a single service IP externally, while several service systems or nodes may sit behind it. Specifically, before a disaster recovery switch, keepalived monitors the primary CSSP system. When keepalived detects that the service port of the primary CSSP system is down and the bottom-layer cluster traffic has been switched, it automatically promotes the standby CSSP system to the primary CSSP system and switches the user traffic to the promoted CSSP system.
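The VRRP behaviour that keepalived relies on can be sketched as a simplified master election: among eligible nodes, the highest priority wins, and ties go to the numerically higher IP address. In this Python sketch, dropping unhealthy nodes outright is a simplification, because keepalived typically lowers a node's priority or marks it FAULT instead. The node names, addresses and priorities are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    ip: str
    priority: int   # VRRP priority; higher wins
    healthy: bool   # result of the service-port check


def elect_master(nodes):
    """Pick the node that should hold the virtual service IP, VRRP-style."""
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        return None
    # Highest priority wins; ties go to the numerically higher IP address.
    return max(candidates, key=lambda n: (n.priority, ipaddress.ip_address(n.ip)))
```

In a usage example, the primary CSSP node is given the higher priority. When its service port fails, the standby becomes master and the virtual IP, and with it the user traffic, moves to the standby side.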
S36: Determine whether the primary CSSP system has returned to an available state, and when it has, automatically switch the underlying cluster traffic and the user traffic back to the primary CSSP system;
S37: End.
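The failback check of step S36 can be sketched as a small controller that watches the primary's health and switches both kinds of traffic back once it has recovered. This is a hypothetical illustration: the class name, the requirement of several consecutive healthy probes (an anti-flapping guard) and the order of the two switch-back actions are assumptions that the steps above do not specify.

```python
from typing import List


class Failback:
    """Hypothetical controller for step S36 (switching back to the primary).

    It requires `required_ok` consecutive healthy probes of the primary
    before switching back; this anti-flapping guard is an assumption.
    """

    def __init__(self, required_ok: int = 3) -> None:
        self.required_ok = required_ok
        self.ok_streak = 0
        self.active = "standby"   # the standby is serving after the failover
        self.log: List[str] = []

    def observe(self, primary_healthy: bool) -> str:
        """Feed one health-check result; return the currently active site."""
        if self.active == "primary":
            return self.active    # already switched back
        self.ok_streak = self.ok_streak + 1 if primary_healthy else 0
        if self.ok_streak >= self.required_ok:
            # Assumed order, mirroring the forward switch: underlying cluster
            # traffic first, then user traffic.
            self.log.append("underlying traffic -> primary")
            self.log.append("user traffic -> primary")
            self.active = "primary"
        return self.active


if __name__ == "__main__":
    fb = Failback(required_ok=2)
    print([fb.observe(h) for h in [True, False, True, True, True]])
    # ['standby', 'standby', 'standby', 'primary', 'primary']
```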
Based on the above, the ETCD-based cluster disaster recovery management method of the third embodiment of the present application performs disaster recovery switching between the primary and standby CSSP systems based on ETCD. When the primary CSSP system cannot provide services due to force majeure, service can be quickly switched from the primary to the standby CSSP system, which keeps the cluster stable, leaves the services used by users unaffected, and effectively improves the high availability of the cloud service. Moreover, because ETCD is strongly consistent middleware, fault misjudgment and erroneous switching caused by data inconsistency are avoided, achieving system-level disaster tolerance.
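The benefit attributed to a strongly consistent store can be illustrated with a compare-and-swap: if every controller must present the revision it read before writing the switch decision, only one of several concurrent controllers can win, so the switch is not performed twice on the basis of stale data. This is a minimal sketch under stated assumptions; `VersionedKV` is a hypothetical in-memory stand-in rather than an etcd client, and the key name is invented for illustration.

```python
import threading
from typing import Dict, Tuple


class VersionedKV:
    """Hypothetical in-memory stand-in for a strongly consistent KV store.

    Each key carries its own revision counter (a simplification: etcd uses a
    store-wide revision). `compare_and_put` writes only if the caller read
    the latest revision, similar in spirit to an etcd transaction that
    compares a key's revision before writing.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Tuple[str, int]] = {}

    def get(self, key: str) -> Tuple[str, int]:
        with self._lock:
            return self._data.get(key, ("", 0))

    def compare_and_put(self, key: str, expected_rev: int, value: str) -> bool:
        with self._lock:
            _, rev = self._data.get(key, ("", 0))
            if rev != expected_rev:
                return False  # another controller changed the key first
            self._data[key] = (value, rev + 1)
            return True


ACTIVE_KEY = "/dr/active-site"  # hypothetical key name


def try_claim_switch(kv: VersionedKV) -> bool:
    """Return True only for the single caller that wins the switchover."""
    value, rev = kv.get(ACTIVE_KEY)
    if value == "standby":
        return False  # already switched; do not switch twice
    return kv.compare_and_put(ACTIVE_KEY, rev, "standby")


if __name__ == "__main__":
    kv = VersionedKV()
    _, seen = kv.get(ACTIVE_KEY)                            # both controllers read revision 0
    print(kv.compare_and_put(ACTIVE_KEY, seen, "standby"))  # True
    print(kv.compare_and_put(ACTIVE_KEY, seen, "standby"))  # False: stale read
```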
In an optional embodiment, the result of the ETCD-based cluster disaster recovery management method may also be uploaded to a blockchain.
Specifically, digest information is obtained from the result of the ETCD-based cluster disaster recovery management method by hashing that result, for example with the SHA-256 algorithm. Uploading the digest to the blockchain ensures its security as well as fairness and transparency for users. A user can download the digest from the blockchain to verify whether the result of the method has been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, each of which contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
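Computing and checking the digest can be sketched with Python's standard `hashlib`. The text does not specify what the "result" contains or how it is serialized, so the record fields and the canonical JSON encoding (sorted keys, compact separators) below are assumptions; the upload to a blockchain is not shown.

```python
import hashlib
import json
from typing import Any, Dict


def result_digest(result: Dict[str, Any]) -> str:
    """SHA-256 digest of a switchover result, as a 64-character hex string."""
    # Canonical JSON (sorted keys, no extra whitespace) so that the same
    # logical result always produces the same bytes, hence the same digest.
    payload = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_untampered(result: Dict[str, Any], published_digest: str) -> bool:
    # A user recomputes the digest and compares it with the copy downloaded
    # from the blockchain; a modified result yields a different digest
    # (barring a SHA-256 collision).
    return result_digest(result) == published_digest


if __name__ == "__main__":
    record = {"from": "primary", "to": "standby", "status": "switched"}  # hypothetical fields
    digest = result_digest(record)
    print(is_untampered(record, digest))                         # True
    print(is_untampered({**record, "to": "elsewhere"}, digest))  # False
```

Canonical serialization matters here: two JSON encodings of the same data with different key order would otherwise hash differently and be mistaken for tampering.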
Fig. 5 is a schematic structural diagram of an ETCD-based cluster disaster recovery management system according to an embodiment of the present invention. The ETCD-based cluster disaster recovery management system 40 of this embodiment comprises:
ETCD marking module 41: configured to establish a standby storage system for the primary storage system and to set, on the standby ETCD cluster in the standby storage system, a key-value flag for disaster recovery switching. In this embodiment of the application, a standby storage system is provided for the primary storage system, and no data synchronization is required between them. Before disaster recovery, the primary storage system provides users with management of the underlying KVM and storage clusters; data traffic does not pass through the standby storage system, and the standby storage system provides no external services.
State monitoring module 42: configured to monitor the state of the primary storage system and, when the primary storage system is detected to be unavailable, to connect to the standby ETCD cluster in the standby storage system and obtain the key value of the standby ETCD cluster;
Disaster recovery switching module 43: configured to switch the traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster. Traffic switching comprises underlying cluster traffic switching and user traffic switching. Underlying cluster traffic switching specifically comprises: determining from the key value whether the underlying cluster traffic can be switched to the standby storage system; if so, loading the configuration content of the standby ETCD cluster, reporting heartbeats to the standby ETCD cluster, and switching the underlying cluster traffic to the standby storage system, which then manages the underlying storage resources. After disaster recovery switching is complete, the standby storage system takes over from the primary storage system and provides users with management of the underlying KVM (Kernel-based Virtual Machine) cluster and the storage cluster. The standby storage system is completely independent of the primary storage system; the two only synchronize database (DB) data.
Taking the CSSP system as an example, user traffic switching mainly means switching the external service address of res-mgr; because res-mgr is the entry point of the CSSP system, it provides external services under a domain name. User traffic switching specifically comprises: before disaster recovery switching, Keepalived monitors the primary CSSP system; when Keepalived detects that the service port of the primary CSSP system is down and the underlying cluster traffic has been switched, it automatically promotes the standby CSSP system to primary and switches user traffic to the standby CSSP system.
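Modules 41 to 43 can be sketched together as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a plain dictionary stands in for the standby ETCD key space, and the flag key, the "armed" value, the consecutive-failure threshold, the host names and the configuration keys are all hypothetical. `tcp_probe` uses the standard `socket.create_connection` to check a service port; user traffic switching (the Keepalived part) is not modeled.

```python
import socket
from dataclasses import dataclass, field
from typing import Callable, Dict, List

FLAG_KEY = "/cssp/dr/switch-flag"  # hypothetical key name
ARMED = "armed"                    # hypothetical "switching allowed" value


def mark_standby(standby_kv: Dict[str, str]) -> None:
    """Module 41 (sketch): set the disaster recovery flag on the standby."""
    standby_kv[FLAG_KEY] = ARMED


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class StateMonitor:
    """Module 42 (sketch): declare the primary unavailable after `threshold`
    consecutive failed probes (the threshold is an assumption)."""

    def __init__(self, probe: Callable[[], bool], threshold: int = 3) -> None:
        self.probe = probe
        self.threshold = threshold
        self.failures = 0

    def check(self) -> bool:
        """Run one probe; return True once the primary is deemed unavailable."""
        self.failures = 0 if self.probe() else self.failures + 1
        return self.failures >= self.threshold


@dataclass
class UnderlyingCluster:
    """Hypothetical client-side settings of the underlying cluster."""
    etcd_endpoint: str
    config: Dict[str, str] = field(default_factory=dict)


def switch_underlying_traffic(standby_kv: Dict[str, str],
                              standby_config: Dict[str, str],
                              standby_etcd: str,
                              cluster: UnderlyingCluster) -> List[str]:
    """Module 43 (sketch): switch only if the standby's key value allows it."""
    if standby_kv.get(FLAG_KEY) != ARMED:
        return []  # unmarked standby: refuse to switch, change nothing
    cluster.config = dict(standby_config)  # load standby config (MySQL, RabbitMQ, ...)
    cluster.etcd_endpoint = standby_etcd   # report heartbeats to the standby ETCD
    return ["loaded standby configuration",
            "heartbeat target -> standby ETCD",
            "underlying traffic -> standby"]


if __name__ == "__main__":
    standby_kv: Dict[str, str] = {}  # a dict stands in for the standby ETCD key space
    mark_standby(standby_kv)
    probes = iter([False, False, False])
    monitor = StateMonitor(lambda: next(probes), threshold=3)
    cluster = UnderlyingCluster(etcd_endpoint="http://etcd-primary:2379")
    while not monitor.check():
        pass  # keep probing until the primary is deemed unavailable
    print(switch_underlying_traffic(standby_kv, {"mysql.host": "mysql-standby"},
                                    "http://etcd-standby:2379", cluster))
    print(cluster.etcd_endpoint)  # http://etcd-standby:2379
```

In a deployment the probe would be something like `lambda: tcp_probe(host, port)` against the primary's service port; the injected callable keeps the monitor testable without a network.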
Based on the above, the ETCD-based cluster disaster recovery management system of this embodiment of the invention performs disaster recovery switching based on ETCD. When the primary storage system cannot provide services due to force majeure, service can be quickly switched from the primary to the standby storage system, which keeps the cluster stable, leaves the services used by users unaffected, and effectively improves the high availability of cloud services. Moreover, because ETCD is strongly consistent middleware, fault misjudgment and erroneous switching caused by data inconsistency are avoided, achieving system-level disaster tolerance.
Fig. 6 is a schematic structural diagram of an ETCD-based cluster disaster recovery management device according to an embodiment of the present invention. The device 50 comprises a processor 51 and a memory 52 coupled to the processor 51.
The memory 52 stores program instructions for implementing the ETCD-based cluster disaster recovery management method described above.
The processor 51 is configured to execute program instructions stored by the memory 52 to perform ETCD-based cluster disaster recovery management operations.
The processor 51 may also be called a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capability. It may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor.
The device of this embodiment of the application performs disaster recovery switching based on ETCD. When the primary storage system cannot provide services due to force majeure, service can be quickly switched from the primary to the standby storage system, which keeps the cluster stable, leaves the services used by users unaffected, and effectively improves the high availability of cloud services. Moreover, because ETCD is strongly consistent middleware, fault misjudgment and erroneous switching caused by data inconsistency are avoided, achieving system-level disaster tolerance.
Fig. 7 is a schematic structural diagram of a storage medium according to an embodiment of the invention. The storage medium stores a program file 61 capable of implementing all of the methods described above. The program file 61 may be stored in the storage medium as a software product and includes instructions that cause a computer device (such as a personal computer, a server or a network device) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc, or terminal devices such as a computer, a server, a mobile phone or a tablet.
With the storage medium of this embodiment of the application, disaster recovery switching is performed based on ETCD. When the primary storage system cannot provide services due to force majeure, service can be quickly switched from the primary to the standby storage system, which keeps the cluster stable, leaves the services used by users unaffected, and effectively improves the high availability of cloud services. Moreover, because ETCD is strongly consistent middleware, fault misjudgment and erroneous switching caused by data inconsistency are avoided, achieving system-level disaster tolerance.
In the embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into units is only a logical division, and an actual implementation may divide them differently; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or in other forms.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An ETCD-based cluster disaster recovery management method, characterized by comprising the following steps:
establishing a standby storage system for a primary storage system, and setting, on a standby ETCD cluster in the standby storage system, a key-value flag for disaster recovery switching;
monitoring the state of the primary storage system, and when the primary storage system is detected to be unavailable, connecting to the standby ETCD cluster in the standby storage system and obtaining the key value of the standby ETCD cluster;
and switching the underlying cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster.
2. The ETCD-based cluster disaster recovery management method according to claim 1, wherein switching the underlying cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster comprises:
determining, according to the key value of the standby ETCD cluster, whether the underlying cluster traffic can be switched to the standby storage system, and if so,
loading the configuration content of the standby ETCD cluster, reporting heartbeats to the standby ETCD cluster, and switching the underlying cluster traffic to the standby storage system.
3. The ETCD-based cluster disaster recovery management method according to claim 2, wherein switching the underlying cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster further comprises:
after the underlying cluster traffic has been switched, automatically promoting the standby storage system to the primary storage system, and switching the user traffic to the promoted primary storage system.
4. The ETCD-based cluster disaster recovery management method according to claim 3, wherein switching the underlying cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster further comprises:
monitoring the state of the original primary storage system, and automatically switching the underlying cluster traffic and the user traffic back to it when it returns to an available state.
5. The ETCD-based cluster disaster recovery management method according to any one of claims 1 to 4, wherein the primary storage system and the standby storage system are CSSP systems, and the standby CSSP system is independent of the primary CSSP system.
6. The ETCD-based cluster disaster recovery management method according to claim 5, wherein switching the underlying cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value specifically comprises:
when the primary CSSP system is unavailable, executing the disaster recovery switching logic and loading the configuration content of the standby ETCD, the configuration content comprising configurations for the MySQL relational database management system and the RabbitMQ message queue service; switching the underlying cluster traffic of the primary CSSP system to the standby CSSP system, the standby CSSP system managing the underlying storage resources;
and monitoring the primary CSSP system with a cluster management module, and when it is detected that the service port of the primary CSSP system is disconnected and the underlying cluster traffic has been switched, automatically promoting the standby storage system to the primary storage system and switching the user traffic to the promoted primary CSSP system.
7. An ETCD-based cluster disaster recovery management system, characterized by comprising:
an ETCD marking module, configured to establish a standby storage system for a primary storage system and to set, on a standby ETCD cluster in the standby storage system, a key-value flag for disaster recovery switching;
a state monitoring module, configured to monitor the state of the primary storage system and, when the primary storage system is detected to be unavailable, to connect to the standby ETCD cluster in the standby storage system and obtain the key value of the standby ETCD cluster;
a disaster recovery switching module, configured to switch the underlying cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster.
8. The ETCD-based cluster disaster recovery management system according to claim 7, wherein the disaster recovery switching module switches the underlying cluster traffic and the user traffic of the primary storage system to the standby storage system according to the key value of the standby ETCD cluster specifically by:
determining, according to the key value of the standby ETCD cluster, whether the underlying cluster traffic can be switched to the standby storage system, and if so,
loading the configuration content of the standby ETCD cluster, reporting heartbeats to the standby ETCD cluster, and switching the underlying cluster traffic to the standby storage system;
and after the underlying cluster traffic has been switched, automatically promoting the standby storage system to the primary storage system and switching the user traffic to the promoted primary storage system.
9. An ETCD-based cluster disaster recovery management device, the device comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions for implementing the ETCD-based cluster disaster recovery management method according to any one of claims 1 to 6;
the processor is configured to execute the program instructions stored by the memory to execute the ETCD-based cluster disaster recovery management method.
10. A storage medium storing program instructions executable by a processor to perform the ETCD-based cluster disaster recovery management method according to any one of claims 1 to 6.
CN202210209647.XA 2022-03-03 2022-03-03 Cluster disaster recovery management method, system, equipment and storage medium based on ETCD Active CN114584458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210209647.XA CN114584458B (en) 2022-03-03 2022-03-03 Cluster disaster recovery management method, system, equipment and storage medium based on ETCD

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210209647.XA CN114584458B (en) 2022-03-03 2022-03-03 Cluster disaster recovery management method, system, equipment and storage medium based on ETCD

Publications (2)

Publication Number Publication Date
CN114584458A true CN114584458A (en) 2022-06-03
CN114584458B CN114584458B (en) 2023-06-06

Family

ID=81774231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210209647.XA Active CN114584458B (en) 2022-03-03 2022-03-03 Cluster disaster recovery management method, system, equipment and storage medium based on ETCD

Country Status (1)

Country Link
CN (1) CN114584458B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115208743A (en) * 2022-07-18 2022-10-18 中国工商银行股份有限公司 ETCD-based cross-site cluster deployment method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103176909A (en) * 2011-12-26 2013-06-26 中国银联股份有限公司 Service processing method and service processing system
US20190028538A1 (en) * 2016-03-25 2019-01-24 Alibaba Group Holding Limited Method, apparatus, and system for controlling service traffic between data centers
CN110046064A (en) * 2018-01-15 2019-07-23 厦门靠谱云股份有限公司 A kind of Cloud Server disaster tolerance implementation method based on failure drift
CN110086726A (en) * 2019-04-22 2019-08-02 航天云网科技发展有限责任公司 A method of automatically switching Kubernetes host node
CN110740167A (en) * 2019-09-20 2020-01-31 北京浪潮数据技术有限公司 distributed storage system and node monitoring method thereof
CN111176888A (en) * 2018-11-13 2020-05-19 浙江宇视科技有限公司 Cloud storage disaster recovery method, device and system
CN111371599A (en) * 2020-02-26 2020-07-03 山东汇贸电子口岸有限公司 Cluster disaster recovery management system based on ETCD
CN112367214A (en) * 2020-10-12 2021-02-12 成都精灵云科技有限公司 Method for rapidly detecting and switching main node based on etcd
CN113590386A (en) * 2021-07-30 2021-11-02 深圳前海微众银行股份有限公司 Disaster recovery method, system, terminal device and computer storage medium for data
CN113949691A (en) * 2021-10-15 2022-01-18 湖南麒麟信安科技股份有限公司 ETCD-based virtual network address high-availability implementation method and system

Also Published As

Publication number Publication date
CN114584458B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN116302719B (en) System and method for enabling high availability managed failover services
JP6602369B2 (en) Secure data access after memory failure
US20140365537A1 (en) File storage system, apparatus, and file access method
CN107846316B (en) Cloud mobile phone management system and exception handling method thereof
CN104408071A (en) Distributive database high-availability method and system based on cluster manager
CN108616574B (en) Management data storage method, device and storage medium
CN108572976A (en) Data reconstruction method, relevant device and system in a kind of distributed data base
CN113489691B (en) Network access method, network access device, computer readable medium and electronic equipment
CN108614728A (en) Virtual machine service providing method, device, equipment and computer readable storage medium
CN112395047A (en) Virtual machine fault evacuation method, system and computer readable medium
CN109565447A (en) Network function processing method and relevant device
CN114363144B (en) Fault information association reporting method and related equipment for distributed system
US20090044186A1 (en) System and method for implementation of java ais api
CN110895469A (en) Method and device for upgrading dual-computer hot standby system, electronic equipment and storage medium
CN113010313A (en) Load balancing method and device, electronic equipment and computer storage medium
CN110990190A (en) Distributed file lock fault processing method, system, terminal and storage medium
CN112749178A (en) Method for ensuring data consistency and related equipment
CN114584458A (en) Cluster disaster recovery management method, system, equipment and storage medium based on ETCD
CN113347257A (en) Communication method, communication device, server and storage medium
CN112291082A (en) Computer room disaster recovery processing method, terminal and storage medium
CN111147274A (en) System and method for creating a highly available arbitration set for a cluster solution
CN109510730A (en) Distributed system and its monitoring method, device, electronic equipment and storage medium
CN116126457A (en) Container migration method and server cluster
CN112350856B (en) Distributed service sign-off method and equipment
CN114070716A (en) Application management system, application management method, and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant