CN113434340B - Server and cache cluster fault rapid recovery method - Google Patents


Info

Publication number
CN113434340B
CN113434340B (application CN202110729463.1A)
Authority
CN
China
Prior art keywords
cache cluster
cluster
cloud storage
storage volume
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110729463.1A
Other languages
Chinese (zh)
Other versions
CN113434340A (en)
Inventor
胡新静
刘先攀
胡晓峰
张纪宽
矫恒浩
王宝云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Juhaokan Technology Co Ltd
Original Assignee
Juhaokan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Juhaokan Technology Co Ltd
Priority to CN202110729463.1A
Publication of CN113434340A
Application granted
Publication of CN113434340B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44505 Configuring for program initiating, e.g. using registry, configuration files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/448 Execution paradigms, e.g. implementations of programming paradigms
    • G06F9/4482 Procedural

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a server and a method for rapid fault recovery of a cache cluster. The server comprises a first cache cluster, a second cache cluster, a cluster fault management module, and a cold-standby transfer node. The cluster fault management module is configured to: monitor whether a preset fault occurs in the first cache cluster; when the preset fault occurs, unbind the cloud storage volumes from the cold-standby transfer node and bind them to the second cache cluster, wherein the cold-standby transfer node is provided with a plurality of directories in one-to-one correspondence with the cloud storage volumes, the number of cloud storage volumes is the same as the number of instances of the first cache cluster, each cloud storage volume is configured to store backup data from the first cache cluster while bound to the cold-standby transfer node, and the second cache cluster is provided with instances corresponding to the first cache cluster; send a start instruction to the second cache cluster; and set the cache cluster for external service to the second cache cluster. The application solves the technical problem that low backup cost and fast fault recovery cannot be achieved at the same time.

Description

Server and cache cluster fault rapid recovery method
Technical Field
The present application relates to the technical field of servers, and in particular, to a server and a method for rapid fault recovery of a cache cluster.
Background
Data backup is an important way to protect data: data in a primary server is backed up to a backup server, and once the data in the primary server is lost in an accident, it can be recovered from the backup server. To ensure the data security of the backup server, the backup server can be placed in a machine room different from that of the primary server, preventing both from failing at the same time.
In the related art, cache data in a server is stored in a cache cluster, and two modes can be adopted to back up the cache data. In the first mode, the data to be backed up is periodically backed up to a unified backup server; when the primary server fails, the backed-up data is copied to a cluster node newly started by the system, and that node provides external service. In the second mode, the data to be backed up is backed up in real time to cluster nodes in a remote machine room; when the primary server fails, those cluster nodes are started to provide external service.
However, both backup modes have drawbacks. With the first mode, the backup files cannot be copied quickly to the newly started cluster node when a fault occurs, so the goal of fast service recovery is hard to meet; with the second mode, the cluster nodes of the remote machine room must be kept running, which increases the backup cost.
Disclosure of Invention
To solve the technical problem that low backup cost and fast fault recovery cannot be achieved at the same time, the present application provides a server and a method for rapid fault recovery of a cache cluster.
In a first aspect, the present application provides a server comprising a first cache cluster, a second cache cluster, a cluster fault management module, and a cold-standby transfer node, where the cluster fault management module is configured to:
monitor whether a preset fault occurs in the first cache cluster;
when it is monitored that the preset fault occurs in the first cache cluster, unbind cloud storage volumes from the cold-standby transfer node and bind them to the second cache cluster, wherein the cold-standby transfer node is provided with a plurality of directories in one-to-one correspondence with the cloud storage volumes, the number of cloud storage volumes is the same as the number of instances of the first cache cluster, each cloud storage volume is configured to store backup data from the first cache cluster while bound to the cold-standby transfer node, and the second cache cluster is provided with instances corresponding to the first cache cluster;
send a start instruction to the second cache cluster so that the second cache cluster starts its service processes;
and set the cache cluster for external service to the second cache cluster.
In some embodiments, unbinding the cloud storage volumes from the cold-standby transfer node and binding them to the second cache cluster includes:
obtaining, by calling the cloud storage unbinding function interface, that the device currently bound to a cloud storage volume is the cold-standby transfer node;
unbinding the cloud storage volume from the cold-standby transfer node through a device unbinding command;
obtaining, by calling the cloud storage binding function interface, the devices to which the cloud storage volume can be bound, including the cold-standby transfer node and the devices of the second cache cluster;
and selecting a device of the second cache cluster, and binding the cloud storage volume to it through a device binding command.
In some embodiments, sending a start instruction to the second cache cluster to cause the second cache cluster to start a service process includes:
and calling a starting instance interface, and designating an instance id as the instance id of the second cache cluster.
In some embodiments, the first cache cluster and the second cache cluster are both codis clusters, and setting the cache cluster for external service to the second cache cluster includes:
configuring the load balancing service to the codis-proxy of the second codis cluster;
modifying the service connection address to the load balancing address of the second codis cluster;
and, if the service supports automatic connection reestablishment, reestablishing the connection without restarting the service; otherwise, restarting the service to rebuild the codis connection pool.
In some embodiments, the server is further configured to:
the method comprises the steps that before whether a preset fault occurs in a first cache cluster is monitored, each cloud storage volume is bound with a corresponding instance of a second cache cluster in advance, a starting data directory of the instance of each second cache cluster is configured on the corresponding cloud storage volume, cluster application is installed, and a starting script of the cluster application is configured to be started up and started up;
and closing the equipment of the second cache cluster, and unbinding the cloud storage volume from the bound instance of the second cache cluster.
In a second aspect, the present application provides a method for rapid fault recovery of a cache cluster, the method comprising:
monitoring whether a preset fault occurs in a first cache cluster;
when it is monitored that the preset fault occurs in the first cache cluster, unbinding cloud storage volumes from a cold-standby transfer node and binding them to a second cache cluster, wherein the cold-standby transfer node is provided with a plurality of directories in one-to-one correspondence with the cloud storage volumes, the number of cloud storage volumes is the same as the number of instances of the first cache cluster, each cloud storage volume is configured to store backup data from the first cache cluster while bound to the cold-standby transfer node, and the second cache cluster is provided with instances corresponding to the first cache cluster;
sending a start instruction to the second cache cluster so that the second cache cluster starts its service processes;
and setting the cache cluster for external service to the second cache cluster.
The server and the cache cluster fault rapid recovery method provided by the application have the following beneficial effects:
A cold-standby transfer node is arranged in the cold-standby machine room, a plurality of cloud storage volumes are arranged on the cold-standby transfer node, and the cloud storage volumes correspond one-to-one to the instances in the cold-standby machine room. While the first cache cluster operates normally, the cloud storage volumes are bound to the cold-standby transfer node and the rdb files of the instances of the first cache cluster are backed up into the corresponding cloud storage volumes, so the second cache cluster can remain shut down, saving resources and reducing cost. When the first cache cluster fails, the cloud storage volumes can be bound to the instances of the cold-standby machine room, those instances are then started, and external access is served through them. The method and the device thus make the instances in the cold-standby machine room respond quickly and improve the availability of the backup cluster.
Drawings
To explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below. It will be apparent to those skilled in the art that other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram illustrating an operation scenario between a display device and a control apparatus according to some embodiments;
FIG. 2 is a flow diagram illustrating environment initialization according to some embodiments;
FIG. 3 is a timing diagram illustrating backup during normal operation of the primary codis cluster according to some embodiments;
FIG. 4 is a timing diagram illustrating switching to the standby cluster upon failure of the primary codis cluster according to some embodiments.
Detailed Description
To make the purpose and embodiments of the present application clearer, the exemplary embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described exemplary embodiments are only some of the embodiments of the present application, not all of them.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to an embodiment. As shown in fig. 1, the user may operate the display device 200 through the smart device 300 or the control apparatus 100.
In some embodiments, the control apparatus 100 may be a remote controller. Communication between the remote controller and the display device includes infrared protocol communication, Bluetooth protocol communication, or other short-range communication methods, and the display device 200 is controlled wirelessly or by wire. The user may input user commands through keys on the remote controller, voice input, control panel input, and the like to control the display device 200.
In some embodiments, the smart device 300 (e.g., mobile terminal, tablet, computer, laptop, etc.) may also be used to control the display device 200. For example, the display device 200 is controlled using an application program running on the smart device.
In some embodiments, the display device 200 may also be controlled in manners other than through the control apparatus 100 and the smart device 300. For example, a user's voice command may be received directly by a module configured inside the display device 200, or by a voice control device provided outside the display device 200.
In some embodiments, the display device 200 is also in data communication with a server 400. The display device 200 may be communicatively connected through a local area network (LAN), a wireless local area network (WLAN), or other networks. The server 400 may provide various content and interactions to the display device 200, and may be one cluster or a plurality of clusters, including one or more types of servers.
In some embodiments, the server 400 may include multiple types of servers that together form a system for external services, and the system may include a cache cluster. Compared with providing services directly through a database, a cache cluster has stronger concurrency capability and can bear larger service traffic.
In some embodiments, the cache cluster may be a codis cluster. A codis cluster is a distributed Redis solution: one codis cluster can comprise a plurality of machines, each provided with a Redis master-slave instance, and the Redis master-slave instances respond to access requests from outside the cluster and provide access services.
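As a purely illustrative sketch that is not part of the original disclosure: because a codis-proxy is compatible with the ordinary Redis protocol, an external application can reach the cluster with a standard Redis client library. The host name, port, and key below are assumptions.

    # Hedged example: accessing a codis cluster through its proxy with redis-py.
    # "codis-proxy.example" and port 19000 are assumed values, not from the patent.
    import redis

    client = redis.Redis(host="codis-proxy.example", port=19000, decode_responses=True)
    client.set("session:1001", "cached-value")  # the proxy routes the key to a master instance
    print(client.get("session:1001"))           # -> "cached-value"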
In some embodiments, the server 400 may include a cluster fault management module and a plurality of cache clusters, such as a first cache cluster and a second cache cluster, so that when one cache cluster fails, the cache cluster for external service can be switched to another through the cluster fault management module, improving service quality. The first cache cluster may be a first codis cluster and the second cache cluster a second codis cluster; of course, both may also be other kinds of cache clusters, such as zookeeper clusters, as long as the system can load the backup files of the cache cluster into memory and successfully start external service.
In some embodiments, to improve the switching efficiency, the cluster backup environment may be initialized in the system in advance.
Referring to FIG. 2, which shows a flow diagram of environment initialization according to some embodiments, the initialization method may include steps S101-S104.
Step S101: build, in a remote machine room, a codis cold-standby cluster equivalent to that of the main machine room.
In some embodiments, the codis cluster of the main machine room is the primary codis cluster; it may be a codis cluster of the cloud platform and provides services for external applications during normal operation. A remote machine room serves as the standby machine room, and a codis cluster is built in it; this cluster is the standby codis cluster, used to provide services when the codis cluster of the main machine room fails. Since the main machine room and the standby machine room are in different places, the probability of simultaneous failure is reduced. For ease of distinction, the primary codis cluster of the main machine room may also be called the first codis cluster, and the standby codis cluster built in the standby machine room may also be called the second codis cluster.
In some embodiments, the first codis cluster may be provided with a plurality of codis servers (master-slave instances) for serving external applications. For example, the first codis cluster may be provided with four sets of redis master-slave instances, namely A, B, C, and D; the description below uses these four instances.
In some embodiments, the second codis cluster needs to be equivalent to the first codis cluster, that is, it needs to be provided with the same number of master-slave instances as the first codis cluster, so that it can realize the same service functions and replace the first codis cluster in providing the same services to external applications; the external applications then hardly perceive the switch of codis clusters, reducing its influence on them. When the master-slave instances of the first codis cluster are A, B, C, and D, the master-slave instances of the second codis cluster may be denoted A', B', C', and D', where A' corresponds to A, B' to B, C' to C, and D' to D.
Step S102: set up cloud storage volumes in the remote machine room.
In some embodiments, the cloud storage volumes may be disposed on a server of the standby machine room, independent of the servers corresponding to the redis instances of the standby machine room.
In some embodiments, the cloud storage volume binding function of the cloud platform may be invoked to bind the four master redis instances A', B', C', and D' one-to-one with four cloud storage volumes, where the volume mounted by A' may be called EBS volume 1, the volume mounted by B' EBS volume 2, the volume mounted by C' EBS volume 3, and the volume mounted by D' EBS volume 4.
In some embodiments, the cloud storage volume binding function may be implemented by calling the binding function interface of the cloud storage volume and then calling the device mount command; after binding, the bound cloud storage volume is visible on the servers where the A', B', C', and D' instances are located.
In some embodiments, after the cloud storage volumes are bound to the master instances A', B', C', and D', each cloud storage volume may be configured as the startup data directory in the redis configuration file redis.conf. After configuration, if a redis master-slave instance starts data backup, the backed-up data is written into the cloud storage volume.
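Purely as an illustration (the file locations and mount points below are assumptions, not details from the patent): this configuration step amounts to pointing each instance's dir directive at the mount point of its cloud storage volume, since redis writes its RDB dump into the directory named by dir.

    # Hedged sketch: set the "dir" directive in each standby instance's redis.conf
    # to the mount point of its cloud storage volume. All paths are hypothetical.
    from pathlib import Path

    CONF_TO_DATA_DIR = {
        "/etc/redis/a_prime/redis.conf": "/data_a",  # EBS volume 1 mount point (assumed)
        "/etc/redis/b_prime/redis.conf": "/data_b",  # EBS volume 2
        "/etc/redis/c_prime/redis.conf": "/data_c",  # EBS volume 3
        "/etc/redis/d_prime/redis.conf": "/data_d",  # EBS volume 4
    }

    def set_data_dir(conf_path: str, data_dir: str) -> None:
        conf = Path(conf_path)
        # drop any existing "dir" directive, then append the new one
        kept = [line for line in conf.read_text().splitlines() if not line.startswith("dir ")]
        kept.append(f"dir {data_dir}")
        conf.write_text("\n".join(kept) + "\n")

    for conf_path, data_dir in CONF_TO_DATA_DIR.items():
        set_data_dir(conf_path, data_dir)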
In some embodiments, after the startup data directories of A', B', C', and D' are configured on the respective cloud storage volumes, the cluster application may be installed on the second codis cluster. The cluster application refers to the applications of the codis cluster other than the redis master-slave instances, and comprises the codis-proxy component and the codis-config component. Both components are preferably deployed on an independent server, separated from the redis master-slave instances; the number of codis-proxy components is generally 3 and at least 2, and can be scaled horizontally according to the size of the traffic, while generally only one codis-config component is deployed.
In some embodiments, the codis-proxy component runs the redis proxy service to which clients connect.
In some embodiments, a boot-time self-starting service process of the cluster application may be configured for the primary redis instances of the second codis cluster: set a script auto_start_codis-proxy.sh for starting the codis-proxy component and a script auto_start_codis-config.sh for starting the codis-config component, and configure them to start automatically at boot.
For example, under the linux operating system, the scripts auto_start_codis-proxy.sh and auto_start_codis-config.sh may be configured to start at boot in the following ways: 1. the chkconfig mode: place the auto_start_codis-proxy.sh and auto_start_codis-config.sh scripts under /etc/init.d/, set their permissions to executable, and add boot self-start items with the commands chkconfig --add auto_start_codis-config.sh, chkconfig auto_start_codis-config.sh on, chkconfig --add auto_start_codis-proxy.sh, and chkconfig auto_start_codis-proxy.sh on; 2. manually configuring /etc/rc.local: at the end of the /etc/rc.local file, add an sh /path/to/auto_start_codis-config.sh command and an sh /path/to/auto_start_codis-proxy.sh command; 3. the /etc/rc0.d-/etc/rc6.d mode, the specifics of which are not described in detail here.
After the configuration of the above embodiments is completed, the server shutdown interface of the cloud platform may be called to set all servers hosting redis instances of the second codis cluster to the shutdown state. In the shutdown state, for servers billed by usage, the operator no longer pays for resources such as CPU and memory, achieving the goal of saving cost.
In some embodiments, after the servers hosting the redis instances are shut down, the cloud storage volume unbinding function of the cloud platform can be invoked to unbind the four master redis instances A', B', C', and D' one-to-one from the four cloud storage volumes, so that the cloud storage volumes are no longer associated with those servers.
Step S103: create a transfer node.
In some embodiments, one server of the remote machine room may be set as the cold-standby transfer node X, provided with four storage directories, namely /data_a, /data_b, /data_c, and /data_d. The server hosting the cold-standby transfer node X may be independent of the servers corresponding to the redis instances of the second codis cluster.
In some embodiments, after the four master redis instances A', B', C', and D' are unbound from the four cloud storage volumes, each cloud storage volume may be mounted to a different directory of the cold-standby transfer node X, so that the cloud storage volumes are bound to node X. For example, EBS volumes 1-4 are bound to the disk partitions of node X; then EBS volume 1 is mounted to the /data_a directory, EBS volume 2 to /data_b, EBS volume 3 to /data_c, and EBS volume 4 to /data_d.
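The mounting described above might look like the following sketch, run as root on node X; the /dev/vd* device names are assumptions, since the real names depend on the cloud platform.

    # Hedged sketch: mount each bound EBS volume onto its directory on the
    # cold-standby transfer node X. Device names are assumed.
    import subprocess

    VOLUME_MOUNTS = {
        "/dev/vdb": "/data_a",  # EBS volume 1
        "/dev/vdc": "/data_b",  # EBS volume 2
        "/dev/vdd": "/data_c",  # EBS volume 3
        "/dev/vde": "/data_d",  # EBS volume 4
    }

    for device, mount_point in VOLUME_MOUNTS.items():
        subprocess.run(["mkdir", "-p", mount_point], check=True)    # ensure the directory exists
        subprocess.run(["mount", device, mount_point], check=True)  # attach the volume to it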
Step S104: configure a periodic backup task.
In some embodiments, after the cloud storage volumes are bound to the cold-standby transfer node X, a periodic task may be configured that periodically copies the RDB backup files from the primary redis instance servers of A, B, C, and D of the first codis cluster to the /data_a, /data_b, /data_c, and /data_d directories of node X.
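For illustration only, such a periodic task could be a small script scheduled from cron on node X; the host names, RDB paths, and schedule below are assumptions rather than details from the patent.

    # Hedged sketch of the periodic backup task: pull each primary instance's RDB
    # dump into the matching cloud-volume directory on the cold-standby transfer node X.
    import subprocess

    RDB_SOURCES = {
        "redis-a.main-room:/var/lib/redis/dump.rdb": "/data_a/",
        "redis-b.main-room:/var/lib/redis/dump.rdb": "/data_b/",
        "redis-c.main-room:/var/lib/redis/dump.rdb": "/data_c/",
        "redis-d.main-room:/var/lib/redis/dump.rdb": "/data_d/",
    }

    def pull_rdb_backups() -> None:
        for source, target_dir in RDB_SOURCES.items():
            # scp copies the file over SSH; any remote-copy mechanism would serve
            subprocess.run(["scp", source, target_dir], check=True)

    if __name__ == "__main__":
        pull_rdb_backups()

A crontab entry such as */30 * * * * python3 /opt/coldbackup/pull_rdb.py (interval assumed) would then produce the backup timing shown in fig. 3.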
In some embodiments, initializing the cluster backup environment further includes setting up the cluster fault management module. The cluster fault management module can be placed in the standby machine room, or in a third machine room other than the main and standby machine rooms, and it controls the switching of the cache cluster that provides services to applications between the first codis cluster and the second codis cluster.
In some embodiments, after the second codis cluster is configured through the settings of the above embodiments, and a communication connection is established between the first codis cluster and the cold-standby transfer node X through the periodic task, the system operates as shown in fig. 3.
As shown in fig. 3, while the first codis cluster operates normally, the second codis cluster may stay powered off. Each codis server in the first codis cluster backs up its rdb file to the corresponding directory of the cloud storage volumes on the cold-standby transfer node X, and after each backup is completed, node X returns a backup-completion notification to the first codis cluster.
In some embodiments, the cluster fault management module may access a plurality of services in the first codis cluster in real time and, if an access fails, record the time of the access failure. When the number of services whose access fails exceeds a preset threshold, it can be determined that a major failure, such as a machine-room-level or cluster-level failure, has occurred in the first codis cluster; the time of the major failure is recorded and can be determined from the times of the multiple service access failures. In general, when a major failure occurs, the access failure times of the multiple services are identical or close to each other.
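A sketch of this detection rule is given below; the probe targets, threshold, and timeout are assumed values that do not appear in the patent.

    # Hedged sketch: declare a major failure when access probes to more services
    # than a preset threshold fail at (nearly) the same time.
    import time
    import redis

    SERVICE_ADDRS = ["10.0.0.11:19000", "10.0.0.12:19000", "10.0.0.13:19000"]  # assumed
    FAILURE_THRESHOLD = 2  # assumed preset threshold

    def probe_failures():
        """Probe every service once; return (address, failure time) for each failure."""
        failures = []
        for addr in SERVICE_ADDRS:
            host, port = addr.rsplit(":", 1)
            try:
                redis.Redis(host=host, port=int(port), socket_timeout=1).ping()
            except redis.RedisError:
                failures.append((addr, time.time()))
        return failures

    failures = probe_failures()
    if len(failures) > FAILURE_THRESHOLD:
        # the failure times are close together, so take the earliest as the fault time
        fault_time = min(t for _, t in failures)
        print(f"major failure detected at {fault_time}")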
If it is determined that a major failure has occurred in the first codis cluster, the cluster fault management module can bring up the second codis cluster and switch the service provided to external applications from the first codis cluster to the second codis cluster. Referring to fig. 4, a schematic diagram of cluster switching according to some embodiments is shown.
As shown in fig. 4, after detecting that the codis cache cluster of the main machine room has been failing for longer than a specified time, the cluster fault management module decides to switch to the standby machine room, and the switch specifically includes the following steps:
1) Unbinding the cloud storage volumes.
In some embodiments, the unbinding function of the cloud storage volume of the cloud platform is called to unbind EBS volume 1 from the /data_a directory of the cold-standby transfer node X, EBS volume 2 from the /data_b directory, EBS volume 3 from the /data_c directory, and EBS volume 4 from the /data_d directory.
In some embodiments, after the cloud storage volumes are unbound from the cold-standby transfer node X, the periodic backup task described in the above embodiments ends automatically.
2) Binding the cloud storage volumes.
By calling the cloud storage volume binding function of the cloud platform, EBS volume 1 is bound to master-slave instance A' of the second codis cluster, EBS volume 2 to master-slave instance B', EBS volume 3 to master-slave instance C', and EBS volume 4 to master-slave instance D'.
In some embodiments, after the cluster fault management module binds the cloud storage volumes to the second codis cluster, it may perform the following operation: start each server node of the second codis cluster, i.e., the servers provided with the instances A', B', C', and D'. Once started, each server node of the second codis cluster can use the backup data in the cloud storage volumes to provide services to applications.
3) Starting each server node of the second codis cluster.
In some embodiments, the start-instance interface of the cloud server of the cloud platform is called with the instance ids designated as the master-slave instances of the second codis cluster; once the designation is complete, the starting of each server node of the second codis cluster is done. Each server node leaves the shutdown state and enters the running state. After the cluster enters the running state, each server node automatically starts its redis process, and once the redis process finishes loading the RDB file, the standby cluster has the capability of external service.
In some embodiments, after each server node of the second codis cluster has started, the current codis cluster of the system needs to be set to the second codis cluster, so that the second codis cluster can begin external service.
4) Setting the codis cluster providing the service to the second codis cluster.
In some embodiments, the first codis cluster provides services to the business before it fails, and can no longer do so afterwards. To switch the business from the first codis cluster to the second codis cluster, after the server nodes of the second codis cluster are started, the codis cluster through which the system provides services can be set to the second codis cluster, where the business refers to the applications or processes that accessed the first codis cluster before it failed. The specific settings for making the second codis cluster the system's current codis cluster are as follows: configure the load balancing service of the system to the codis-proxy of the second codis cluster; modify the connection address of the business or application to the load balancing address of the second codis cluster; and decide whether to restart the business to rebuild the codis connection pool according to whether the business supports automatic connection reestablishment: a business that supports automatic reestablishment is not restarted, while an application that does not is restarted so that it can continue communicating with the system.
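Putting steps 1) to 4) together, a hedged orchestration sketch follows. The cloud and lb objects stand in for whatever volume, instance, and load-balancing interfaces the cloud platform actually exposes; every name in this sketch is hypothetical.

    # Hedged sketch of the whole switch; "cloud" and "lb" are hypothetical clients
    # for the cloud platform's volume/instance APIs and the load balancing service.
    def switch_to_standby(cloud, lb, volume_ids, standby_instance_ids, standby_proxies):
        # 1) unbind each cloud storage volume from the cold-standby transfer node X
        for volume in volume_ids:
            cloud.unbind_volume(volume)
        # 2) bind each volume to its matching standby master-slave instance (A'..D')
        for volume, instance in zip(volume_ids, standby_instance_ids):
            cloud.bind_volume(volume, instance)
        # 3) start the standby servers; on boot each auto-starts redis, which loads
        #    the RDB file found in its data directory on the cloud storage volume
        for instance in standby_instance_ids:
            cloud.start_instance(instance)
        # 4) point the load balancing service at the second cluster's codis-proxy;
        #    businesses that cannot rebuild their connection pools are restarted
        lb.set_backends(standby_proxies)

Because the RDB files already sit on the re-bound volumes, no bulk data copy happens during the switch, which is what makes the recovery fast.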
In some embodiments, after the cluster fault management module switches the service provided to external applications from the first codis cluster to the second codis cluster, it still monitors the access state of the first codis cluster in real time. If the first codis cluster can be accessed normally multiple times over a period of time, its access state remains normal, and the clearing of its fault can be determined from that state. After the fault of the first codis cluster is cleared, the cluster fault management module can reset the first codis cluster to provide services to external applications.
According to the above embodiments, a cold-standby transfer node is arranged in the cold-standby machine room, a plurality of cloud storage volumes are arranged on the cold-standby transfer node, and the cloud storage volumes correspond one-to-one to the instances in the cold-standby machine room. While the primary codis cluster operates normally, the cloud storage volumes are bound to the cold-standby transfer node, the rdb files of the instances of the primary codis cluster are backed up to the corresponding cloud storage volumes, and the cold-standby codis cluster can remain shut down, saving resources; when the primary codis cluster fails, the cloud storage volumes can be bound to the instances of the cold-standby machine room, those instances are then started, and external access is served through them.
Since the above embodiments refer to and combine with one another, the same portions recur across different embodiments, and the same and similar portions among the various embodiments in this specification may be referenced against each other; they are not described again here.
It is noted that, in this specification, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a circuit structure, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such circuit structure, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the circuit structure, article, or apparatus comprising that element.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
The above embodiments of the present application do not limit the scope of the present application.

Claims (10)

1. A server, comprising a first cache cluster, a second cache cluster, a cluster fault management module, and a cold-standby transfer node, wherein the cluster fault management module is configured to:
monitor whether a preset fault occurs in the first cache cluster;
when it is monitored that the preset fault occurs in the first cache cluster, unbind cloud storage volumes from the cold-standby transfer node and bind the cloud storage volumes to the second cache cluster, wherein the cold-standby transfer node is provided with a plurality of directories in one-to-one correspondence with the cloud storage volumes, the number of cloud storage volumes is the same as the number of instances of the first cache cluster, each cloud storage volume is configured to be bound to the cold-standby transfer node and store backup data from the first cache cluster while the preset fault has not occurred in the first cache cluster, the second cache cluster is provided with instances corresponding to the first cache cluster, the backup data comprises an rdb file, and the second cache cluster remains in a shutdown state while the preset fault has not occurred in the first cache cluster;
send a start instruction to the second cache cluster, so that the second cache cluster automatically starts a redis process to load the rdb file and thereby acquires the capability of external service;
and set the cache cluster for external service to the second cache cluster.
2. The server according to claim 1, wherein monitoring whether the preset fault occurs in the first cache cluster comprises:
monitoring access state information of the first cache cluster; if the access state information is access failure information and the time interval between two pieces of access failure information exceeds a first duration, determining that the preset fault has occurred in the first cache cluster; otherwise, determining that the preset fault has not occurred in the first cache cluster.
3. The server according to claim 1, wherein unbinding the cloud storage volumes from the cold-standby transfer node and binding the cloud storage volumes to the second cache cluster comprises:
calling the unbinding function interface of the cloud storage volume, and unbinding the cloud storage volume from the cold-standby transfer node through a device unbinding command;
and calling the binding function interface of the cloud storage volume, and binding the cloud storage volume to the devices of the second cache cluster through a device binding command.
4. The server according to claim 1, wherein sending a start instruction to the second cache cluster to cause the second cache cluster to start a service process comprises:
and calling a starting instance interface, and designating an instance id as the instance id of the second cache cluster.
5. The server according to claim 1, wherein the first cache cluster and the second cache cluster are both codis clusters, and setting the cache cluster for external service to the second cache cluster comprises:
configuring the load balancing service to the codis-proxy of the second cache cluster;
modifying the service connection address to the load balancing address of the second cache cluster;
and, if the service supports automatic connection reestablishment, reestablishing the connection without restarting the service; otherwise, restarting the service to rebuild the codis connection pool.
6. The server according to claim 1, wherein unbinding the cloud storage volumes from the cold-standby transfer node and binding the cloud storage volumes to the second cache cluster comprises:
unbinding each cloud storage volume from its corresponding directory in the cold-standby transfer node;
and binding each cloud storage volume to the corresponding instance of the second cache cluster.
7. The server of claim 1, wherein the server is further configured to:
before monitoring whether the preset fault occurs in the first cache cluster, binding each cloud storage volume in advance to the corresponding instance of the second cache cluster, configuring the startup data directory of each instance of the second cache cluster on the corresponding cloud storage volume, installing the cluster application, and configuring the startup script of the cluster application to start automatically at boot;
and shutting down the devices of the second cache cluster, and unbinding the cloud storage volumes from the bound instances of the second cache cluster.
8. The server of claim 1, wherein the server is further configured to:
before monitoring whether the preset fault occurs in the first cache cluster, configuring and executing a periodic backup task in advance, the periodic backup task being configured to periodically back up the data of each instance of the first cache cluster to the corresponding directory of the cloud storage volumes.
9. A method for rapid fault recovery of a cache cluster, comprising:
monitoring whether a preset fault occurs in a first cache cluster;
when it is monitored that the preset fault occurs in the first cache cluster, unbinding cloud storage volumes from a cold-standby transfer node and binding the cloud storage volumes to a second cache cluster, wherein the cold-standby transfer node is provided with a plurality of directories in one-to-one correspondence with the cloud storage volumes, the number of cloud storage volumes is the same as the number of instances of the first cache cluster, each cloud storage volume is configured to be bound to the cold-standby transfer node and store backup data from the first cache cluster while the preset fault has not occurred in the first cache cluster, the second cache cluster is provided with instances corresponding to the first cache cluster, the backup data comprises an rdb file, and the second cache cluster remains in a shutdown state while the preset fault has not occurred in the first cache cluster;
sending a start instruction to the second cache cluster, so that the second cache cluster automatically starts a redis process to load the rdb file and thereby acquires the capability of external service;
and setting the cache cluster for external service to the second cache cluster.
10. The method for rapid fault recovery of a cache cluster according to claim 9, wherein unbinding the cloud storage volumes from the cold-standby transfer node and binding the cloud storage volumes to the second cache cluster comprises:
unbinding each cloud storage volume from its corresponding directory in the cold-standby transfer node;
and binding each cloud storage volume to the corresponding instance of the second cache cluster.
CN202110729463.1A 2021-06-29 2021-06-29 Server and cache cluster fault rapid recovery method Active CN113434340B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729463.1A CN113434340B (en) 2021-06-29 2021-06-29 Server and cache cluster fault rapid recovery method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110729463.1A CN113434340B (en) 2021-06-29 2021-06-29 Server and cache cluster fault rapid recovery method

Publications (2)

Publication Number Publication Date
CN113434340A CN113434340A (en) 2021-09-24
CN113434340B true CN113434340B (en) 2022-11-25

Family

ID=77757774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729463.1A Active CN113434340B (en) 2021-06-29 2021-06-29 Server and cache cluster fault rapid recovery method

Country Status (1)

Country Link
CN (1) CN113434340B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114138732A (en) * 2021-09-29 2022-03-04 聚好看科技股份有限公司 Data processing method and device
CN117076180B (en) * 2023-09-04 2024-05-28 深信服科技股份有限公司 Information processing method, device, equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107256184A (en) * 2017-06-05 2017-10-17 郑州云海信息技术有限公司 A kind of data disaster backup method and device based on storage pool
CN108647118A (en) * 2018-05-15 2018-10-12 新华三技术有限公司成都分公司 Copy abnormal restoring method, device and computer equipment based on storage cluster
CN112463451A (en) * 2020-12-02 2021-03-09 中国工商银行股份有限公司 Cache disaster recovery cluster switching method and soft load balancing cluster device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10108502B1 (en) * 2015-06-26 2018-10-23 EMC IP Holding Company LLC Data protection using checkpoint restart for cluster shared resources
CN106412011A (en) * 2016-08-30 2017-02-15 广州鼎甲计算机科技有限公司 High-availability cluster system without shared storage among multiple nodes, and implementation
CN110377459A (en) * 2019-06-28 2019-10-25 苏州浪潮智能科技有限公司 A kind of disaster tolerance system, disaster tolerance processing method, monitoring node and backup cluster
CN112214351A (en) * 2020-10-12 2021-01-12 珠海格力电器股份有限公司 Backup data recovery method and device, electronic equipment and storage medium
CN112653723A (en) * 2020-11-19 2021-04-13 苏州浪潮智能科技有限公司 Cross-cloud-platform storage volume migration method, device and system based on scsi protocol

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107256184A (en) * 2017-06-05 2017-10-17 郑州云海信息技术有限公司 A kind of data disaster backup method and device based on storage pool
CN108647118A (en) * 2018-05-15 2018-10-12 新华三技术有限公司成都分公司 Copy abnormal restoring method, device and computer equipment based on storage cluster
CN112463451A (en) * 2020-12-02 2021-03-09 中国工商银行股份有限公司 Cache disaster recovery cluster switching method and soft load balancing cluster device

Also Published As

Publication number Publication date
CN113434340A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN108683516B (en) Application instance upgrading method, device and system
CN107526659B (en) Method and apparatus for failover
CN113434340B (en) Server and cache cluster fault rapid recovery method
US9037899B2 (en) Automated node fencing integrated within a quorum service of a cluster infrastructure
CN111327467A (en) Server system, disaster recovery backup method thereof and related equipment
CN108984349B (en) Method and device for electing master node, medium and computing equipment
CN103200036B (en) A kind of automation collocation method of electric power system cloud computing platform
WO2020001354A1 (en) Master/standby container system switch
CN112380062A (en) Method and system for rapidly recovering system for multiple times based on system backup point
CN115562911B (en) Virtual machine data backup method, device, system, electronic equipment and storage medium
CN110069365B (en) Method for managing database and corresponding device, computer readable storage medium
CN115658390A (en) Container disaster tolerance method, system, device, equipment and computer readable storage medium
CN112860787A (en) Method for switching master nodes in distributed master-slave system, master node device and storage medium
CN113515316A (en) Novel edge cloud operating system
CN111917588A (en) Edge device management method, device, edge gateway device and storage medium
CN101251815B (en) System and method for recoverring computer system
CN104052799A (en) Method for achieving high availability storage through resource rings
EP4443291A1 (en) Cluster management method and device, and computing system
CN114598711B (en) Data migration method, device, equipment and medium
CN114676118B (en) Database switching method, device, equipment and storage medium
CN111090537A (en) Cluster starting method and device, electronic equipment and readable storage medium
CN112612652A (en) Distributed storage system abnormal node restarting method and system
CN101131653A (en) Perspective communication method between super operating system and its intermedium
CN107783855B (en) Fault self-healing control device and method for virtual network element
CN102999403B (en) A kind ofly call the test disposal route of PC, system and server

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant