CN114598594A

CN114598594A - Method, system, medium and device for processing application faults under multiple clusters

Info

Publication number: CN114598594A
Application number: CN202210247855.9A
Authority: CN
Inventors: 牛乐川; 颜开; 孙亮; 戴秋萍; 郭峰
Original assignee: Shanghai Daoke Network Technology Co ltd
Current assignee: Shanghai Daoke Network Technology Co ltd
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2022-06-07
Anticipated expiration: 2042-03-14
Also published as: CN114598594B

Abstract

Embodiments of the present application provide a method and a system for processing application faults under multiple clusters, a computer-readable storage medium, and an electronic device. In the method, the application is deployed in a first cluster, the first cluster belongs to a first application cluster group, and the clusters in the first application cluster group other than the first cluster form a second application cluster group. Each cluster in the first application cluster group is provided with a disaster recovery system, and the disaster recovery system of the first cluster serves as the disaster recovery main system of the application, backing up the application to all clusters in the second application cluster group. The method comprises: in response to the application failing, the disaster recovery main system stops backing up the application; and the disaster recovery system of a second cluster is determined as a new disaster recovery main system, which backs up the new application in the second cluster to the clusters in the second application cluster group other than the second cluster. Thus, when the application in the first cluster fails, both the disaster recovery main system and the application being backed up are switched, so that the application synchronization mechanism remains effective over the long term.

Description

Method, system, medium and device for processing application faults under multiple clusters

Technical Field

The present application relates to the field of cloud-native technologies, and in particular, to a method and a system for processing an application failure in multiple clusters, a computer-readable storage medium, and an electronic device.

Background

In production practice, an enterprise typically deploys application instances in a master/backup cluster mode: application instances are deployed in both a master cluster and a backup cluster, and the application instances in the backup cluster (also referred to as application copies) synchronize the running state of the application in the master cluster. When the application in the master cluster fails and cannot respond normally to external access traffic, the application copy in the backup cluster responds in its place; where there are several backup clusters, the application copy of one of them is selected to respond. Specifically, when an application in the master cluster fails, the multi-cluster administrator manually modifies the load-balancing traffic distribution configuration to switch the external access traffic for the application to the application copy in the backup cluster.

In the prior art, traffic is switched manually when an application in the master cluster fails. However, after an application copy in a backup cluster becomes the new application, the synchronization mechanism between the original application and its application copies cannot automatically update its synchronization configuration and is left in a failed state. In other words, the original application synchronization mechanism cannot adapt automatically to the traffic switch: application instances in the other clusters cannot synchronize the running state of the new application, and the new application has no corresponding application copy.

Therefore, there is a need to provide an improved solution to the above-mentioned deficiencies of the prior art.

Disclosure of Invention

An object of the present application is to provide a method, a system, a computer-readable storage medium, and an electronic device for processing application failures in multiple clusters, so as to solve or alleviate the above problems in the prior art.

In order to achieve the above purpose, the present application provides the following technical solutions:

An embodiment of the present application provides a method for processing application faults under multiple clusters. An application is deployed in a first cluster, the first cluster belongs to a first application cluster group, and the clusters in the first application cluster group other than the first cluster form a second application cluster group. Each cluster in the first application cluster group is deployed with a disaster recovery system, and the disaster recovery system of the first cluster serves as the disaster recovery main system of the application and is used for backing up the application to all clusters in the second application cluster group. The method comprises: in response to the application failing, stopping, by the disaster recovery main system, the backup of the application; and determining a new disaster recovery main system among the disaster recovery systems of all clusters in the second application cluster group, wherein the new disaster recovery main system is deployed in a second cluster, the application copy corresponding to the application in the second cluster becomes a new application, and the new disaster recovery main system is used for backing up the new application to the clusters in the second application cluster group other than the second cluster.

Preferably, the disaster recovery system of the first cluster becomes the disaster recovery main system of the application based on the received application management resource object, and determines all clusters in the first application cluster group according to the content of the application management resource object.

Preferably, after determining all clusters in the first application cluster group according to the content of the application management resource object, the method further includes: the disaster recovery main system of the application synchronizes the content of the application management resource object to the disaster recovery systems of all clusters in the second application cluster group; and the disaster recovery system of each cluster in the second application cluster group acquires all clusters in the first application cluster group according to the received content of the application management resource object.

Preferably, the stopping of the backup of the application by the disaster recovery main system includes: modifying the content of the cluster management resource object in the disaster recovery main system so that the disaster recovery main system loses the operation authority over the application.

Preferably, after the disaster recovery main system stops backing up the application, the method further includes: the disaster recovery main system becomes a disaster recovery auxiliary system and notifies the disaster recovery systems of all clusters in the second application cluster group; and the disaster recovery systems of all clusters in the second application cluster group mark the first cluster as being in a maintenance state.

Preferably, the method further includes: in response to the application returning to normal, the disaster recovery auxiliary system notifies the disaster recovery systems of all clusters in the second application cluster group; the disaster recovery systems of all clusters in the second application cluster group remove the maintenance state mark of the first cluster; and the new disaster recovery main system backs up the new application to the first cluster.

Preferably, the method further comprises the following steps: in response to the application failing, the disaster recovery main system instructs a load balancing component to stop forwarding the external access traffic of the application to the first cluster; after determining a new disaster recovery main system, the new disaster recovery main system instructs the load balancing component to forward the external access traffic of the application to the second cluster.
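The traffic switching described above can be illustrated with a minimal sketch. The `LoadBalancer` class, its method names, and the cluster identifiers are hypothetical and chosen only for illustration; the application does not specify the interface of the load balancing component.

```python
class LoadBalancer:
    """Hypothetical load balancing component: maps an application to target clusters."""

    def __init__(self):
        self._targets = {}

    def stop_forwarding(self, app, cluster):
        # Remove the cluster from the application's targets, if present.
        self._targets.get(app, set()).discard(cluster)

    def start_forwarding(self, app, cluster):
        self._targets.setdefault(app, set()).add(cluster)

    def targets(self, app):
        return sorted(self._targets.get(app, set()))


def switch_traffic(lb, app, failed_cluster, new_main_cluster):
    # Instruction issued by the original main system when the failure is detected.
    lb.stop_forwarding(app, failed_cluster)
    # Instruction issued by the new main system once it has been determined.
    lb.start_forwarding(app, new_main_cluster)
```

In this sketch the two instructions are issued in sequence by one function; in the described method they originate from two different disaster recovery systems.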

An embodiment of the present application further provides a system for processing application faults under multiple clusters. An application is deployed in a first cluster, the first cluster belongs to a first application cluster group, and the clusters in the first application cluster group other than the first cluster form a second application cluster group. Each cluster in the first application cluster group is deployed with a disaster recovery system, and the disaster recovery system of the first cluster serves as the disaster recovery main system of the application and is configured to back up the application to all clusters in the second application cluster group. The system includes: a stopping unit configured to cause the disaster recovery main system to stop backing up the application in response to the application failing; and a backup unit configured to determine a new disaster recovery main system among the disaster recovery systems of all clusters in the second application cluster group, wherein the new disaster recovery main system is deployed in a second cluster, the application copy corresponding to the application in the second cluster becomes a new application, and the new disaster recovery main system is used for backing up the new application to the clusters in the second application cluster group other than the second cluster.

An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed, implements the method for processing application faults under multiple clusters as described above.

An embodiment of the present application further provides an electronic device, including: a memory, a processor, and a program stored in the memory and executable on the processor, where the processor executes the program to implement the method for processing application faults under multiple clusters as described in any one of the above.

Advantageous effects:

In the technical solution provided in the embodiments of the present application, the application is deployed in a first cluster belonging to a first application cluster group, and the clusters in the first application cluster group other than the first cluster form a second application cluster group. A disaster recovery system is deployed in each cluster in the first application cluster group, and the disaster recovery system of the first cluster serves as the disaster recovery main system of the designated application to be backed up, backing up the application to all clusters in the second application cluster group. When the application to be backed up in the first cluster fails, the disaster recovery system of the first cluster stops backing up the application, and a new disaster recovery main system is determined among the disaster recovery systems of all clusters in the second application cluster group. The application copy corresponding to the application in the second cluster becomes the new application to be backed up, and the new disaster recovery main system in the second cluster backs up the new application to the clusters in the second application cluster group other than the second cluster. Thus, when the application to be backed up in the first cluster fails, the original disaster recovery main system immediately stops synchronizing the application, a new disaster recovery main system is determined from the remaining disaster recovery systems, and the new disaster recovery main system synchronizes the new application to be backed up. The disaster recovery main system and the application to be backed up are therefore switched automatically upon failure, so that the application synchronization mechanism remains effective over the long term.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. Wherein:

FIG. 1 is a logical schematic diagram of establishing connections between clusters provided in accordance with some embodiments of the present application;

FIG. 2 is a logical schematic diagram of selecting a disaster recovery main system according to some embodiments of the present application;

FIG. 3 is a logical schematic diagram of establishing an application cluster group provided according to some embodiments of the present application;

FIG. 4 is a flow chart illustrating a method for handling application failures in multiple clusters according to some embodiments of the present disclosure;

FIG. 5 is a logic diagram of external access traffic switching upon application failure, provided in accordance with some embodiments of the present application;

FIG. 6 is a schematic diagram illustrating an architecture of a system for handling application failures in multiple clusters according to some embodiments of the present application;

FIG. 7 is a schematic structural diagram of an electronic device provided in accordance with some embodiments of the present application;

FIG. 8 is a schematic diagram of a hardware configuration of an electronic device provided according to some embodiments of the present application.

Detailed Description

The present application will be described in detail below with reference to the embodiments with reference to the attached drawings. The various examples are provided by way of explanation of the application and are not limiting of the application. In fact, it will be apparent to those skilled in the art that modifications and variations can be made in the present application without departing from the scope or spirit of the application. For instance, features illustrated or described as part of one embodiment, can be used with another embodiment to yield a still further embodiment. It is therefore intended that the present application cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

In a multi-cluster scene, the disaster recovery system is deployed in each cluster in a distributed manner in the embodiment of the application, and is used for synchronizing the running states of applications in the multi-cluster scene.

For convenience of explanation, the cluster in which the designated application to be backed up (that is, the application to be synchronized) is located is referred to as the first cluster.

A cluster group consisting of the first cluster and at least one backup cluster corresponding to the designated application to be backed up is referred to as a first cluster group, and a cluster group consisting of the at least one backup cluster corresponding to the designated application to be backed up is referred to as a second cluster group.

According to different functions of the disaster recovery system in the process of synchronizing the running state of the designated application to be backed up, the disaster recovery systems in all the clusters in the first cluster group can be divided into a disaster recovery main system and a disaster recovery auxiliary system, and the disaster recovery main system and at least one disaster recovery auxiliary system in all the clusters in the first cluster group jointly form a complete disaster recovery system.

The disaster recovery system in the first cluster is responsible for synchronizing the running state of the designated application to be backed up to all the backup clusters corresponding to the application. The disaster recovery system in the first cluster is therefore the disaster recovery main system corresponding to the designated application to be backed up, and the disaster recovery systems in all the backup clusters in the second cluster group are the disaster recovery auxiliary systems corresponding to that application. Correspondingly, the first cluster, in which the disaster recovery main system is located, may also be referred to as the main cluster, and each cluster in the second cluster group, in which a disaster recovery auxiliary system is deployed, is a backup cluster corresponding to the main cluster.

That is, a designated application to be backed up (simply referred to as "application") is deployed in a first cluster belonging to a first application cluster group, and clusters other than the first cluster in the first application cluster group constitute a second application cluster group; each cluster in the first application cluster group is provided with a disaster recovery backup system, and the disaster recovery backup system of the first cluster is used as a disaster recovery backup main system of the application and used for backing up the application to all clusters in the second application cluster group.

To more clearly illustrate the method for processing application failures in multiple clusters proposed in the embodiments of the present application, a process of specifying an application to be backed up and a corresponding backup cluster and establishing an application backup system with the specified application to be backed up as a core is described first.

FIG. 1 is a logical schematic diagram of establishing connections between clusters provided in accordance with some embodiments of the present application. As shown in FIG. 1, the method for processing application faults provided in the embodiments of the present application is suitable for a multi-cluster scenario in which a disaster recovery system is deployed in each cluster in a distributed manner. The multi-cluster administrator writes the cluster resource management object into the user-defined cluster management resource deployed on the multi-cluster management system, and the multi-cluster management system sends the connection authentication information used for establishing connections between clusters to each cluster, thereby connecting the clusters. Through the connection authentication information, each cluster can connect to the other clusters, and the distributed disaster recovery systems in the clusters establish real-time detection and communication channels, so that the distributed disaster recovery systems are interconnected.

FIG. 2 is a logical schematic diagram of selecting a disaster recovery main system according to some embodiments of the present application. As shown in FIG. 2, the method for processing application faults under multiple clusters further includes: the disaster recovery system of the first cluster becomes the disaster recovery main system of the application based on the received application management resource object, and determines all clusters in the first application cluster group according to the content of the application management resource object.

Specifically, the application administrator writes the application management resource object into the disaster recovery system of the cluster where the designated application to be backed up is located, so that this cluster becomes the main cluster and its disaster recovery system becomes the disaster recovery main system. The application-to-be-backed-up field in the application management resource object indicates the unique identification information (such as name, ID, and the like) of the application to be backed up, and the standby cluster field lists the unique identification information (such as name, ID, access address, and the like) of each designated standby cluster. By parsing the content of the application management resource object, the disaster recovery main system can therefore determine the designated application to be backed up and the designated at least one standby cluster, and hence all clusters in the first application cluster group, namely the main cluster and all the standby clusters.
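As a minimal sketch of the parsing step described above, the application management resource object can be modeled as a record with an application field and a standby cluster field. The class name, field names, and validation rules below are assumptions made for illustration; the application does not define a concrete schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ApplicationManagementResource:
    """Hypothetical shape of the application management resource object."""
    application: str                   # unique identifier of the application to back up
    standby_clusters: Tuple[str, ...]  # unique identifiers of the designated standby clusters


def resolve_cluster_groups(resource: ApplicationManagementResource,
                           local_cluster: str) -> Tuple[List[str], List[str]]:
    """Return (first_group, second_group) as seen by the system that received the object.

    The receiving cluster is treated as the main cluster (assumed rule).
    """
    if not resource.standby_clusters:
        raise ValueError("at least one standby cluster must be designated")
    if local_cluster in resource.standby_clusters:
        raise ValueError("the main cluster cannot also be a standby cluster")
    # De-duplicate while preserving the order in which clusters were listed.
    second_group = list(dict.fromkeys(resource.standby_clusters))
    first_group = [local_cluster] + second_group
    return first_group, second_group
```

Here the first application cluster group is simply the main cluster plus the listed standby clusters, and the second application cluster group is the list of standby clusters.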

FIG. 3 is a logical schematic diagram of establishing an application cluster group provided according to some embodiments of the present application. As shown in FIG. 3, after determining all clusters in the first application cluster group according to the content of the application management resource object, the method for processing application faults under multiple clusters further includes: the disaster recovery main system of the application synchronizes the content of the application management resource object to the disaster recovery systems of all the clusters in the second application cluster group; and the disaster recovery system of each cluster in the second application cluster group learns all clusters in the first application cluster group from the received content of the application management resource object.

After the disaster recovery main system determines all the standby clusters, it can use the connection authentication information issued by the multi-cluster management system to actively send a connection request to each standby cluster, establish a connection with the disaster recovery system in that standby cluster, and synchronize the content of the application management resource object to the disaster recovery systems of all the standby clusters, that is, the disaster recovery systems of all clusters in the second application cluster group. It should be understood that, at the point where the disaster recovery main system has established a connection with the disaster recovery system of each cluster in the second application cluster group, no connections have yet been established among the disaster recovery systems of the clusters in the second application cluster group; that is, the disaster recovery system of each cluster in the second application cluster group does not yet know which other clusters the second application cluster group contains.

The disaster recovery main system synchronizes the content of the application management resource object to the disaster recovery systems of all clusters in the second application cluster group, and the disaster recovery system of each cluster in the second application cluster group parses the received content to obtain all clusters in the first application cluster group, including the main cluster. Each of these disaster recovery systems can then use the connection authentication information issued by the multi-cluster management system to actively send connection requests to the other backup clusters in the second application cluster group and establish connections with their disaster recovery systems, so that a connection is established between every pair of disaster recovery systems in the first application cluster group.
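The two connection phases described above can be sketched as follows. The function names and cluster identifiers are hypothetical; connections are modeled as unordered pairs, and authentication is omitted.

```python
from itertools import combinations


def connections_after_primary_sync(main_cluster, standby_clusters):
    """Links opened by the main system: one to every standby cluster."""
    return {frozenset((main_cluster, s)) for s in standby_clusters}


def connections_after_standby_sync(main_cluster, standby_clusters):
    """Links once each standby system has parsed the synchronized object
    and connected to the other standby clusters."""
    links = connections_after_primary_sync(main_cluster, standby_clusters)
    for a, b in combinations(standby_clusters, 2):
        links.add(frozenset((a, b)))
    return links
```

With one main cluster and three standby clusters, the first phase yields three links (a star centered on the main cluster) and the second phase yields six, one for every pair of clusters in the first application cluster group.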

Based on the above description, after an application administrator writes an application management resource object into the disaster recovery system of the cluster where the designated application to be backed up is located, that cluster becomes the main cluster corresponding to the application, the disaster recovery system of the main cluster becomes the disaster recovery main system corresponding to the application, and the disaster recovery systems of the designated backup clusters become disaster recovery auxiliary systems relative to it. The connection relationship between the disaster recovery main system and the disaster recovery auxiliary systems is thus established through the application management resource object written by the application administrator. It can be understood that, if an application administrator writes different application management resource objects into the disaster recovery systems of the clusters where several designated applications to be backed up are located (the applications may be located in the same cluster or in different clusters), and designates at least one backup cluster for each application in its application management resource object, then a plurality of application backup systems, each including a disaster recovery main system, at least one disaster recovery auxiliary system, a main cluster, and at least one backup cluster, may be formed within a cluster group made up of a plurality of clusters.

Specifically, a configuration manager may be provided in each disaster recovery system, and configured to receive an application management resource object written by an application manager, and the application manager may further modify, by accessing the configuration manager, the content of the application management resource object to change the specified application to be backed up and the specified at least one backup cluster, and configure an operation process of the disaster recovery system to synchronize the content of the application management resource object to the disaster recovery systems of all clusters in the second application cluster group.

Therefore, the application to be backed up and the corresponding standby cluster are designated in a mode that an application administrator writes in the application management resource object, and the disaster recovery main system, the disaster recovery auxiliary system, the main cluster and the standby cluster corresponding to the application are defined, so that the plurality of clusters and the disaster recovery system deployed on the clusters can synchronize a plurality of designated applications to be backed up in different clusters to the corresponding standby clusters at the same time.

After an application backup system taking the specified application to be backed up as a core is established, the disaster recovery main system synchronizes the running state of the specified application to be backed up to the specified at least one standby cluster according to the content of the application management resource object.

When the designated application to be backed up normally runs, the disaster recovery main system synchronizes the running state of the designated application to be backed up to all the standby clusters, namely all the standby clusters in the second application cluster group, and generates application copies corresponding to the designated application to be backed up in all the standby clusters, so that the running state of the application copies in each standby cluster is always consistent with the designated application to be backed up in the main cluster.

When a designated application to be backed up fails, the failure is handled using the method for processing application faults under multiple clusters provided in the embodiments of the present application. FIG. 4 is a schematic flow diagram of the method provided in some embodiments of the present application. As shown in FIG. 4, the method for processing application faults under multiple clusters includes:

Step S401: in response to the application failing, the disaster recovery main system stops backing up the application.

It should be noted that, while backing up the designated application to be backed up, the disaster recovery main system also detects the running state of the designated application to be backed up in real time, and when the disaster recovery main system monitors that the designated application to be backed up has a fault and cannot normally respond to external access traffic, immediately stops synchronizing the designated application to be backed up.

It can be understood that, if the designated application to be backed up fails, it cannot respond normally to external access traffic and no longer generates new valid running state information. Synchronization of its running state to the standby clusters should therefore be stopped immediately, so that the application copies in the standby clusters are not affected by the failure of the application.

Further, stopping the backup of the application by the disaster recovery main system in step S401 can be implemented in several possible ways.

In the first possible implementation, an alarm system is set up in each cluster and detects, in real time, the running states of all applications deployed in the cluster. When the alarm system in a cluster detects that an application in the cluster has failed and cannot respond normally to external traffic, it immediately informs the cluster administrator of that cluster of the unique identification information of the failed application by short message, mail, telephone, or a similar communication channel, and the cluster administrator handles the failure manually.

Specifically, after knowing the application that has failed, the cluster administrator may modify the content of the application management resource object corresponding to the application by accessing the configuration manager in the disaster recovery backup main system corresponding to the application, so that the disaster recovery backup main system corresponding to the application stops performing real-time or periodic backup on the application.

In the second possible implementation, a cluster controller is arranged in each disaster recovery system. When a disaster recovery system becomes the disaster recovery main system corresponding to the designated application to be backed up, its cluster controller immediately starts detecting, in real time, the state of that application in the main cluster. When the application is detected to have failed and to be unable to respond normally to external traffic, the cluster controller instructs the disaster recovery main system corresponding to the application to stop its real-time or periodic backup of the application, without requiring a cluster administrator to stop the backup manually.

Specifically, the content of the cluster management resource object in the disaster recovery main system is modified, so that the disaster recovery main system loses the operation authority of the application, and the disaster recovery main system cannot perform real-time or periodic backup on the application.

In the third possible implementation, a cluster controller is arranged in each disaster recovery system and an alarm system is arranged in each cluster. The cluster controller detects the designated application to be backed up in real time, and the alarm system detects the running states of all applications deployed in the cluster in real time. When the designated application to be backed up is detected to have failed and to be unable to respond normally to external traffic, the cluster controller automatically modifies the content of the cluster management resource object in the disaster recovery main system so that the disaster recovery main system stops backing up the application. At the same time, the alarm system informs the cluster administrator of the cluster of the unique identification information of the failed application, so that the cluster administrator can deal with the failed application as soon as possible.
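The automatic variants above can be sketched as a cluster controller that probes the application and, on failure, revokes the main system's operation authority by editing a stand-in for the cluster management resource object. All class names, the dictionary layout, and the `probe` and `notify` callbacks are assumptions for illustration only.

```python
class DisasterRecoveryMainSystem:
    def __init__(self, app):
        self.app = app
        # Hypothetical stand-in for the cluster management resource object.
        self.cluster_management_resource = {"operation_authority": {app: True}}
        self.backups_done = 0

    def backup_once(self):
        """Back up the application only while operation authority is held."""
        authority = self.cluster_management_resource["operation_authority"]
        if not authority.get(self.app, False):
            return False
        self.backups_done += 1
        return True


class ClusterController:
    def __init__(self, main_system, probe, notify=None):
        self.main_system = main_system
        self.probe = probe    # callable returning True while the application is healthy
        self.notify = notify  # optional alarm callback (third implementation)

    def check(self):
        """Run one detection round; revoke authority if the application has failed."""
        app = self.main_system.app
        if self.probe(app):
            return True
        self.main_system.cluster_management_resource["operation_authority"][app] = False
        if self.notify is not None:
            self.notify(app)
        return False
```

Passing `notify=None` corresponds to the second implementation (controller only); supplying a callback approximates the third, in which the administrator is also alerted.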

Further, after the disaster recovery main system stops backing up the application in step S401, the method for processing application faults under multiple clusters further includes: the disaster recovery main system becomes a disaster recovery secondary system and notifies the disaster recovery systems of all clusters in the second application cluster group; and the disaster recovery systems of all clusters in the second application cluster group mark the first cluster as being in a maintenance state. That is, when the designated application to be backed up fails, the disaster recovery main system, while stopping the backup of the application, automatically converts itself into a disaster recovery secondary system corresponding to that application. The whole application backup system built around the designated application to be backed up is then left without a disaster recovery main system, so a new disaster recovery main system needs to be determined as soon as possible.
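As a non-limiting illustration, the handling in step S401 can be sketched as follows. The class name, its fields and the application name "shop" are hypothetical and are not taken from the embodiment; in particular, the set `operable_apps` is only a stand-in for the content of the cluster management resource object.

```python
from dataclasses import dataclass, field


@dataclass
class DisasterRecoverySystem:
    """One disaster recovery system per cluster (all names are hypothetical)."""
    cluster: str
    # Stand-in for the cluster management resource object:
    # the applications this system still has operation authority over.
    operable_apps: set = field(default_factory=set)
    # Role of this system per application: "main" or "secondary".
    roles: dict = field(default_factory=dict)
    # Per application, the clusters that peers reported as being in maintenance.
    maintenance: dict = field(default_factory=dict)

    def backup(self, app: str) -> bool:
        """A backup only happens while this system is main and has authority."""
        return self.roles.get(app) == "main" and app in self.operable_apps

    def on_app_failure(self, app: str, peers: list) -> None:
        """Stop backing up, become a secondary system, notify the second group."""
        self.operable_apps.discard(app)  # lose operation authority of the app
        self.roles[app] = "secondary"    # main system becomes a secondary system
        for peer in peers:
            peer.mark_maintenance(app, self.cluster)

    def mark_maintenance(self, app: str, cluster: str) -> None:
        self.maintenance.setdefault(app, set()).add(cluster)


first = DisasterRecoverySystem("cluster-1", {"shop"}, {"shop": "main"})
second = DisasterRecoverySystem("cluster-2", roles={"shop": "secondary"})
third = DisasterRecoverySystem("cluster-3", roles={"shop": "secondary"})

backed_up_before = first.backup("shop")        # True
first.on_app_failure("shop", [second, third])
backed_up_after = first.backup("shop")         # False
```

The maintenance mark is kept per application, so a failure of one application in the first cluster does not affect the roles recorded for other applications.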

Step S402, determining a new disaster recovery main system in the disaster recovery systems of all the clusters in the second application cluster group.

The new disaster recovery main system is deployed in the second cluster, the application copy corresponding to the application in the second cluster becomes a new application, and the new disaster recovery main system is used for backing up the new application to other clusters except the second cluster in the second application cluster group.

In the embodiment of the application, when an application to be backed up fails, the disaster recovery secondary systems deployed in all corresponding standby clusters are notified and determine a new disaster recovery main system by themselves, and the standby cluster where the new disaster recovery main system is located is converted into a new main cluster. An application copy corresponding to the application to be backed up is deployed in the new main cluster, and the running state of the application copy is always kept consistent with that of the designated application to be backed up. When the standby cluster where the application copy is located becomes the new main cluster, the application copy is correspondingly converted into the new designated application to be backed up (referred to simply as the "new application"), and the new disaster recovery main system synchronously backs up the running state of the new application to the other standby clusters.

It should be noted that the standby cluster where the new disaster recovery main system is located is one of the original standby clusters. If there was originally only one standby cluster, the disaster recovery secondary system in that standby cluster is directly converted into the disaster recovery main system. If there were originally two or more standby clusters, the disaster recovery secondary systems in the standby clusters determine the new disaster recovery main system by themselves; the standby cluster where the new disaster recovery main system is located becomes the new main cluster, and the remaining original standby clusters become the new standby clusters (i.e., the standby clusters corresponding to the new main cluster). After the new disaster recovery main system is determined, it synchronizes the running state of the new application in the new main cluster to the new standby clusters, so that automatic switching between the main cluster and the standby clusters is realized and the long-term effectiveness of the application synchronization mechanism is ensured.

Furthermore, two or more disaster recovery secondary systems can determine the new disaster recovery main system through an election mechanism. Specifically, the election can be performed according to a preset election priority: a priority is set for each cluster in advance, and the disaster recovery secondary system in the cluster with the highest priority is elected as the new disaster recovery main system. Alternatively, the election can be performed by scoring cluster performance: a monitoring component deployed in each standby cluster collects the usage of the cluster's hardware resources, the disaster recovery system scores the performance of the cluster according to the collected usage, and the disaster recovery secondary system in the cluster with the highest performance score is elected as the new disaster recovery main system. Alternatively, the election can be performed by weighted scoring: a weight value is set for each cluster in advance, the performance score obtained from the collected hardware resource usage is multiplied by the weight value, and the disaster recovery secondary system in the cluster with the highest weighted performance score is elected as the new disaster recovery main system.
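As a non-limiting illustration, the three election strategies can be sketched as follows. The embodiment does not specify how a performance score is derived from hardware resource usage, so the sketch assumes the score is already available as a number; the names `Candidate`, `elect_by_priority`, `elect_by_performance` and `elect_by_weighted_score`, and the assumption that a larger priority value means a higher priority, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    cluster: str
    priority: int       # preset election priority; larger wins (assumption)
    performance: float  # score already derived from collected hardware usage
    weight: float       # preset weight, used only by the weighted strategy


def elect_by_priority(candidates):
    """Elect the cluster with the highest preset priority."""
    return max(candidates, key=lambda c: c.priority).cluster


def elect_by_performance(candidates):
    """Elect the cluster with the highest performance score."""
    return max(candidates, key=lambda c: c.performance).cluster


def elect_by_weighted_score(candidates):
    """Elect the cluster with the highest (performance score x weight)."""
    return max(candidates, key=lambda c: c.performance * c.weight).cluster


standby = [
    Candidate("cluster-2", priority=3, performance=60.0, weight=1.0),
    Candidate("cluster-3", priority=2, performance=90.0, weight=0.5),
    Candidate("cluster-4", priority=1, performance=70.0, weight=1.0),
]

by_priority = elect_by_priority(standby)        # "cluster-2"
by_performance = elect_by_performance(standby)  # "cluster-3"
by_weighted = elect_by_weighted_score(standby)  # "cluster-4" (60, 45, 70)
```

With `max`, a tie is resolved in favor of the first candidate in the list; a real deployment would need a tie-breaking rule on which all disaster recovery secondary systems agree.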

Here, it should be noted that the disaster recovery main system, the disaster recovery secondary system, the main cluster and the standby cluster are all defined relative to a specific application. That is, one cluster can simultaneously serve as the main cluster of a first application and as a standby cluster of a second application different from the first application; accordingly, the disaster recovery system running in that cluster simultaneously serves as the disaster recovery main system of the first application and as a disaster recovery secondary system of the second application. It can be understood that the cluster where an application to be backed up is located is the main cluster corresponding to the application, and the disaster recovery system deployed in the main cluster is the disaster recovery main system corresponding to the application; a cluster where an application copy corresponding to the application is located is a standby cluster corresponding to the application, and the disaster recovery system deployed in that standby cluster is a disaster recovery secondary system corresponding to the application. The main cluster, the standby clusters, the disaster recovery main system and the disaster recovery secondary systems change as the application to be backed up changes. After the application to be backed up in the current main cluster fails, the application copy in one of the standby clusters corresponding to the application becomes the new application, that cluster becomes the new main cluster, and its disaster recovery system becomes the new disaster recovery main system.
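As a non-limiting illustration, the per-application nature of these roles can be sketched with a placement table keyed by application. The table layout and the functions `role_of` and `promote` are hypothetical and serve only to show that the same cluster can hold different roles for different applications.

```python
def role_of(placements, cluster, app):
    """Role of the disaster recovery system in `cluster` with respect to `app`."""
    placement = placements[app]
    if cluster == placement["main"]:
        return "main"
    if cluster in placement["standby"]:
        return "secondary"
    return None  # the cluster currently plays no role for this application


def promote(placement, new_main):
    """Return the placement after a standby cluster becomes the new main cluster.

    The failed former main cluster is left out of the result, because it
    stays in the maintenance state until the application recovers.
    """
    if new_main not in placement["standby"]:
        raise ValueError("the new main cluster must be an existing standby cluster")
    return {"main": new_main, "standby": placement["standby"] - {new_main}}


placements = {
    "app-1": {"main": "cluster-a", "standby": {"cluster-b", "cluster-c"}},
    "app-2": {"main": "cluster-b", "standby": {"cluster-a"}},
}

# cluster-a is main for app-1 and, at the same time, secondary for app-2.
roles_of_a = (role_of(placements, "cluster-a", "app-1"),
              role_of(placements, "cluster-a", "app-2"))

# app-1 fails in cluster-a and cluster-b is determined as the new main cluster.
after = dict(placements)
after["app-1"] = promote(placements["app-1"], "cluster-b")
```

Only the entry for the failed application changes; the entry for "app-2" is untouched by the switch.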

According to the above description, the real-time detection and communication channels are established between the disaster recovery systems which are deployed in a distributed manner in the cluster through the connection authentication information, so that the interconnection between the distributed disaster recovery systems is realized. When the disaster recovery main system stops synchronizing the designated application to be backed up, the corresponding disaster recovery auxiliary system is informed to replace the original disaster recovery main system to become a new disaster recovery main system, and correspondingly, the cluster (second cluster) where the new disaster recovery main system is located replaces the original main cluster (first cluster) to become a new main cluster for providing service to the outside, and the application copy in the new main cluster is converted into a new application.

It can be understood that, if the main cluster has a plurality of standby clusters, when the first cluster enters the maintenance state corresponding to a certain application, the disaster recovery main system running in the first cluster notifies the disaster recovery secondary systems in all the standby clusters corresponding to the application that the application deployed in the main cluster has failed. The disaster recovery secondary systems in all the standby clusters then elect a new disaster recovery main system by themselves and mark the first cluster as being in the maintenance state corresponding to the application. While the first cluster is in the maintenance state corresponding to the application, the disaster recovery system deployed in the first cluster cannot be selected as the disaster recovery main system corresponding to the application, and the first cluster cannot be selected as the main cluster corresponding to the application. That is, the first cluster is excluded from the candidate main clusters for the application until it exits the maintenance state corresponding to the application. As described above, the disaster recovery main system, the disaster recovery secondary system, the main cluster and the standby cluster in the present application are all defined relative to a specific application. Therefore, even if application A to be backed up in the first cluster fails and the first cluster is marked as being in the maintenance state corresponding to application A, the disaster recovery system in the first cluster can still be selected as the disaster recovery main system corresponding to application B.
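As a non-limiting illustration, the per-application exclusion can be sketched as a filter over the cluster group. The function name `eligible_main_clusters` and the dictionary layout of the maintenance marks are hypothetical.

```python
def eligible_main_clusters(app, cluster_group, maintenance):
    """Clusters whose disaster recovery system may become main system for `app`.

    `maintenance` maps an application name to the set of clusters marked as
    being in the maintenance state for that application.
    """
    blocked = maintenance.get(app, set())
    return [cluster for cluster in cluster_group if cluster not in blocked]


group = ["cluster-1", "cluster-2", "cluster-3"]
# Application A failed in cluster-1; nothing is marked for application B.
maintenance = {"app-A": {"cluster-1"}}

candidates_for_a = eligible_main_clusters("app-A", group, maintenance)
candidates_for_b = eligible_main_clusters("app-B", group, maintenance)
# candidates_for_a == ["cluster-2", "cluster-3"]
# candidates_for_b == ["cluster-1", "cluster-2", "cluster-3"]
```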

When the designated application to be backed up in the first cluster fails, the disaster recovery secondary systems produce a new disaster recovery main system through election, and the cluster where the new disaster recovery main system is located becomes the new main cluster. The other clusters among the original standby clusters continue to serve as standby clusters of the new disaster recovery main system, in which application copies corresponding to the new application are deployed, while the new disaster recovery main system synchronously backs up the data of the new application to these standby clusters. In other words, the determination of the main cluster depends on the selection of the disaster recovery main system: the cluster where the disaster recovery main system is located is the main cluster. Therefore, when the disaster recovery secondary systems elect a new disaster recovery main system by themselves, the main cluster is selected indirectly.

It can be understood that, since the application copy in each original standby cluster needs to stay synchronized with the application in the original main cluster, the application copies in the original standby clusters also remain synchronized with one another. Therefore, after the original main cluster is converted into a standby cluster and a certain standby cluster becomes the new main cluster, the remaining standby clusters can continue to serve as standby clusters of the new main cluster, receive the running state sent by the disaster recovery system in the new main cluster, and stay synchronized with the new application, so that seamless switching of running-state synchronization is realized.

FIG. 5 is a logic diagram of external access traffic switching upon application failure, provided in accordance with some embodiments of the present application. As shown in FIG. 5, in order to ensure high availability of the deployed application, once the designated application to be backed up in the first cluster fails, in addition to adjusting the application backup system, it is also necessary to switch the external access traffic directed to that application; specifically, the access traffic of the application needs to be switched immediately to the new application (i.e., the application copy corresponding to the application) in the new main cluster. In response to the application failing, the disaster recovery main system instructs the load balancing component to stop forwarding the external access traffic of the application to the first cluster. A cluster controller can be configured in the disaster recovery system of each cluster to monitor the running state of the designated application to be backed up; when the cluster controller in the first cluster detects that the designated application to be backed up has failed, the external access traffic of that application forwarded to the first cluster by the load balancing component is cut off immediately and automatically. After the new disaster recovery main system is determined, the new disaster recovery main system instructs the load balancing component to forward the external access traffic of the application to the second cluster. That is, after the new main cluster is determined, the cluster controller in the new main cluster instructs the load balancing component to forward the external access traffic of the application to the new application in the new main cluster.
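As a non-limiting illustration, the two instructions sent to the load balancing component can be sketched with a toy routing table. The class `LoadBalancer` and its methods are hypothetical stand-ins; the embodiment does not define the interface of the load balancing component.

```python
class LoadBalancer:
    """Toy stand-in for the load balancing component (hypothetical interface)."""

    def __init__(self):
        # application name -> cluster currently receiving its external traffic
        self._routes = {}

    def forward(self, app, cluster):
        """Forward the external access traffic of `app` to `cluster`."""
        self._routes[app] = cluster

    def stop_forwarding(self, app, cluster):
        """Cut the route only if it still points at the cluster that failed."""
        if self._routes.get(app) == cluster:
            del self._routes[app]

    def target(self, app):
        return self._routes.get(app)


lb = LoadBalancer()
lb.forward("shop", "cluster-1")
lb.forward("blog", "cluster-1")

# "shop" fails in the first cluster: the disaster recovery main system
# instructs the load balancing component to stop forwarding its traffic.
lb.stop_forwarding("shop", "cluster-1")
target_during_switch = lb.target("shop")  # None

# After the new disaster recovery main system is determined in the second
# cluster, it instructs the component to forward the traffic there.
lb.forward("shop", "cluster-2")
```

Routes are keyed by application, so the traffic of "blog", which is still healthy in the first cluster, is not affected by the switch.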

Therefore, by synchronizing the designated application to be backed up in the main cluster to the designated standby clusters in real time or periodically, when the designated application in the main cluster fails, the access traffic is seamlessly switched to the new application (namely, the application copy corresponding to the original application) in a standby cluster, and the new application replaces the application in the first cluster in responding to the access traffic. Automatic switching of the external access traffic between the application in the first cluster and the new application is thereby realized, the time from the occurrence of the failure to recovery is almost zero, and high availability of the application is achieved.

In addition, in the embodiment of the application, the running state of the designated application to be backed up is monitored, and when the application is detected to have failed, only the application backup system corresponding to the application and the external access traffic directed to the application are switched, without affecting other applications on the same cluster. Because a single application serves as the unit of fault processing, the granularity of the failover mechanism and the traffic switching mechanism is effectively refined.

After the application in the first cluster returns to normal, the cluster administrator can make the first cluster exit the maintenance state corresponding to the application by modifying the cluster management resource object again; alternatively, the cluster controller can make the first cluster automatically exit the maintenance state corresponding to the application after detecting that the application in the first cluster has returned to normal. After exiting the maintenance state corresponding to the application, the first cluster automatically becomes a standby cluster of the new main cluster, and the new disaster recovery main system synchronizes the running state of the new application to the first cluster.

Based on the foregoing, when an application fails, the disaster recovery main system in the first cluster becomes a disaster recovery secondary system, and the disaster recovery systems of all clusters in the second application cluster group mark the first cluster as being in the maintenance state corresponding to the application. In response to the application returning to normal, the disaster recovery secondary system notifies the disaster recovery systems of all clusters in the second application cluster group, the disaster recovery systems of all clusters in the second application cluster group remove the first cluster's maintenance state mark corresponding to the application, and the new disaster recovery main system backs up the new application to the first cluster. That is to say, in this embodiment, the main/standby relationship among the multiple clusters is relative: once an application in the main cluster fails, one of the standby clusters corresponding to the application becomes the new main cluster, and after the failed application recovers, the original main cluster automatically becomes a standby cluster of the new main cluster. Therefore, the main cluster and the standby clusters back each other up, and cross disaster recovery among the clusters is realized.
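As a non-limiting illustration, the effect of the recovery on the backup targets of the new disaster recovery main system can be sketched as follows. The functions `backup_targets` and `on_app_recovered` and the data layout are hypothetical.

```python
def backup_targets(app, first_group, main_cluster, maintenance):
    """Clusters the current main system backs the application up to.

    The main cluster itself and every cluster marked as being in the
    maintenance state for `app` are skipped.
    """
    blocked = maintenance.get(app, set())
    return [c for c in first_group if c != main_cluster and c not in blocked]


def on_app_recovered(app, cluster, maintenance):
    """Remove the maintenance state mark of `cluster` for `app`, if present."""
    maintenance.get(app, set()).discard(cluster)


first_group = ["cluster-1", "cluster-2", "cluster-3"]
# "shop" failed in cluster-1 and cluster-2 became the new main cluster.
maintenance = {"shop": {"cluster-1"}}

targets_before = backup_targets("shop", first_group, "cluster-2", maintenance)
on_app_recovered("shop", "cluster-1", maintenance)
targets_after = backup_targets("shop", first_group, "cluster-2", maintenance)
# targets_before == ["cluster-3"]
# targets_after == ["cluster-1", "cluster-3"]
```

After the mark is removed, the first cluster is treated like any other standby cluster of the new main cluster.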

Exemplary System

FIG. 6 is a schematic diagram of a system for processing application faults under multiple clusters according to some embodiments of the present application. As shown in FIG. 6, in the system, an application is deployed in a first cluster, the first cluster belongs to a first application cluster group, and the clusters other than the first cluster in the first application cluster group form a second application cluster group; a disaster recovery system is deployed in each cluster in the first application cluster group, and the disaster recovery system of the first cluster serves as the disaster recovery main system of the application and is used for backing up the application to all clusters in the second application cluster group. The system for processing application faults under multiple clusters comprises: a stopping unit 601 configured to stop the disaster recovery main system from backing up the application in response to the application failing; and a backup unit 602 configured to determine a new disaster recovery main system among the disaster recovery systems of all clusters in the second application cluster group, wherein the new disaster recovery main system is deployed in a second cluster, the application copy corresponding to the application in the second cluster becomes a new application, and the new disaster recovery main system is used for backing up the new application to the clusters other than the second cluster in the second application cluster group.

The system for processing application faults under multiple clusters provided in the embodiment of the present application can implement the steps and flows of any one of the embodiments of the method for processing application faults under multiple clusters, and achieve the same technical effects, which are not described in detail herein.

Exemplary device

FIG. 7 is a schematic structural diagram of an electronic device provided in accordance with some embodiments of the present application; as shown in fig. 7, the electronic apparatus includes:

one or more processors 701;

a computer-readable medium, which may be configured to store one or more programs 702 that, when executed by the one or more processors 701, perform the following steps: in response to the application failing, the disaster recovery main system stops backing up the application; and a new disaster recovery main system is determined among the disaster recovery systems of all clusters in the second application cluster group, wherein the new disaster recovery main system is deployed in a second cluster, the application copy corresponding to the application in the second cluster becomes a new application, and the new disaster recovery main system is used for backing up the new application to the clusters other than the second cluster in the second application cluster group. Here, the application is deployed in a first cluster, the first cluster belongs to a first application cluster group, and the clusters other than the first cluster in the first application cluster group form the second application cluster group; a disaster recovery system is deployed in each cluster in the first application cluster group, and the disaster recovery system of the first cluster serves as the disaster recovery main system of the application and is used for backing up the application to all clusters in the second application cluster group.

FIG. 8 is a hardware architecture of an electronic device provided in accordance with some embodiments of the present application; as shown in fig. 8, the hardware structure of the electronic device may include: a processor 801, a communications interface 802, a computer-readable medium 803, and a communications bus 804.

The processor 801, the communication interface 802, and the computer-readable medium 803 communicate with each other via the communication bus 804.

Optionally, the communication interface 802 may be an interface of a communication module, such as an interface of a GSM module.

The processor 801 may be specifically configured to perform the following: in response to the application failing, the disaster recovery main system stops backing up the application; and a new disaster recovery main system is determined among the disaster recovery systems of all clusters in the second application cluster group, wherein the new disaster recovery main system is deployed in a second cluster, the application copy corresponding to the application in the second cluster becomes a new application, and the new disaster recovery main system is used for backing up the new application to the clusters other than the second cluster in the second application cluster group. Here, the application is deployed in a first cluster, the first cluster belongs to a first application cluster group, and the clusters other than the first cluster in the first application cluster group form the second application cluster group; a disaster recovery system is deployed in each cluster in the first application cluster group, and the disaster recovery system of the first cluster serves as the disaster recovery main system of the application and is used for backing up the application to all clusters in the second application cluster group.

The processor 801 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like, and may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include: smart phones (e.g., iPhone), multimedia phones, feature phones, and low-end phones, among others.

(2) An ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. Such terminals include: PDA, MID, and UMPC devices, such as the iPad.

(3) A portable entertainment device: such devices can display and play multimedia content. Such devices include: audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.

(4) A server: a device that provides computing services, comprising a processor, a hard disk, a memory, a system bus, and the like. The architecture of a server is similar to that of a general-purpose computer, but because highly reliable services need to be provided, the requirements on processing capacity, stability, reliability, security, scalability, manageability, and the like are higher.

(5) Other electronic devices with data interaction functions.

It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present application may be divided into more components/steps, or two or more components/steps or partial operations of the components/steps may be combined into a new component/step to achieve the purpose of the embodiment of the present application.

The above-described methods according to embodiments of the present application may be implemented in hardware or firmware, or as software or computer code storable in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded through a network, and stored in a local recording medium, so that the methods described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It is understood that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the method for processing application faults under multiple clusters described herein. Further, when a general-purpose computer accesses code for implementing the methods illustrated herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the methods illustrated herein.

Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the particular application of the solution and the constraints involved. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.

It should be noted that, in the present specification, each embodiment is described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points.

The above-described embodiments of the apparatus and system are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method for processing application faults under multiple clusters, characterized in that an application is deployed in a first cluster, the first cluster belongs to a first application cluster group, and the clusters other than the first cluster in the first application cluster group form a second application cluster group; a disaster recovery system is deployed in each cluster in the first application cluster group, and the disaster recovery system of the first cluster serves as the disaster recovery main system of the application and is used for backing up the application to all clusters in the second application cluster group; the method for processing application faults under multiple clusters comprises the following steps:

in response to the application failing, the disaster recovery main system stops backing up the application;

determining a new disaster recovery main system in the disaster recovery systems of all clusters in the second application cluster group; the new disaster recovery main system is deployed in a second cluster, an application copy corresponding to the application in the second cluster becomes a new application, and the new disaster recovery main system is configured to backup the new application to other clusters except the second cluster in the second application cluster group.

2. The method for handling application failures under multiple clusters according to claim 1,

the disaster recovery system of the first cluster becomes a disaster recovery main system of the application based on the received application management resource object, and determines all clusters in the first application cluster group according to the content of the application management resource object.

3. The method for handling application failure under multiple clusters according to claim 2, after determining all clusters in the first application cluster group according to the content of the application management resource object, further comprising:

the disaster recovery main system of the application synchronizes the content of the application management resource object to the disaster recovery systems of all clusters in the second application cluster group;

and the disaster recovery system of each cluster in the second application cluster group acquires all clusters in the first application cluster group according to the received content of the application management resource object.

4. The method for processing application failures in multiple clusters according to claim 1, wherein the step of stopping backup of the application by the disaster recovery main system comprises:

and modifying the content of the cluster management resource object in the disaster recovery main system so that the disaster recovery main system loses the operation authority of the application.

5. The method for processing application failures in multiple clusters according to claim 1, further comprising, after the disaster recovery main system stops backing up the application:

the disaster recovery main system becomes a disaster recovery secondary system and notifies the disaster recovery systems of all clusters in the second application cluster group;

the disaster recovery systems of all clusters in the second application cluster group mark the first cluster as a maintenance state.

6. The method for processing application failures under multiple clusters according to claim 5, further comprising:

in response to the application returning to normal, the disaster recovery secondary system notifies the disaster recovery systems of all clusters in the second application cluster group;

the disaster recovery systems of all the clusters in the second application cluster group remove the maintenance state marks of the first cluster;

and the new disaster recovery main system backs up the new application to the first cluster.

7. The method for processing application failures in multiple clusters according to claim 1, further comprising:

in response to the application failing, the disaster recovery main system instructs a load balancing component to stop forwarding the external access traffic of the application to the first cluster;

after determining a new disaster recovery main system, the new disaster recovery main system instructs the load balancing component to forward the external access traffic of the application to the second cluster.

8. A system for processing application faults under multiple clusters, characterized in that an application is deployed in a first cluster, the first cluster belongs to a first application cluster group, and the clusters other than the first cluster in the first application cluster group form a second application cluster group; a disaster recovery system is deployed in each cluster in the first application cluster group, and the disaster recovery system of the first cluster serves as the disaster recovery main system of the application and is used for backing up the application to all clusters in the second application cluster group; the system for processing application faults under multiple clusters comprises:

a stopping unit configured to stop the disaster recovery main system from backing up the application in response to the application failing;

the backup unit is configured to determine a new disaster recovery main system in the disaster recovery systems of all the clusters in the second application cluster group; the new disaster recovery main system is deployed in a second cluster, an application copy corresponding to the application in the second cluster becomes a new application, and the new disaster recovery main system is used for backing up the new application to other clusters except the second cluster in the second application cluster group.

9. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed, implements the method for handling application failures in multiple clusters according to any one of claims 1 to 7.

10. An electronic device, comprising: a memory, a processor, and a program stored in the memory and executable on the processor, the processor implementing the method for handling application failures in multiple clusters according to any one of claims 1 to 7 when executing the program.