CN111309515A - Disaster recovery control method, device and system - Google Patents

Disaster recovery control method, device and system

Info

Publication number
CN111309515A
Authority
CN
China
Prior art keywords
instance
service
site
determining
fault
Prior art date
Legal status
Granted
Application number
CN201811513686.9A
Other languages
Chinese (zh)
Other versions
CN111309515B (en)
Inventor
赵洪锟
钱义勇
岳晓明
王晓伟
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201811513686.9A
Publication of CN111309515A
Application granted
Publication of CN111309515B
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 - Error detection or correction of the data by redundancy in operation
    • G06F11/1479 - Generic software techniques for error detection or fault masking
    • G06F11/1482 - Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G06F11/1484 - Generic software techniques for error detection or fault masking by means of middleware or OS functionality involving virtual machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 - Arrangements for executing specific programs
    • G06F9/455 - Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 - Hypervisors; Virtual machine monitors
    • G06F9/45558 - Hypervisor-specific management and integration aspects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 - Arrangements for executing specific programs
    • G06F9/455 - Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 - Hypervisors; Virtual machine monitors
    • G06F9/45558 - Hypervisor-specific management and integration aspects
    • G06F2009/45591 - Monitoring or debugging support

Abstract

A disaster recovery control method, device and system are provided to solve the problem that when all instances of the same service in a primary site fail, neither the primary site nor the standby site can continue providing the service to customers. The method includes the following steps: for a first service provided by a primary site, determining the working state of each instance of the first service in a plurality of virtual machines in the primary site; when the number of instances of the first service in the primary site whose working state is faulty satisfies a failure policy, determining a first decision result, where the first decision result instructs a standby site to take over the services of the primary site; and sending the first decision result to the standby site.

Description

Disaster recovery control method, device and system
Technical Field
The present application relates to the field of computer technologies, and in particular, to a disaster recovery control method, device, and system.
Background
With the development of cloud computing, software systems are evolving toward service-oriented architectures, and a customer's large-scale software system is often composed of a plurality of services or components, such as a database, message middleware, and business applications. In a data center, private cloud, or public cloud environment, a failure of an instance deployed for a service in a site can cause a service outage, and such an outage can bring huge economic loss and reputation damage to customers.
A currently common site disaster recovery scheme is to establish two peer sites, where one site serves as a primary site that provides services to customers, and the other serves as a standby site that provides backup capability for the primary site. The standby site monitors whether the primary site fails by monitoring heartbeat messages exchanged with the primary site. When the primary site is damaged by a disaster such as an earthquake, a fire, or a network link failure, the heartbeat messages between the primary site and the standby site are interrupted, so that the standby site, upon detecting the interruption, can take over the services of the primary site and continue to provide services for customers.
However, when all instances deployed for a certain service in the primary site fail and the service is interrupted, the heartbeat messages between the primary site and the standby site remain normal, so neither the primary site nor the standby site can provide that service for the customer.
Disclosure of Invention
The present application provides a disaster recovery control method, device and system, to solve the problem that when all instances of the same service in a primary site fail, neither the primary site nor the standby site can continue to provide the service for a customer.
In a first aspect, the present application provides a disaster recovery control method, including: for a first service provided by a primary site, determining the working state of each instance of the first service in a plurality of virtual machines in the primary site; when the number of instances whose working state is faulty among all instances of the first service in the primary site satisfies a failure policy, determining a first decision result, where the first decision result instructs a standby site to take over the services of the primary site; and sending the first decision result to the standby site. In this embodiment of the present application, the primary site may monitor the working state of each instance of the first service in the plurality of virtual machines in the primary site, and then decide, according to those working states, whether to switch the service to the standby site. For example, when the primary site detects that all instances of a certain service have failed, it instructs the standby site to take over the services of the primary site, so that the standby site can continue to provide services for the customer, thereby reducing the customer's economic loss and reputation damage.
In one possible design, the failure policy is tailored for the first service.
In one possible design, when determining the working state of each instance of the first service in the plurality of virtual machines in the primary site, a failure start time may be determined for each such instance, and a fault duration is then determined according to the failure start time. If the fault duration is greater than a fault time threshold, the working state of the instance is determined to be faulty; if the fault duration is less than or equal to the fault time threshold, the working state of the instance is determined to be not faulty. In this design, whether an instance has failed can be accurately judged from its failure duration, which improves the accuracy of the decision result.
In one possible design, when the failure start time of the instance is determined, the application health condition of the instance may be received and recorded. If the received application health condition of the instance is abnormal, and the application health condition of the instance recorded last time is normal or there is no record of the application health condition of the instance, the failure start time of the instance is determined to be the current time. When the application health condition of the instance jumps from normal to abnormal, the jump moment can be taken as the failure start time of the instance, so this design obtains a more accurate failure start time and improves the accuracy of the decision result.
In a possible design, when determining the failure start time of the instance, if the application health condition of the instance reported by a first virtual machine is not received, it may be determined that the failure start time of the instance is the current time, and the first virtual machine is a virtual machine in which the instance is deployed in the primary site. If the application health condition of the instance reported by the first virtual machine is not received, the first virtual machine can be considered to have a fault, and therefore the instance on the first virtual machine can be considered to have a fault, and therefore the fault starting time of the instance can be determined in time through the design, and the accuracy of the decision result can be improved.
In one possible design, when determining the working state of each instance of the first service in the plurality of virtual machines in the primary site, the time at which the application health condition of the instance was last received may also be determined for each such instance, and an interruption time is determined from that time. If the interruption time is greater than a fault time threshold, the working state of the instance is determined to be faulty; if the interruption time is less than or equal to the fault time threshold, the working state of the instance is determined to be not faulty. If the application health condition of an instance reported by the first virtual machine has not been received for a long time, the first virtual machine can be considered to have failed, and therefore the instance on it can be considered to have failed; the interruption time can thus be regarded as the minimum duration for which the instance has been faulty, so the working state of the instance can be determined from the interruption time, improving the accuracy of the decision result.
In one possible design, the primary site may determine the time to failure threshold by parsing the failure policy.
In a possible design, after the working state of each instance of the first service in the plurality of virtual machines in the primary site is determined, if the number of instances whose working state is faulty among all instances of the first service in the primary site does not satisfy the failure policy, a second decision result may be determined, where the second decision result does not instruct the standby site to take over the services of the primary site. In this design, if the instances of the first service do not satisfy the failure policy, it can be determined that the standby site is not instructed to take over, so the primary site continues to provide services for the customer.
In a possible design, when the first decision result is sent to the standby site, it may be sent through an arbitration service. In this design, the standby site can accurately obtain the decision result of the primary site even when the heartbeat network between the two sites is interrupted, so it can take over the services of the primary site in time, reducing the risk of service interruption.
In a possible design, sending the first decision result to the standby site through the arbitration service may be implemented by writing the first decision result into an instance of the arbitration service at the primary site, where instances of the arbitration service may be deployed in the primary site, the standby site, and the arbitration site. In this design, the arbitration service instances in the primary site, the standby site, and the arbitration site form a cluster, so the first decision result written into the primary site's arbitration service instance is shared within the cluster, and the standby site can obtain it by querying the cluster.
In a second aspect, the present application provides a primary site, comprising: the system comprises a plurality of virtual machines, and a working state unit, a decision unit and a sending unit which are deployed in a first virtual machine. The working state unit is used for determining the working state of each instance of the first service in a plurality of virtual machines in a main site aiming at the first service provided by the main site. The decision unit is configured to determine a first decision result when the number of instances in which the working state of the first service in all instances in the plurality of virtual machines in the primary site is a fault satisfies a fault policy, where the first decision result indicates that the backup site takes over the service of the primary site. The sending unit is used for sending the first decision result to the standby station.
In one possible design, the operating state unit may be specifically configured to: determining, for each instance of the first service in a plurality of virtual machines in the primary site, a failure start time for the instance; determining the fault duration according to the fault starting time; if the fault duration is greater than the fault time threshold, determining that the working state of the instance is a fault; and if the fault duration is less than or equal to the fault time threshold, determining that the working state of the instance is not faulted.
In one possible design, the operating state unit, when determining the failure start time of the instance, may be specifically configured to: receiving and recording the application health condition of the instance; and if the received application health condition of the instance is abnormal and the application health condition of the instance recorded last time is normal or no record of the application health condition of the instance exists, determining the fault starting time of the instance to be the current time.
In one possible design, the operating state unit, when determining the failure start time of the instance, may be specifically configured to: if the application health condition of the instance reported by the first virtual machine is not received, determining that the failure starting time of the instance is the current time, and the first virtual machine is a virtual machine in the main site for deploying the instance.
In one possible design, the operating state unit may be specifically configured to: determining, for each instance of the first service in a plurality of virtual machines in the primary site, a time of last receipt of an application health of the instance; determining an interruption time according to the time of the last receiving of the application health condition of the instance; if the interruption time is greater than a fault time threshold value, determining that the working state of the instance is a fault; and if the interruption time is less than or equal to a fault time threshold value, determining that the working state of the instance is not faulted.
In a possible design, the decision unit may be further configured to: and when the number of the instances with the working states of faults in all the instances of the first service does not meet the fault strategy, determining a second decision result, wherein the second decision result does not indicate the standby site to take over the service of the main site.
In one possible design, the sending unit may be specifically configured to: and sending the first decision result to the standby station through arbitration service.
In one possible design, the first virtual machine is one of the plurality of virtual machines, or the first virtual machine is not one of the plurality of virtual machines.
In a third aspect, the present application provides a primary site running a plurality of virtual machines, the plurality of virtual machines including instances of a first service. The primary site includes a disaster recovery service module. The disaster recovery service module is configured to: determine the working state of each instance of the first service in the plurality of virtual machines; if the number of instances whose working state is faulty among all instances of the first service satisfies a failure policy, determine a first decision result, where the first decision result instructs a standby site to take over the services of the primary site; and send the first decision result to the standby site.
In a possible design, when determining the working state of the first service in each instance of the plurality of virtual machines, the disaster recovery service module may be specifically configured to: determining, for each instance of the first service in a plurality of virtual machines in the primary site, a failure start time for the instance; determining the fault duration according to the fault starting time; if the fault duration is greater than the fault time threshold, determining that the working state of the instance is a fault; and if the fault duration is less than or equal to the fault time threshold, determining that the working state of the instance is not faulted.
In a possible design, each of the plurality of virtual machines may include a disaster recovery agent module, and each disaster recovery agent module is configured to report an application health condition of the instance of the first service on the virtual machine where the disaster recovery agent module is located to the disaster recovery service module. The disaster recovery service module, when determining the failure start time of the instance, may be specifically configured to: receiving and recording the application health condition of the instance reported by a first disaster recovery agent module, wherein the first disaster recovery agent module is a disaster recovery agent module included in a virtual machine for deploying the instance; and if the received application health condition of the instance is abnormal and the application health condition of the instance recorded last time is normal or no record of the application health condition of the instance exists, determining the fault starting time of the instance to be the current time.
In one possible design, each of the plurality of virtual machines may include a disaster recovery agent module, and each disaster recovery agent module is configured to report an application health condition of an instance of the first service on the virtual machine where the disaster recovery agent module is located to the disaster recovery service module; the disaster recovery service module, when determining the failure start time of the instance, may be specifically configured to: if the application health condition of the instance reported by the first disaster recovery agent module is not received, determining that the failure starting time of the instance is the current time, wherein the first disaster recovery agent module is a disaster recovery agent module included in a virtual machine for deploying the instance.
In one possible design, each of the plurality of virtual machines may include a disaster recovery agent module, and each disaster recovery agent module is configured to report an application health condition of an instance of the first service on the virtual machine where the disaster recovery agent module is located to the disaster recovery service module; the disaster recovery service module, when determining the working state of the first service in each instance of the plurality of virtual machines, may specifically be configured to: determining, for each instance of the first service in the plurality of virtual machines in the primary site, a time at which a first disaster recovery agent module reports the application health condition of the instance last time, where the first disaster recovery agent module is a disaster recovery agent module deployed on a virtual machine including the instance in the primary site; determining interruption time according to the time when the first disaster recovery agent module reports the application health condition of the instance last time; if the interruption time is greater than a fault time threshold value, determining that the working state of the instance is a fault; and if the interruption time is less than or equal to a fault time threshold value, determining that the working state of the instance is not faulted.
In a possible design, the primary site may further include a heartbeat interface, and the heartbeat interface is configured to send and receive heartbeat messages between the primary site and the standby site. When sending the first decision result to the standby site, the disaster recovery service module may specifically be configured to send the first decision result to the standby site through the heartbeat interface.
In one possible design, the disaster recovery service module may be further configured to: and if the number of the instances with the working states of faults in all the instances of the first service does not meet the fault strategy, determining a second decision result, wherein the second decision result does not indicate that the standby site takes over the service of the main site.
In one possible design, the primary site may further include an arbitration service module, and the arbitration service module is configured to provide an arbitration service. When sending the first decision result to the standby site, the disaster recovery service module may specifically be configured to send the first decision result to the standby site through the arbitration service provided by the arbitration service module.
In one possible design, the number of the disaster recovery service modules in the primary site may be two, where one of the disaster recovery service modules serves as a primary service and the other serves as a backup service. And the disaster recovery service module as the standby service is used for taking over the service of the disaster recovery service module as the main service when the disaster recovery service module as the main service fails.
In a fourth aspect, the present application provides a site, where the site includes a processor, a memory, a communication interface, and a bus, where the processor, the memory, and the communication interface are connected by the bus and communicate with each other. The memory is used to store computer-executable instructions, and when the site runs, the processor executes the computer-executable instructions in the memory to perform, using hardware resources of the site, the operation steps of the method in the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect, the present application provides a disaster recovery system, including the primary site and the backup site described in the second aspect or any one of the designs of the second aspect.
In one possible design, the disaster recovery system may further include an arbitration site. The arbitration site is used for providing arbitration service for the main site and the standby site.
In one possible design, the primary site, the backup site, and the arbitration site each include an arbitration service module. The arbitration service module of the arbitration site is used for providing arbitration service for the main site and the standby site; the arbitration service module of the primary site is used for sending the decision result of the primary site to the standby site through arbitration service; and the arbitration service module of the standby site is used for acquiring the decision result of the main site through arbitration service.
In a sixth aspect, the present application provides a computer-readable storage medium having stored therein instructions, which, when executed on a computer, cause the computer to perform the method of the first aspect or any one of the possible implementations of the first aspect.
In a seventh aspect, the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect or any one of the possible implementations of the first aspect.
The present application can further combine to provide more implementations on the basis of the implementations provided by the above aspects.
Drawings
Fig. 1 is a schematic diagram of a station protection scheme provided in an embodiment of the present application;
fig. 2 is a schematic diagram of a disaster recovery protection scheme provided in an embodiment of the present application;
fig. 3 is a schematic diagram of an architecture of a disaster recovery system according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a primary site according to an embodiment of the present application;
fig. 5 is a schematic flowchart illustrating a process of updating a failure start time of a first instance by a first disaster recovery service module according to an embodiment of the present application;
fig. 6 is a schematic flowchart of a first disaster recovery service module detecting a working state of a first instance according to an embodiment of the present application;
fig. 7 is a schematic flow chart illustrating a disaster recovery switching performed by the first disaster recovery service module according to an embodiment of the present application;
fig. 8 is a schematic flowchart of a disaster recovery control method according to an embodiment of the present application;
fig. 9 is a schematic diagram of a disaster recovery handover process according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a primary site according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a primary site according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
With the development of cloud computing, software systems are evolving toward service-oriented architectures, and a customer's large-scale software system is often composed of a plurality of services or components, such as a database, message middleware, and business applications. In a data center, private cloud, or public cloud environment, a failure of an instance deployed for a service in a site can cause a service outage, and such an outage can bring huge economic loss and reputation damage to customers.
To address the problem that a service stops when an instance fails, a single-site protection scheme can be adopted. Commonly used single-site protection schemes include:
(1) Clustering scheme: multiple instances are deployed in a site for the same service to form a cluster. When some instances fail, the other instances in the cluster can continue to provide the service. For example, referring to fig. 1, 3 instances are deployed in a site for application A, and when one of the 3 instances fails, the service of application A can still be provided by the other two instances.
(2) Dual-instance (cold standby) scheme: two instances are deployed in the site for the same service, where one instance is running and the other is stopped. With the aid of an auxiliary monitoring means, when the running instance fails, the stopped instance is pulled up and continues to provide the service. For example, as shown in fig. 1, for the web server / reverse proxy / mail proxy server Nginx, 2 Nginx instances are deployed in a site, where one Nginx is running and the other is stopped; when the running Nginx fails, the stopped Nginx is pulled up to provide the Nginx service.
(3) Dual-instance (hot standby) scheme: two instances are deployed in the site for the same service, where both instances are running, one as the primary service and the other as the standby service. When the primary instance fails, the standby instance becomes the primary and continues to provide the service. For example, as shown in fig. 1, for a database (DB) service, 2 DB instances are deployed in a site, where one DB serves as the primary service and the other as the standby service; when the DB serving as the primary service fails, the DB serving as the standby service becomes the primary and continues to provide services.
A single-site protection scheme addresses the failure of some instances of the same service within a single site. If all instances of the same service in a single site fail, the service is interrupted and cannot continue to be provided.
To address the problem that services stop when a site fails, a disaster recovery protection scheme can be adopted. The disaster recovery protection scheme is as follows: two peer sites are established, where one site serves customers as the primary site, and the other serves as a standby site that provides backup capability for the primary site. The standby site monitors whether the primary site fails by monitoring heartbeat messages exchanged with the primary site. When the primary site is damaged by an earthquake, a fire, a network link failure, or another disaster, the heartbeat messages between the primary site and the standby site are interrupted, so that the standby site, upon detecting the interruption, can take over the services of the primary site and continue to provide services for customers, as shown in fig. 2. Disaster recovery protection can be divided into remote disaster recovery and same-city disaster recovery: remote disaster recovery means that the two sites are deployed in different cities, and same-city disaster recovery means that the two sites are deployed in different locations of the same city.
However, when all instances deployed for a certain service in the primary site fail and the service is interrupted, the heartbeat messages between the primary site and the standby site remain normal, so neither the primary site nor the standby site can provide that service for the customer.
Based on this, in the embodiments of the present application, the working state of each instance in the site is monitored, and whether to switch the service to the standby site is then decided according to the working state of each instance. Compared with the prior art, in which switching to the standby site happens only after the entire primary site loses power or fails, the embodiments of the present application enable finer-grained monitoring, so that the service can be switched to the standby site when some application in the site cannot provide service, for example, when all instances of a certain service fail. The standby site can then continue to provide the service for the customer, reducing the customer's economic loss and reputation damage.
In the present application, "a plurality of" means two or more, and "at least one" means one or more, that is, one, two, three, or more.
In addition, it is to be understood that the terms first, second, etc. in the description of the present application are used for distinguishing between the descriptions and not necessarily for describing a sequential or chronological order.
The disaster recovery control method provided in the embodiment of the present application may be applied to the disaster recovery system shown in fig. 3, where the disaster recovery system may include a primary site and a backup site, and may further include an arbitration site, where the primary site is used to provide services for clients. The main site can also determine a decision result according to the working state of each instance, wherein the decision result is a first decision result or a second decision result, the first decision result indicates that the standby site takes over the service of the main site, and the second decision result does not indicate that the standby site takes over the service of the main site. The standby site is used for taking over the service of the main site under the indication of the main site. The arbitration station is used for providing arbitration service for the main station and the standby station, the main station is also used for sending a decision result to the standby station through the arbitration service, and the standby station is also used for obtaining the decision result of the main station through the arbitration service.
In one embodiment, as shown in fig. 4, a primary site may include at least one disaster recovery agent module and a first disaster recovery service module, where each disaster recovery agent module is independently deployed on one virtual machine operated by the primary site, and the first disaster recovery service module may be deployed on any virtual machine in the primary site. It should be understood that fig. 4 is only an exemplary illustration, and the number of virtual machines, disaster recovery proxy modules, and first disaster recovery service modules included in the primary site is not specifically limited.
Each disaster recovery agent module is configured to report the application health condition of each instance on the virtual machine where the agent module is located to the first disaster recovery service module.
The disaster recovery agent module may report the application health condition of each instance on the virtual machine where the disaster recovery agent module is located through steps a1 to a 4:
and A1, reading the deployment information of the application on the virtual machine.
And A2, determining the instances included in the virtual machine according to the deployment information, such as a key business management service, a database, message middleware and the like.
A3, collecting the application health condition of each instance on the virtual machine. For example, the disaster tolerance agent module may collect the application health condition of each instance on the virtual machine in a polling-like manner, for example, the disaster tolerance agent module may collect the application health condition of each instance by querying a process, a service state, a keep-alive interface, and the like of each instance.
A4, reporting the application health condition of each instance to the first disaster recovery service module. Illustratively, the disaster recovery agent module may report the application health information by calling an interface provided by the first disaster recovery service module, for example, an interface defined in YAML.
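For illustration only, the following is a minimal Python sketch of a disaster recovery agent module performing steps A1 to A4. The report URL, the polling interval, the hard-coded instance list, and the systemctl health checks are all assumptions made for the example; the patent only requires that the agent collect each instance's application health and report it through an interface provided by the first disaster recovery service module.

    import json
    import subprocess
    import time
    import urllib.request

    # Assumed HTTP report interface on the first disaster recovery service module;
    # the patent only states that an interface (e.g. defined in YAML) is called.
    REPORT_URL = "http://dr-service.example:18080/v1/instance-health"  # assumption
    POLL_INTERVAL_S = 5                                                # assumption

    # A1/A2: instances read from the deployment information of this virtual machine
    # (hard-coded here for brevity; the health-check commands are assumptions).
    INSTANCES = {
        "db-0": ["systemctl", "is-active", "--quiet", "mysqld"],
        "mq-0": ["systemctl", "is-active", "--quiet", "rabbitmq-server"],
    }

    def collect_health(check_cmd):
        # A3: probe one instance, e.g. by querying its service state.
        return subprocess.call(check_cmd) == 0

    def report(instance, healthy):
        # A4: report the application health condition of one instance.
        body = json.dumps({"instance": instance, "healthy": healthy,
                           "ts": time.time()}).encode()
        req = urllib.request.Request(REPORT_URL, data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=3)

    if __name__ == "__main__":
        while True:                      # polling-like collection loop
            for name, check_cmd in INSTANCES.items():
                try:
                    report(name, collect_health(check_cmd))
                except OSError:
                    pass                 # service module unreachable; retry next round
            time.sleep(POLL_INTERVAL_S)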
And the first disaster recovery service module is used for determining a decision result based on the application health condition reported by each disaster recovery agent module and sending the first decision result to the standby site.
As a possible implementation, the first disaster recovery service module may determine the decision result through steps B1 to B4:
B1, the first disaster recovery service module parses the failure policy configuration file. The failure policy configuration file may include a single failure policy formulated for all services, or it may include failure policies tailored for individual services, for example, a failure policy tailored for a first service, a failure policy tailored for a second service, and so on.
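As an illustration of step B1, the sketch below parses a hypothetical failure policy configuration file. The JSON layout and the field names (fault_time_threshold_s, failed_instance_count, failed_instance_ratio) are assumptions; the patent does not define a concrete schema, only that a policy may be tailored per service or apply to all services.

    import json

    # Assumed contents of a failure policy configuration file (step B1).
    POLICY_FILE_CONTENT = """
    {
      "default":  {"fault_time_threshold_s": 60},
      "services": {
        "database":   {"fault_time_threshold_s": 30, "failed_instance_count": 4},
        "message-mq": {"fault_time_threshold_s": 60, "failed_instance_ratio": 0.5}
      }
    }
    """

    def parse_failure_policies(text=POLICY_FILE_CONTENT):
        cfg = json.loads(text)
        default = cfg.get("default", {})
        per_service = cfg.get("services", {})
        # Return the policy tailored for a service, or the default one.
        return lambda service: per_service.get(service, default)

    policy_for = parse_failure_policies()
    print(policy_for("database"))       # tailored policy for the database service
    print(policy_for("web-frontend"))   # falls back to the default policy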
And B2, the first disaster recovery service module receives the application health condition reported by the disaster recovery agent module, and updates the failure starting time of the application in the cache according to the application health condition of the instance.
For each instance on the virtual machine, if the received application health condition of the instance is abnormal, and the application health condition of the instance recorded last in the cache is normal or no record of the application health condition of the instance, the first disaster recovery service module sets the failure start time of the instance in the cache as the current time.
If the received application health condition of the instance is normal, the first disaster recovery service module may set the application health condition of the instance to be normal in the cache, and may also set the failure start time of the instance to be 0.
In another example, if the application health condition of the instance reported by the first virtual machine is not received, the first disaster recovery service module may consider that the first virtual machine has failed, that is, the instance has failed, and may therefore set the failure start time of the instance in the cache to the current time.
For better understanding of the embodiment of the present application, the process of step B2 is described in detail below with reference to a specific application scenario. The following description will take a first example as an example, and refer to fig. 5 to illustrate a process of updating the failure start time of the first example for the first disaster recovery service module:
s501, the first disaster recovery service module receives the application health condition of the first instance reported by the first disaster recovery agent module, where the first disaster recovery agent module is a disaster recovery agent module deployed on a virtual machine where the first instance is located. Step S502 is performed.
S502, the first disaster recovery service module determines whether the cache has the failure starting time of the first instance. If yes, go to step S503; if not, go to step S507.
S503, the first disaster recovery service module determines whether the application health condition of the first instance in the cache is normal. If yes, go to step S504; if not, go to step S511.
S504, the first disaster recovery service module determines whether the application health condition of the first instance reported by the first disaster recovery agent module is normal. If so, go to step S505, otherwise, go to step S506.
And S505, the first disaster recovery service module does not update the application health condition and the failure starting time of the first instance in the cache.
S506, the first disaster recovery service module sets the start time of the failure of the first instance in the cache as the current time.
S507, the first disaster recovery service module adds a record of the first instance to the cache. Step S508 is performed.
S508, the first disaster recovery service module determines whether the health condition of the application of the first instance reported by the first disaster recovery agent module is normal. If so, go to step S509, otherwise, go to step S510.
S509, the first disaster recovery service module sets the application health condition of the first instance to be normal in the cache, and sets the failure start time of the first instance to be 0.
S510, the first disaster recovery service module sets the failure start time of the first instance as the current time in the cache.
S511, the first disaster recovery service module determines whether the application health condition of the first instance reported by the first disaster recovery agent module is normal. If yes, go to step S512, otherwise go to step S513.
S512, the first disaster recovery service module sets the application health condition of the first instance to be normal in the cache, and updates the failure starting time of the first instance to be 0.
S513, the first disaster recovery service module does not update the application health condition and the failure start time of the first instance in the cache.
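The cache-update logic of Fig. 5 can be summarized in a short sketch such as the one below, assuming the cache is a dictionary keyed by instance name; the last_report field is an added assumption that is used later for the interruption-time check.

    import time

    # Cache layout assumed for the sketch: instance name -> health record.
    cache = {}  # {"healthy": bool, "fault_start": float, "last_report": float}

    def on_health_report(instance, healthy, now=None):
        # Step B2 / Fig. 5: update the cached failure start time for one report.
        now = time.time() if now is None else now
        record = cache.get(instance)
        if record is None:                                        # S502 no -> S507
            cache[instance] = {"healthy": healthy,
                               "fault_start": 0.0 if healthy else now,  # S509 / S510
                               "last_report": now}
            return
        record["last_report"] = now     # remembered for the interruption check (Fig. 6)
        if record["healthy"] and not healthy:       # normal -> abnormal jump: S506
            record["healthy"] = False
            record["fault_start"] = now
        elif not record["healthy"] and healthy:     # abnormal -> normal: S512
            record["healthy"] = True
            record["fault_start"] = 0.0
        # otherwise (S505 / S513): keep the cached health and failure start time

    on_health_report("db-0", healthy=False)   # first abnormal report: fault_start = now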
And B3, periodically detecting the working state of each instance according to the failure starting time of each instance in the cache.
In one implementation, for each instance, when the failure start time of the instance exists in the cache: if the failure start time of the instance is 0 or the application health condition of the instance in the cache is normal, the working state of the instance may be determined to be not faulty. If the failure start time of the instance is not 0 and the application health condition of the instance in the cache is abnormal, the fault duration may be determined from the failure start time, for example Δt1 = t1 - t2, where Δt1 is the fault duration, t1 is the current time, and t2 is the failure start time. When the fault duration is greater than the fault time threshold, the working state of the instance may be determined to be faulty; when the fault duration is less than or equal to the fault time threshold, the working state of the instance may be determined to be not faulty.
If the fault starting time of the instance does not exist in the cache and the time of the application health condition of the instance is never received, the fault starting time of the instance can be determined as the current time, and the fault starting time of the instance is recorded in the cache.
In another implementation, the time at which the application health condition of the instance was last received is determined, and an interruption time is determined from that time. Illustratively, Δt2 = t1 - t3, where Δt2 is the interruption time, t1 is the current time, and t3 is the time at which the application health condition of the instance was last received. If the interruption time is greater than the fault time threshold, the working state of the instance may be determined to be faulty; if the interruption time is less than or equal to the fault time threshold, the working state of the instance may be determined to be not faulty.
For example, the time to failure threshold may be configured in a failure policy. Therefore, the first disaster recovery service module can parse the failure policy configuration file to determine the failure time threshold through step B1.
For better understanding of the embodiment of the present application, the process of step B3 is described in detail below with reference to a specific application scenario. The first disaster recovery service module traverses the instances and detects the working state of each one. The following description takes a first instance as an example, where the first instance belongs to the first service; refer to fig. 6, which shows the process by which the first disaster recovery service module detects the working state of the first instance:
s601, the first disaster recovery service module determines whether the application health condition of the first instance from the first disaster recovery agent module is received. If yes, go to step S602. If not, go to step S608.
S602, the first disaster recovery service module determines an interruption time of the first instance, where the interruption time of the first instance is the difference between the current time and the time at which the application health condition of the first instance was last received from the first disaster recovery agent module. Step S603 is performed.
S603, the first disaster recovery service module determines whether the interruption time of the first instance is greater than a failure time threshold in a failure policy corresponding to the first service. If yes, go to step S604. If not, go to step S605.
S604, the first disaster recovery service module determines that the working state of the first instance is a fault.
S605, the first disaster recovery service module determines a failure duration of the first instance, where the failure duration of the first instance is a difference between the current time and a failure start time of the first instance. Step S606 is performed.
S606, the first disaster recovery service module determines whether the failure duration of the first instance is greater than the failure time threshold in the failure policy corresponding to the first service. If yes, go to step S604. If not, go to step S607.
S607, the first disaster recovery service module determines that the working status of the first instance is not failed.
S608, the first disaster recovery service module determines whether a failure start time of the first instance exists in the cache. If yes, go to step S605. If not, go to step S609.
S609, the first disaster recovery service module sets the failure start time of the first instance as the current time in the cache.
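For illustration, the following sketch combines step B3 and the flow of Fig. 6 into one working-state check, assuming the cache layout of the previous sketch; the string return values "failed" / "not failed" are merely placeholders for the two working states.

    import time

    cache = {}  # same layout as in the previous sketch: instance -> health record

    def working_state(instance, fault_time_threshold_s, now=None):
        # Step B3 / Fig. 6: decide whether one instance is faulty.
        now = time.time() if now is None else now
        record = cache.setdefault(instance, {})
        last_report = record.get("last_report")

        if last_report is None:                          # S601: no report received yet
            if "fault_start" not in record:              # S608 -> S609
                record["fault_start"] = now
                return "not failed"                      # re-evaluated next period
        else:
            interruption = now - last_report             # S602
            if interruption > fault_time_threshold_s:    # S603 -> S604
                return "failed"

        # A healthy instance (failure start time reset to 0) is never judged faulty.
        if record.get("healthy") or record.get("fault_start", 0) == 0:
            return "not failed"

        fault_duration = now - record["fault_start"]     # S605
        if fault_duration > fault_time_threshold_s:      # S606 -> S604
            return "failed"
        return "not failed"                              # S607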
B4, determining whether the number of instances with failure status in all instances of the service satisfies the failure policy. If yes, determining a first decision result, namely indicating the standby site to take over the service of the main site. If not, determining a second decision result, and not indicating the standby site to take over the service of the main site.
For example, the failure policy may include an instance-number threshold; the failure policy is determined to be satisfied when the number of instances of the service whose working state is faulty is greater than the threshold, and otherwise not satisfied. For example, if the failure policy specifies an instance-number threshold of 4, the failure policy is determined to be satisfied when more than 4 instances of the service are faulty, and otherwise not satisfied.
Alternatively, the failure policy may include a ratio threshold for faulty instances; the failure policy is determined to be satisfied when the proportion of faulty instances among all instances of the service exceeds the ratio threshold, and otherwise not satisfied. For example, if the failure policy is that more than half of the instances are faulty, the policy is satisfied when faulty instances account for more than 50% of all instances of the service, and otherwise not. As another example, if the failure policy is that all instances fail, the policy is satisfied when all instances of the service are determined to be faulty, and otherwise not.
If the failure policy configuration file includes a single failure policy formulated for all services, whether the number of faulty instances among all instances of the service satisfies that policy is determined. If the configuration file includes failure policies formulated for individual services, whether the number of faulty instances among all instances of the service satisfies the policy formulated for that service is determined.
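The policy check of step B4 might look like the following sketch, which supports an instance-count threshold, an instance-ratio threshold, or (by default) the "all instances failed" policy; the field names follow the schema assumed earlier and are not the patent's exact format.

    # Step B4 sketch: do the per-instance working states satisfy the failure policy?
    def policy_satisfied(states, policy):
        # states: list of working states, each "failed" or "not failed".
        failed = sum(1 for s in states if s == "failed")
        if "failed_instance_count" in policy:     # e.g. more than 4 instances faulty
            return failed > policy["failed_instance_count"]
        if "failed_instance_ratio" in policy:     # e.g. more than half of the instances
            return failed / max(len(states), 1) > policy["failed_instance_ratio"]
        return len(states) > 0 and failed == len(states)   # default: all instances failed

    def decide(states, policy):
        # First decision result: instruct the standby site to take over; second: do not.
        return "first decision result" if policy_satisfied(states, policy) else "second decision result"

    print(decide(["failed"] * 5, {"failed_instance_count": 4}))   # -> first decision result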
The primary site may further include a second disaster recovery service module, where the second disaster recovery service module may be deployed on any virtual machine in the primary site. The second disaster recovery service module is configured to take over the work of the first disaster recovery service module when the first disaster recovery service module fails. When the first disaster recovery service module is running, the second disaster recovery service module may be in a stopped state or in a running state, which is not specifically limited in this embodiment of the present application.
In one implementation, the primary site may further include a heartbeat interface, and the heartbeat interface is used to receive and send heartbeat messages between the primary site and the standby site.
In an exemplary illustration, when the first disaster recovery service module sends the first decision result to the backup site, the first decision result may be sent to the backup site through the heartbeat interface.
The main site can further comprise an arbitration service module, the arbitration service module can be deployed on any virtual machine of the main site, and the arbitration service module is used for storing decision results. The first disaster recovery service module is further configured to write the decision result into the arbitration service module after determining the decision result.
The standby site and the arbitration site may also include arbitration service modules. The arbitration service module in the primary site, the arbitration service module in the standby site, and the arbitration service module in the arbitration site can form a cluster, and after the arbitration service module of the primary site writes the decision result, the decision result is shared within the cluster, so that the standby site can obtain it by querying.
In another example, when the first disaster recovery service module sends the first decision result to the standby site, it may also store the first decision result in an arbitration service module, so that the first decision result is sent to the standby site through the arbitration service.
The main site may include two arbitration service modules, one of which serves as a main service to provide arbitration services, the other of which serves as a standby service, and when the arbitration service module serving as the main service fails, the arbitration service module serving as the standby service serves as the main service to continue arbitration services. When the arbitration service module serving as the main service runs, the arbitration service module serving as the standby service may be in a stop state or a running state, which is not specifically limited in the present application.
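The way the arbitration service shares a decision result can be pictured with the toy sketch below. A real deployment would use a replicated, quorum-backed store spanning the primary, standby, and arbitration sites; the in-memory class here only stands in for that behavior and is not part of the patent.

    # Toy stand-in for the replicated arbitration service: the instances on the
    # primary, standby and arbitration sites behave like one shared key-value store.
    class ArbitrationClient:
        _replicated = {}   # in reality, replicated across the three-site cluster

        def write(self, key, value):
            ArbitrationClient._replicated[key] = value

        def read(self, key):
            return ArbitrationClient._replicated.get(key)

    # Primary site: write the decision result into its arbitration service instance.
    ArbitrationClient().write("decision/first-service", "first decision result")

    # Standby site: obtain the primary site's decision result by querying the cluster.
    print(ArbitrationClient().read("decision/first-service"))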
In order to better understand the embodiment of the present application, the following describes in detail a process of performing disaster recovery switching on the first disaster recovery service module in combination with a specific application scenario. Referring to FIG. 7:
s701, the first disaster recovery service module traverses each service and determines whether each service meets a corresponding fault strategy or not. If yes, go to step S702. If not, go to step S703.
S702, the first disaster recovery service module obtains a second decision result. Step S707 is executed.
S703, the first disaster recovery service module obtains a first decision result. Step S704 is performed.
S704, the first disaster recovery service module determines whether the site is a main site. If yes, go to step S705. If not, the process is ended.
S705, the first disaster recovery service module determines whether the heartbeat with the standby site is normal. If yes, step S706 is performed. If not, step S707 is performed.
S706, the first disaster recovery service module sends the first decision result to the standby site through the heartbeat interface, instructing the standby site to be promoted to the primary site. After the standby site is promoted to the primary site, it takes over the services of the original primary site. Step S707 is performed.
S707, the first disaster recovery service module writes the first decision result into the arbitration service module.
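Putting the pieces of Fig. 7 together, a decision-and-switch pass might be sketched as follows; every callable passed in is a stand-in for a module described above, and the example call at the end is fabricated for illustration.

    def run_switch_decision(services, service_satisfies_policy, is_primary_site,
                            heartbeat_ok, send_over_heartbeat, write_to_arbitration):
        # Fig. 7 sketch; all callables are injected stand-ins for the modules above.
        # S701: traverse the services and check whether any satisfies its failure policy.
        if not any(service_satisfies_policy(s) for s in services):
            decision = "second decision result"          # S702: do not instruct takeover
            write_to_arbitration(decision)               # S707
            return decision
        decision = "first decision result"               # S703
        if not is_primary_site:                          # S704: only the primary site proceeds
            return decision
        if heartbeat_ok:                                 # S705
            send_over_heartbeat(decision)                # S706: standby site is promoted
        write_to_arbitration(decision)                   # S707: always recorded in arbitration
        return decision

    # Example: all instances of the database service have failed on the primary site.
    print(run_switch_decision(
        services=["database", "message-mq"],
        service_satisfies_policy=lambda s: s == "database",
        is_primary_site=True,
        heartbeat_ok=False,                              # heartbeat network interrupted
        send_over_heartbeat=print,
        write_to_arbitration=lambda d: print("arbitration <-", d)))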
In another embodiment, the structures of the backup site and the arbitration site in the disaster recovery system shown in fig. 3 may refer to the structure of the primary site shown in fig. 4, and are not repeated here.
The first disaster recovery service module in the standby site can also be used to receive the first decision result from the primary site through the heartbeat interface of the standby site and execute the processing of taking over the services of the primary site.
The first disaster recovery service module of the backup site may be further configured to obtain a decision result of the primary site through the arbitration service, and if the obtained decision result is the first decision result, perform processing of taking over the service of the primary site. For example, the first disaster recovery service module of the backup site may periodically query the arbitration service module in the backup site to obtain the decision result of the primary site.
In one embodiment, if the first disaster recovery service module of the standby site has received the first decision result from the primary site through the heartbeat interface of the standby site, and the decision result later obtained through the arbitration service is also the first decision result, the first decision result is processed only once.
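On the standby-site side, the handling described above (receive over the heartbeat interface, or poll the arbitration service, but take over only once) could be sketched as below; the class and method names are illustrative assumptions.

    class BackupSiteDR:
        # Standby-site handling: take over on the first decision result, exactly once,
        # whether it arrives over the heartbeat interface or via the arbitration query.
        def __init__(self, take_over):
            self._take_over = take_over
            self._handled = False

        def on_heartbeat_decision(self, decision):
            self._maybe_take_over(decision)

        def poll_arbitration(self, read_decision):
            self._maybe_take_over(read_decision())

        def _maybe_take_over(self, decision):
            if decision == "first decision result" and not self._handled:
                self._handled = True          # the same decision is processed only once
                self._take_over()

    backup = BackupSiteDR(take_over=lambda: print("standby site takes over the primary site"))
    backup.on_heartbeat_decision("first decision result")      # received over heartbeat
    backup.poll_arbitration(lambda: "first decision result")   # later query: already handled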
The disaster recovery control method provided in the embodiment of the present application is further described below with reference to the disaster recovery system shown in fig. 3. Referring to fig. 8, a flowchart of a method of the disaster recovery control method provided in the present application is shown. The method can be implemented by the primary site and the standby site in fig. 3, and the method can include the following steps:
s801, a primary site determines the working state of each instance of a first service in a plurality of virtual machines in the primary site aiming at the first service provided by the primary site. The first service may be any service provided by the primary site, or the first service may also be a key service in the primary site, such as a database, a key service management service, and the like.
In one embodiment, the determining, by the primary site, the operating state of each instance of the first service in the plurality of virtual machines in the primary site may be implemented through steps C1 to C4:
c1, for each instance of the first service in the plurality of virtual machines in the primary site, the primary site determining a failure start time of the instance.
In one example, the primary site can receive and record the application health condition of the instance. If the received application health condition of the instance is abnormal, and the application health condition of the instance recorded last time is normal or there is no record of the application health condition of the instance, the failure start time of the instance is determined to be the current time. The application health condition of the instance may be reported by a first virtual machine, where the first virtual machine is the virtual machine in the primary site on which the instance is deployed. Further, the first virtual machine may periodically report the application health condition of the instance.
Another exemplary illustration shows that, if the application health condition of the instance reported by the first virtual machine is not received, the primary site may consider that the first virtual machine fails, that is, the instance fails, so that it may be determined that the failure start time of the instance is the current time.
C2, the master site determining the fault duration according to the fault start time. For example, Δ t1 is t1-t2, where Δ t1 is the fault duration, t1 is the current time, and t2 is the fault start time.
C3, if the failure duration is greater than the failure time threshold, the master site determines that the working state of the instance is failure.
C4, if the failure duration is less than or equal to the failure time threshold, the primary site determines that the working state of the instance is not failed.
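The following minimal Python sketch illustrates the logic of steps C1 to C4 together with the failure-start-time rules described above. It is only an illustration of the decision flow; the class and function names (InstanceMonitor, report_health, and so on), the threshold value, and the clearing of the record when a normal report arrives are assumptions, not details specified in this application.

```python
import time

# Sketch of steps C1 to C4: record a failure start time per instance and derive
# the working state from the fault duration. Names and threshold are hypothetical.

FAULT_TIME_THRESHOLD = 30.0  # seconds; assumed value for illustration


class InstanceMonitor:
    def __init__(self):
        self.last_health = {}         # instance_id -> "normal" / "abnormal"
        self.failure_start_time = {}  # instance_id -> timestamp (C1)

    def report_health(self, instance_id, health, now=None):
        """Handle an application health report from the virtual machine hosting the instance."""
        now = time.time() if now is None else now
        previous = self.last_health.get(instance_id)
        if health == "abnormal" and previous in (None, "normal"):
            # Abnormal report after a normal report (or with no prior record):
            # the failure start time is the current time.
            self.failure_start_time[instance_id] = now
        elif health == "normal":
            # Assumption: a normal report clears any recorded failure.
            self.failure_start_time.pop(instance_id, None)
        self.last_health[instance_id] = health

    def report_missing(self, instance_id, now=None):
        """Handle the case where no health report is received: the instance is treated as failed."""
        now = time.time() if now is None else now
        self.failure_start_time.setdefault(instance_id, now)

    def working_state(self, instance_id, now=None):
        """C2 to C4: compare the fault duration with the fault time threshold."""
        now = time.time() if now is None else now
        start = self.failure_start_time.get(instance_id)
        if start is None:
            return "ok"
        fault_duration = now - start  # delta_t1 = t1 - t2
        return "fault" if fault_duration > FAULT_TIME_THRESHOLD else "ok"


monitor = InstanceMonitor()
monitor.report_health("db-1", "abnormal", now=100.0)  # C1: failure starts at t2 = 100
print(monitor.working_state("db-1", now=120.0))       # "ok": 20 s <= threshold (C4)
print(monitor.working_state("db-1", now=150.0))       # "fault": 50 s > threshold (C3)
```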
In another embodiment, the determining, by the primary site, of the working state of each instance of the first service in the plurality of virtual machines in the primary site may alternatively be implemented through steps D1 to D4:
D1, for each instance of the first service in the plurality of virtual machines in the primary site, determining the time at which the application health condition of the instance was last received;
D2, determining the interruption time according to the time at which the application health condition of the instance was last received. For example, Δt2 = t1 - t3, where Δt2 is the interruption time, t1 is the current time, and t3 is the time at which the application health condition of the instance was last received.
D3, if the interruption time is greater than the fault time threshold, determining that the working state of the instance is a fault.
D4, if the interruption time is less than or equal to the fault time threshold, determining that the working state of the instance is not a fault.
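A corresponding sketch of the D1 to D4 variant, which derives the working state from the time the application health condition was last received rather than from a recorded failure start time, might look as follows. The names, the threshold value, and the treatment of an instance that has never reported are assumptions made for illustration.

```python
import time

FAULT_TIME_THRESHOLD = 30.0  # seconds; assumed value for illustration

last_report_time = {}  # instance_id -> timestamp of the most recent health report (D1)


def on_health_report(instance_id, now=None):
    """D1: remember when the application health condition of the instance was last received."""
    last_report_time[instance_id] = time.time() if now is None else now


def working_state(instance_id, now=None):
    """D2 to D4: derive the working state from the interruption time delta_t2 = t1 - t3."""
    now = time.time() if now is None else now
    t3 = last_report_time.get(instance_id)
    if t3 is None:
        return "fault"  # assumption: an instance that has never reported is treated as faulty
    interruption_time = now - t3
    return "fault" if interruption_time > FAULT_TIME_THRESHOLD else "ok"


on_health_report("db-1", now=100.0)
print(working_state("db-1", now=110.0))  # "ok": interruption time 10 s
print(working_state("db-1", now=200.0))  # "fault": interruption time 100 s
```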
In implementation, the disaster recovery agent module of the first service in the plurality of virtual machines in the primary site may report the application health condition of each instance of the first service to the first disaster recovery service module of the primary site. The first disaster recovery service module may then determine the working state of each instance of the first service based on the application health condition of each instance of the first service.
For the process by which the disaster recovery agent module of the first service in the plurality of virtual machines in the primary site reports the application health condition of each instance of the first service to the first disaster recovery service module of the primary site, refer to steps A1 to A4; details are not repeated here. For the process by which the first disaster recovery service module determines the working state of each instance of the first service, refer to step B3; details are not repeated here.
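Purely as an illustration of the reporting path between the disaster recovery agent module and the first disaster recovery service module, an agent-side loop could be sketched as below; the probe, the transport, and the reporting interval are placeholders, since steps A1 to A4 are described elsewhere in this document and not reproduced here.

```python
import time


def probe_instance(instance_id):
    """Placeholder health probe; a real agent would check the application itself."""
    return "normal"


def report_to_service_module(payload):
    """Placeholder transport; a real agent would deliver this to the first disaster
    recovery service module of the primary site."""
    print("reporting:", payload)


def agent_loop(instance_ids, interval=5.0, cycles=3):
    """Periodically collect the application health condition of each local instance
    of the first service and report it (the agent-side half of the reporting path)."""
    for _ in range(cycles):
        payload = {iid: probe_instance(iid) for iid in instance_ids}
        report_to_service_module(payload)
        time.sleep(interval)


if __name__ == "__main__":
    agent_loop(["db-instance-1", "db-instance-2"], interval=0.1, cycles=2)
```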
S802, when the number of instances whose working state is a fault among all instances of the first service in the primary site satisfies a fault policy, the primary site determines a first decision result, where the first decision result indicates that the backup site is to take over the services of the primary site.
Alternatively, if the number of instances whose working state is a fault among all instances of the first service in the primary site does not satisfy the fault policy, the primary site may determine a second decision result, where the second decision result does not indicate that the backup site is to take over the services of the primary site.
For example, the fault policy may include a threshold on the number of faulty instances: when the number of instances of the service whose working state is a fault is greater than the threshold, it may be determined that the fault policy is satisfied; otherwise, it may be determined that the fault policy is not satisfied. For instance, if the fault policy sets the threshold to 4, the fault policy is satisfied when more than 4 instances of the service are in the fault state, and is not satisfied otherwise.
Alternatively, the fault policy may include a weight threshold for faulty instances, that is, a threshold on the proportion of faulty instances among all instances of the service: when the proportion of faulty instances is greater than the threshold, it may be determined that the fault policy is satisfied; otherwise, it may be determined that the fault policy is not satisfied. For example, if the fault policy is that more than half of the instances are faulty, the policy is satisfied when faulty instances account for more than 50% of all instances of the service, and is not satisfied otherwise. As another example, if the fault policy is that all instances are faulty, the policy is satisfied when every instance of the service is determined to be faulty, and is not satisfied otherwise.
Specifically, the fault policy may be a policy formulated for the first service, or a policy formulated for all services; this is not specifically limited in this application.
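Evaluating the fault policy over the per-instance working states then reduces to a simple check such as the sketch below. The policy representation (a dictionary with hypothetical count_threshold and ratio_threshold fields) is an assumption for illustration; this application only describes the two kinds of thresholds in general terms.

```python
# Minimal sketch of a fault-policy check over the per-instance working states.
# The dictionary keys "count_threshold" and "ratio_threshold" are hypothetical.

def policy_satisfied(states, policy):
    """states: list of per-instance working states such as ["fault", "ok", ...]."""
    faulty = sum(1 for s in states if s == "fault")
    if "count_threshold" in policy:           # e.g. more than 4 faulty instances
        return faulty > policy["count_threshold"]
    if "ratio_threshold" in policy:           # e.g. more than half of all instances faulty
        return len(states) > 0 and faulty / len(states) > policy["ratio_threshold"]
    return False


# The two example policies mentioned above.
states = ["fault", "fault", "fault", "ok", "ok"]
print(policy_satisfied(states, {"count_threshold": 4}))    # False: only 3 faulty instances
print(policy_satisfied(states, {"ratio_threshold": 0.5}))  # True: 3/5 > 50%
```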
In an implementation, step S802 may be performed by a first disaster recovery service module in the primary site. The process of determining the decision result by the first disaster recovery service module may refer to step B4, which is not repeated herein.
S803, the primary site sends the first decision result to the backup site.
In one implementation, the primary site may send the first decision result directly to the backup site. For example, the primary site may send the first decision result to the backup site through a heartbeat network between the primary site and the backup site.
In another implementation, the primary site may send the first decision result to the backup site through an arbitration service. For example, the primary site may write the first decision result into the instance of the arbitration service at the primary site, where instances of the arbitration service may be deployed at the primary site, the backup site, and the arbitration site. The instance of the arbitration service at the primary site, the instance at the backup site, and the instance at the arbitration site form a cluster, so the first decision result written by the instance at the primary site can be shared within the cluster. The backup site can therefore obtain the first decision result from the cluster by querying.
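The arbitration-service path can be pictured as a small replicated key-value store shared by the three sites: the primary site writes the decision result under a known key, and the backup site polls that key. In the sketch below, the in-memory ArbitrationCluster class, the key name, and the polling interval are assumptions that stand in for a real arbitration service.

```python
import threading
import time


class ArbitrationCluster:
    """Stand-in for the arbitration-service cluster formed by the instances at the
    primary site, the backup site, and the arbitration site; writes are visible to
    every member."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def write(self, key, value):
        with self._lock:
            self._data[key] = value

    def read(self, key):
        with self._lock:
            return self._data.get(key)


DECISION_KEY = "primary/decision"  # hypothetical key name


def primary_publish_decision(cluster, decision):
    """Primary site writes its decision result into its local arbitration-service instance."""
    cluster.write(DECISION_KEY, decision)


def backup_poll_decision(cluster, interval=2.0, rounds=5):
    """Backup site periodically queries the arbitration service for the primary's decision."""
    for _ in range(rounds):
        decision = cluster.read(DECISION_KEY)
        if decision == "TAKE_OVER":
            return decision
        time.sleep(interval)
    return None


cluster = ArbitrationCluster()
primary_publish_decision(cluster, "TAKE_OVER")
print(backup_poll_decision(cluster, interval=0.1, rounds=3))  # -> TAKE_OVER
```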
In implementation, step S803 may be performed by the first disaster recovery service module of the primary site.
S804, after receiving the first decision result, the backup site takes over the services of the primary site. The backup site may take over all services of the primary site, or may take over only the first service of the primary site; this is not specifically limited in this application.
In an implementation, step S804 may be performed by the first disaster recovery service module of the backup site.
In the embodiments of this application, the primary site can monitor the working state of each instance and can then, based on the working state of each instance, switch services to the backup site when needed. For example, when the primary site detects that all instances of a certain service have failed, it instructs the backup site to take over the services of the primary site, so that the backup site can continue to provide services for customers, thereby reducing the customers' economic loss and reputational impact.
For better understanding of the embodiment of the present application, the following describes in detail a disaster recovery handover process with reference to the disaster recovery system shown in fig. 4. The process of disaster recovery handover is shown in fig. 9:
S901, the disaster recovery agent module of the primary site collects the application health condition of each instance on the virtual machine where it is located.
For the process by which the disaster recovery agent module collects the application health condition of each instance on the virtual machine where it is located, refer to steps A1 to A3 described above; details are not repeated here.
S902, the disaster recovery agent module of the primary site reports the collected application health conditions to the first disaster recovery service module of the primary site.
S903, the first disaster recovery service module of the primary site parses the fault policy configuration file.
S904, the first disaster recovery service module of the primary site aggregates the application health conditions reported by the disaster recovery agent modules of the primary site and produces a decision result.
For the process by which the first disaster recovery service module aggregates the application health conditions reported by the disaster recovery agent modules of the primary site and produces the decision result, refer to steps B2 to B4 above; details are not repeated here.
S905, the first disaster recovery service module of the primary site writes the decision result into the arbitration service module of the primary site.
S906, if the decision result made by the first disaster recovery service module of the primary site indicates that the backup site is to take over the services of the primary site, and the heartbeat between the primary site and the backup site is normal, the first disaster recovery service module of the primary site calls the heartbeat interface of the primary site to send the decision result to the backup site through the heartbeat network between the primary site and the backup site.
S907, after the first disaster recovery service module of the backup site receives the decision result indicating that the backup site is to take over the services of the primary site, it performs the operation of taking over the services of the primary site.
S908, the first disaster recovery service module of the backup site periodically queries the decision result of the primary site in the arbitration service module of the backup site.
S909, if the decision result queried in the arbitration service module of the backup site indicates that the backup site takes over the service of the primary site, the first disaster recovery service module of the backup site performs an operation of taking over the service of the primary site.
In one implementation, if, before step S909, the first disaster recovery service module of the backup site has already received from the primary site the decision result indicating that the backup site is to take over the services of the primary site, the first disaster recovery service module of the backup site does not repeat the takeover operation when the decision result queried in the arbitration service module of the backup site is that same decision result.
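Steps S906 to S909 imply that the backup site may learn of the same decision result twice, once over the heartbeat network and once from the arbitration service, and must perform the takeover only once. A minimal sketch of that de-duplication, with hypothetical class and method names and a placeholder takeover action, is shown below.

```python
class BackupDisasterRecoveryService:
    """Sketch of the backup site handling the takeover decision arriving via two
    paths (heartbeat delivery and arbitration query); the takeover is a placeholder."""

    def __init__(self):
        self.taken_over = False

    def _take_over_primary_services(self):
        if self.taken_over:
            return  # already took over (e.g. after the heartbeat delivery); do not repeat
        self.taken_over = True
        print("taking over the services of the primary site")

    def on_heartbeat_decision(self, decision):
        """S907: decision received directly from the primary site over the heartbeat network."""
        if decision == "TAKE_OVER":
            self._take_over_primary_services()

    def on_arbitration_poll(self, decision):
        """S908/S909: decision obtained by periodically querying the arbitration service."""
        if decision == "TAKE_OVER":
            self._take_over_primary_services()


backup = BackupDisasterRecoveryService()
backup.on_heartbeat_decision("TAKE_OVER")  # takes over once
backup.on_arbitration_poll("TAKE_OVER")    # no second takeover
```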
Based on the same inventive concept as the foregoing embodiments, an embodiment of the present invention provides a primary site 100, which is specifically configured to implement the method described in the embodiment shown in fig. 8. The primary site 100 runs a plurality of virtual machines, and instances of the first service may run on these virtual machines. A working state unit 101, a decision unit 102, and a sending unit 103 are deployed on a first virtual machine run by the primary site, where the first virtual machine may be one of the plurality of virtual machines, or may not be one of the plurality of virtual machines. Taking the case in which the first virtual machine is not one of the plurality of virtual machines as an example, the primary site 100 may be structured as shown in fig. 10. It should be understood that fig. 10 is only an exemplary illustration of the structure of the primary site and does not specifically limit the number of virtual machines included in the primary site, the number and types of services provided by the primary site, the relationship between the first virtual machine and the plurality of virtual machines, and the like.
The working state unit 101 is configured to determine, for a first service provided by the primary site, the working state of each instance of the first service in the plurality of virtual machines in the primary site. The decision unit 102 is configured to determine a first decision result when the number of instances whose working state is a fault among all instances of the first service in the plurality of virtual machines in the primary site satisfies a fault policy, where the first decision result indicates that the backup site is to take over the services of the primary site. The sending unit 103 is configured to send the first decision result to the backup site.
In one implementation, the working state unit 101 may be specifically configured to: determine, for each instance of the first service in the plurality of virtual machines in the primary site, a failure start time of the instance; determine the fault duration according to the failure start time; if the fault duration is greater than the fault time threshold, determine that the working state of the instance is a fault; and if the fault duration is less than or equal to the fault time threshold, determine that the working state of the instance is not a fault.
For example, when determining the failure start time of the instance, the working state unit 101 may be specifically configured to: receive and record the application health condition of the instance; and if the received application health condition of the instance is abnormal, and the most recently recorded application health condition of the instance is normal or there is no record of the application health condition of the instance, determine that the failure start time of the instance is the current time.
In another example, when determining the failure start time of the instance, the working state unit 101 may be further specifically configured to: if the application health condition of the instance reported by the first virtual machine is not received, determine that the failure start time of the instance is the current time, where the first virtual machine is the virtual machine in the primary site on which the instance is deployed.
In another implementation, the working state unit 101 may be further specifically configured to: determine, for each instance of the first service in the plurality of virtual machines in the primary site, the time at which the application health condition of the instance was last received; determine the interruption time according to that time; if the interruption time is greater than the fault time threshold, determine that the working state of the instance is a fault; and if the interruption time is less than or equal to the fault time threshold, determine that the working state of the instance is not a fault.
The decision unit 102 may be further configured to: when the number of instances whose working state is a fault among all instances of the first service does not satisfy the fault policy, determine a second decision result, where the second decision result does not indicate that the backup site is to take over the services of the primary site.
In a possible implementation, the sending unit 103 may be specifically configured to: send the first decision result to the backup site through the arbitration service.
The primary site 100 may be the primary site in the embodiments corresponding to fig. 3 or fig. 4 and is configured to perform the operations performed by the primary site in the embodiments corresponding to fig. 5 to fig. 9. The working state unit 101, the decision unit 102, and the sending unit 103 in the primary site 100 may be software units in the first disaster recovery service module shown in fig. 4.
The division into modules in the embodiments of this application is schematic and is merely a division by logical function; other division manners may be used in actual implementation. In addition, the functional modules in the embodiments of this application may be integrated into one processor, may exist alone physically, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
Where the integrated module is implemented in the form of hardware, as shown in fig. 11, the primary site may include a processor 802. Multiple virtual machines may run on the processor 802. The hardware entity corresponding to the above modules may be the processor 802. The processor 802 may be a central processing unit (CPU), a digital processing module, or the like. The primary site may further include communication interfaces 801A and 801B. The processor 802 may send messages to and receive messages from the backup site through the communication interface 801A, where the communication interface 801A may be a heartbeat interface. The processor 802 may send messages to and receive messages from the arbitration site through the communication interface 801B. The primary site further includes a memory 803 for storing the programs executed by the processor 802. The memory 803 may be a nonvolatile memory, such as a hard disk drive (HDD) or a solid-state drive (SSD), or a volatile memory, such as a random-access memory (RAM). The memory 803 may alternatively be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The processor 802 is configured to execute the program code stored in the memory 803, and is specifically configured to perform the corresponding functions of the disaster recovery agent module and the first disaster recovery service module.
In the embodiment of the present application, the specific connection medium between the communication interface 801A, the communication interface 801B, the processor 802, and the memory 803 is not limited. In the embodiment of the present application, the memory 803, the processor 802, the communication interface 801A, and the communication interface 801B are connected by the bus 804 in fig. 11, the bus is shown by a thick line in fig. 11, and the connection manner between other components is merely illustrative and not limited. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 11, but this is not intended to represent only one bus or type of bus.
The primary site shown in fig. 11 may be the primary site in the embodiments corresponding to fig. 3 or fig. 4. The processor 802 in the primary site, by executing the computer-readable instructions in the memory 803, may cause the primary site to perform the operations performed by the primary site in the embodiments corresponding to fig. 5 to fig. 9. The memory 803 stores a Linux, Unix, or Windows based operating system and virtual machine software instructions for generating virtual machines on the operating system. By executing the virtual machine software instructions on the operating system, the processor 802 may obtain, on the primary site shown in fig. 11, a primary site that includes multiple virtual machines as shown in fig. 4 or fig. 10.
An embodiment of the present invention further provides a computer-readable storage medium, configured to store the computer software instructions to be executed by the foregoing processor, where the storage medium contains the program to be executed by the foregoing processor.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive (SSD).
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

Claims (17)

1. A disaster recovery control method, characterized in that the method comprises:
for a first service provided by a primary site, determining the working state of each instance of the first service in a plurality of virtual machines in the primary site;
when the number of instances whose working state is a fault among all instances of the first service in the plurality of virtual machines in the primary site satisfies a fault policy, determining a first decision result, wherein the first decision result indicates that a backup site is to take over the services of the primary site;
and sending the first decision result to the backup site.
2. The method of claim 1, wherein determining the working state of each instance of the first service in the plurality of virtual machines in the primary site comprises:
determining, for each instance of the first service in a plurality of virtual machines in the primary site, a failure start time for the instance;
determining the fault duration according to the fault starting time;
if the fault duration is greater than the fault time threshold, determining that the working state of the instance is a fault;
and if the fault duration is less than or equal to the fault time threshold, determining that the working state of the instance is not a fault.
3. The method of claim 2, wherein determining the fault start time of the instance comprises:
receiving and recording the application health condition of the instance;
and if the received application health condition of the instance is abnormal and the application health condition of the instance recorded last time is normal or no record of the application health condition of the instance exists, determining the fault starting time of the instance to be the current time.
4. The method of claim 2, wherein determining the fault start time of the instance comprises:
if the application health condition of the instance reported by the first virtual machine is not received, determining that the failure start time of the instance is the current time, wherein the first virtual machine is the virtual machine in the primary site on which the instance is deployed.
5. The method of claim 1, wherein determining the working state of each instance of the first service in the plurality of virtual machines in the primary site comprises:
determining, for each instance of the first service in a plurality of virtual machines in the primary site, a time of last receipt of an application health of the instance;
determining an interruption time according to the time at which the application health condition of the instance was last received;
if the interruption time is greater than a fault time threshold value, determining that the working state of the instance is a fault;
and if the interruption time is less than or equal to the fault time threshold value, determining that the working state of the instance is not a fault.
6. The method of any of claims 1 to 5, wherein after determining the working state of each instance of the first service in the plurality of virtual machines in the primary site, the method further comprises:
if the number of instances whose working state is a fault among all instances of the first service in the primary site does not satisfy the fault policy, determining a second decision result, wherein the second decision result does not indicate that the backup site is to take over the services of the primary site.
7. The method of claim 1, wherein the sending the first decision result to the backup site comprises:
sending the first decision result to the backup site through an arbitration service.
8. A primary site, comprising: a plurality of virtual machines, and a working state unit, a decision unit, and a sending unit which are deployed in a first virtual machine, wherein
the working state unit is configured to determine, for a first service provided by the primary site, the working state of each instance of the first service in the plurality of virtual machines in the primary site;
the decision unit is configured to determine a first decision result when the number of instances whose working state is a fault among all instances of the first service in the plurality of virtual machines in the primary site satisfies a fault policy, wherein the first decision result indicates that a backup site is to take over the services of the primary site; and
the sending unit is configured to send the first decision result to the backup site.
9. The primary site of claim 8, wherein the working state unit is specifically configured to:
determining, for each instance of the first service in a plurality of virtual machines in the primary site, a failure start time for the instance;
determining the fault duration according to the fault starting time;
if the fault duration is greater than the fault time threshold, determining that the working state of the instance is a fault;
and if the fault duration is less than or equal to the fault time threshold, determining that the working state of the instance is not a fault.
10. The primary site of claim 9, wherein the working state unit, when determining the failure start time of the instance, is specifically configured to:
receiving and recording the application health condition of the instance;
and if the received application health condition of the instance is abnormal and the application health condition of the instance recorded last time is normal or no record of the application health condition of the instance exists, determining the fault starting time of the instance to be the current time.
11. The primary site of claim 9, wherein the working state unit, when determining the failure start time of the instance, is specifically configured to:
if the application health condition of the instance reported by the first virtual machine is not received, determining that the failure start time of the instance is the current time, wherein the first virtual machine is the virtual machine in the primary site on which the instance is deployed.
12. The primary site of claim 8, wherein the working state unit is specifically configured to:
determining, for each instance of the first service in a plurality of virtual machines in the primary site, a time of last receipt of an application health of the instance;
determining an interruption time according to the time of the last receiving of the application health condition of the instance;
if the interruption time is greater than a fault time threshold value, determining that the working state of the instance is a fault;
and if the interruption time is less than or equal to the fault time threshold value, determining that the working state of the instance is not a fault.
13. The primary site of any one of claims 8 to 12, wherein the decision unit is further configured to:
and when the number of the instances with the working states of faults in all the instances of the first service does not meet the fault strategy, determining a second decision result, wherein the second decision result does not indicate the standby site to take over the service of the main site.
14. The primary site of claim 8, wherein the sending unit is specifically configured to:
and sending the first decision result to the standby station through arbitration service.
15. The primary site of claim 8, wherein the first virtual machine is one of the plurality of virtual machines, or the first virtual machine is not one of the plurality of virtual machines.
16. A disaster recovery system comprising a primary site according to any one of claims 8 to 14 and a backup site.
17. The disaster recovery system of claim 16 further comprising an arbitration site;
the arbitration station is used for providing arbitration service for the main station and the standby station.
CN201811513686.9A 2018-12-11 2018-12-11 Disaster recovery control method, device and system Active CN111309515B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811513686.9A CN111309515B (en) 2018-12-11 2018-12-11 Disaster recovery control method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811513686.9A CN111309515B (en) 2018-12-11 2018-12-11 Disaster recovery control method, device and system

Publications (2)

Publication Number Publication Date
CN111309515A true CN111309515A (en) 2020-06-19
CN111309515B CN111309515B (en) 2023-11-28

Family

ID=71150545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811513686.9A Active CN111309515B (en) 2018-12-11 2018-12-11 Disaster recovery control method, device and system

Country Status (1)

Country Link
CN (1) CN111309515B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112714461A (en) * 2021-01-29 2021-04-27 四川安迪科技实业有限公司 DAMA satellite network central station protection switching method
WO2023202049A1 (en) * 2022-04-20 2023-10-26 华为云计算技术有限公司 Disaster tolerance management method and disaster tolerance management device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080189468A1 (en) * 2007-02-02 2008-08-07 Vmware, Inc. High Availability Virtual Machine Cluster
CN102473105A (en) * 2010-01-04 2012-05-23 阿瓦雅公司 Packet mirroring between primary and secondary virtualized software images for improved system failover performance
CN103118100A (en) * 2013-01-25 2013-05-22 武汉大学 Guarantee method and guarantee system for improving usability of virtual machine application
US20130275805A1 (en) * 2012-04-12 2013-10-17 International Business Machines Corporation Providing application based monitoring and recovery for a hypervisor of an ha cluster
CN103778031A (en) * 2014-01-15 2014-05-07 华中科技大学 Distributed system multilevel fault tolerance method under cloud environment
CN103812675A (en) * 2012-11-08 2014-05-21 中兴通讯股份有限公司 Method and system for realizing allopatric disaster recovery switching of service delivery platform
CN108632057A (en) * 2017-03-17 2018-10-09 华为技术有限公司 A kind of fault recovery method of cloud computing server, device and management system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080189468A1 (en) * 2007-02-02 2008-08-07 Vmware, Inc. High Availability Virtual Machine Cluster
CN102473105A (en) * 2010-01-04 2012-05-23 阿瓦雅公司 Packet mirroring between primary and secondary virtualized software images for improved system failover performance
US20130275805A1 (en) * 2012-04-12 2013-10-17 International Business Machines Corporation Providing application based monitoring and recovery for a hypervisor of an ha cluster
CN104205060A (en) * 2012-04-12 2014-12-10 国际商业机器公司 Providing application based monitoring and recovery for a hypervisor of an ha cluster
CN103812675A (en) * 2012-11-08 2014-05-21 中兴通讯股份有限公司 Method and system for realizing allopatric disaster recovery switching of service delivery platform
CN103118100A (en) * 2013-01-25 2013-05-22 武汉大学 Guarantee method and guarantee system for improving usability of virtual machine application
CN103778031A (en) * 2014-01-15 2014-05-07 华中科技大学 Distributed system multilevel fault tolerance method under cloud environment
CN108632057A (en) * 2017-03-17 2018-10-09 华为技术有限公司 A kind of fault recovery method of cloud computing server, device and management system


Also Published As

Publication number Publication date
CN111309515B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
US7702667B2 (en) Methods and systems for validating accessibility and currency of replicated data
CN102394914A (en) Cluster brain-split processing method and device
CN110830283B (en) Fault detection method, device, equipment and system
CN112463448B (en) Distributed cluster database synchronization method, device, equipment and storage medium
US20080288812A1 (en) Cluster system and an error recovery method thereof
CN107229425B (en) Data storage method and device
CN110635950A (en) Double-data-center disaster recovery system
CN112217847A (en) Micro service platform, implementation method thereof, electronic device and storage medium
CN113535480A (en) Data disaster recovery system and method
CN111309515B (en) Disaster recovery control method, device and system
CN108512753B (en) Method and device for transmitting messages in cluster file system
CN113489149B (en) Power grid monitoring system service master node selection method based on real-time state sensing
CN111342986B (en) Distributed node management method and device, distributed system and storage medium
CN113765690A (en) Cluster switching method, system, device, terminal, server and storage medium
CN111953808A (en) Data transmission switching method of dual-machine dual-active architecture and architecture construction system
JP2009217504A (en) Computer system, computer control method and computer control program
CN114124803B (en) Device management method and device, electronic device and storage medium
WO2019241199A1 (en) System and method for predictive maintenance of networked devices
CN113162797B (en) Method, system and medium for switching master node fault of distributed cluster
CN112491633B (en) Fault recovery method, system and related components of multi-node cluster
CN112131201B (en) Method, system, equipment and medium for high availability of network additional storage
CN114328033A (en) Method and device for keeping service configuration consistency of high-availability equipment group
CN111666170B (en) Fault node processing method and device based on distributed framework
CN109753292B (en) Method and device for deploying multiple applications in multiple single instance database service
CN108897645B (en) Database cluster disaster tolerance method and system based on standby heartbeat disk

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant