WO2012155630A1

WO2012155630A1 - Method, device, and system for disaster recovery

Info

Publication number: WO2012155630A1
Application number: PCT/CN2012/072357
Authority: WO
Inventors: 邵金龙; 景伟东; 卢勤元
Original assignee: 中兴通讯股份有限公司
Priority date: 2011-09-01
Filing date: 2012-03-15
Publication date: 2012-11-22
Also published as: CN102291262B; CN102291262A

Abstract

A method and system for disaster recovery, the method comprising: configuring correspondences between operation devices and back-up devices, and configuring, for each group of operation device and back-up device, a status message identifying whether the operation device or the back-up device is in operation; searching for an operation device that has malfunctioned and modifying the status message corresponding to that operation device; and transmitting the modified status message to all devices in the disaster recovery system, so that, in the process of establishing links, each operation device or back-up device in the disaster recovery system selects a target device for link establishment according to the status message.

Description

Method, device and system for disaster tolerance

Technical field

The invention relates to a disaster recovery method, device and system in an intelligent network disaster tolerance system, and more specifically to a method of automatically switching to the disaster recovery site and restoring system operation in time when some equipment in the disaster recovery system is abnormal. This method can be extended to other application scenarios in a disaster-tolerant environment, not just intelligent network applications.

Background technique

The disaster recovery system is designed to avoid fatal losses caused by severe disasters such as earthquakes and power outages. Therefore, two identical systems are established in two cities or in distant places. When a disaster such as an earthquake causes the production system to be completely unavailable, the disaster recovery system can be enabled to restore the business in time and minimize the damage caused by the disaster.

In theory, a serious natural disaster such as a major earthquake damages the production system completely. Recovery at the disaster recovery site is therefore an overall switchover: all services move to the equipment of the disaster recovery site. In the disaster-tolerant environment shown in Figure 1, the production site is shut down entirely and services are enabled on the disaster-tolerant site.

However, an overall switchover to the disaster-tolerant site is costly, and there are situations for which it is not suitable. For example, a small fire may damage some equipment in the equipment room while most of the remaining equipment stays normal. An overall switchover in this case would cause more disruption than the fault itself, so partial switching is required. That is, as shown in Figure 2, if Service Control Point (SCP) 2 is damaged and cannot be repaired in time, the device SCP2 is replaced by the corresponding device SCP2B on the disaster recovery site. The running system then consists of Interface Machine Point (IMP) 1, SCP1, and the Service Management Point (SMP) of the production site, together with SCP2B of the disaster recovery site. In this way, disaster tolerance is achieved without subjecting the equipment that is still normal to an overall switchover, and as much of the system as possible keeps running. However, this flexibility also brings complexity in configuration.

Take the disaster recovery system shown in Figure 1 as an example. Figure 1 shows a complete disaster recovery system including a production site and a disaster recovery site. The production site devices are IMP1, SCP1, SCP2, and SMP. The disaster recovery site devices are IMP1B, SCP1B, SCP2B, and SMPB, each identical to its counterpart at the production site. Accordingly, if SCP1 in the production system fails, SCP1B at the disaster recovery site can replace it. When a device is replaced, the other devices related to it need to be updated at the same time. This is the process of switching a disaster recovery device.

In the system shown in Figure 1, the device IMP1 acts as a client and needs to establish links with the SCP1, SCP2, and SMP devices in order to communicate. SCP1 and SCP2 also act as clients, and each needs to actively establish a link with the SMP. When the device SCP2 fails, the corresponding device SCP2B at the disaster recovery site can be enabled to replace it. The resulting network connection is shown in Figure 2. The required actions are to ensure that the affected client device IMP1 can establish a link with SCP2B, and that SCP2B itself establishes a link with the SMP. Therefore, the configuration file on the IMP1 device must be modified so that the entry that originally pointed to SCP2 points to SCP2B, and the program must be restarted for the change to take effect. At the same time, the configuration of SCP2B must be modified to point to the SMP address, and its application restarted.

The environment in the above example is relatively simple. If two devices fail (SCP2 and SMP), the resulting network connection is as shown in Figure 3, and all the client programs involved, including those on IMP1, SCP1, and SCP2B, need to be reconfigured and restarted.

In practice, there are more devices at the production site and the network connections are more complicated. If one or more devices need to be switched, two problems arise. First, the operation is complicated: the operator must log in to each relevant device through telnet, manually edit its configuration file, and restart the device program. Second, the operator must make the correct modifications for every switching scenario (for any device failure, the required operations must be known exactly), which places extremely high demands on the operator.

Summary of the invention

In the prior art, switching in a disaster recovery system requires manually editing the related configuration of each affected device and restarting its application. The operation is complicated, and the many possible switching scenarios make operator errors likely. The technical problem to be solved by the embodiment of the present invention is to provide a disaster recovery method, device and system that switch automatically to a standby device through a simple operation in the event of a failure.

In order to solve the above technical problem, the embodiment of the present invention provides a disaster tolerance method, including:

configuring a correspondence between the working device and the standby device, and setting, for each group of working device and standby device, status information indicating whether the working device or the standby device is in operation;

finding the working device that has failed, and modifying the status information corresponding to that working device; and

sending the changed status information to all the devices in the disaster recovery system, so that each working device or standby device in the disaster recovery system, in the process of establishing a link, selects the target device for the link according to the status information.

The above method can also have the following characteristics:

After configuring the correspondence between the working device and the standby device and setting, for each group of working device and standby device, the status information indicating whether the working device or the standby device is in operation, the method further includes: storing the correspondence and the status information.

The embodiment of the invention further provides a server, including:

a configuration module, configured to: configure a correspondence between the working device and the standby device, and set, for each group of working device and standby device, status information indicating whether the working device or the standby device is in operation;

a lookup module, configured to: find a faulty working device;

a modification module, configured to: modify the status information corresponding to the faulty working device; and

a delivery module, configured to: send the latest correspondence and status information to all devices in the disaster recovery system, so that each working device or standby device in the disaster recovery system, in the process of establishing a link, selects the target device for the link according to the status information.

The above server may also include:

a storage module, configured to: store the correspondence and the status information.

The embodiment of the invention further provides a disaster tolerance method, including:

receiving correspondence information between the working device and the standby device in the disaster recovery system delivered by the server, where the correspondence information includes status information indicating whether the working device or the standby device is in operation; and

in the process of establishing a link with a target device, selecting, according to the correspondence information, whichever of the target device and the standby device corresponding to the target device is in operation, and establishing the link with it.

The above method can also have the following characteristics:

After receiving the correspondence information between the working device and the standby device in the disaster recovery system delivered by the server, the method further includes:

storing or updating the correspondence information.

An embodiment of the present invention further provides an apparatus, including:

a proxy module, configured to: receive correspondence information between the working device and the standby device in the disaster recovery system delivered by the server, where the correspondence information includes status information indicating whether the working device or the standby device is in operation; and

an application module, configured to: in the process of establishing a link with a target device, select, according to the correspondence information, whichever of the target device and the standby device corresponding to the target device is in operation, and establish the link with it.

The above device may further include:

a storage module, configured to: store or update the correspondence information.

The embodiment of the invention further provides a disaster tolerance system, including the foregoing server and a plurality of the foregoing devices.

In summary, with the method, device, and system for disaster tolerance of the embodiment of the present invention, the operator performs disaster recovery switching by executing a simple command. Switching is fast and automatic, which avoids complicated manual operations, reduces possible operation errors, and improves the efficiency of disaster recovery.

Brief description of the drawings

FIG. 1 is a connection diagram of an intelligent network disaster tolerance system in the prior art;

FIG. 2 is a connection diagram formed after failover of the device SCP2 in the intelligent network disaster tolerance system;

FIG. 3 is a connection diagram formed after failover of the devices SCP2 and SMP in the intelligent network disaster tolerance system;

FIG. 4 is a schematic structural diagram of a disaster tolerant system according to an embodiment of the present invention;

FIG. 5 is a flowchart of a disaster tolerance method according to Embodiment 1 of the present invention;

FIG. 6 is a flowchart of a disaster tolerance method according to Embodiment 2 of the present invention;

FIG. 7 is a flowchart of an operation of switching a faulty device in an intelligent network disaster tolerance system according to an embodiment of the present invention.

Preferred embodiment of the invention

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that, in the case of no conflict, the features in the embodiments and the embodiments in the present application may be arbitrarily combined with each other.

The embodiment of the present invention makes use of the correspondence between the devices on the production site and the devices on the disaster recovery site in the disaster tolerant system. It constructs a system in which an application, before establishing a link, first uses the device correspondence information to find the device that is actually usable and then establishes the link with that device. As a result, switching to a disaster recovery device requires only a simple operation rather than manual reconfiguration.

In this embodiment, the information about a production site device and its corresponding disaster recovery site device is referred to as a domain name, and the system provided is called a domain name service system. The system includes a disaster recovery system composed of a production site and a disaster recovery site, and a domain name management server running a domain name management program. The production site and the disaster recovery site are each composed of one or more intelligent network devices running an agent and an intelligent network application. According to their functions, the devices can be divided into SMP, SCP, IMP and so on.

FIG. 4 is a schematic structural diagram of a disaster tolerance system according to an embodiment of the present invention. As shown in the figure, the system includes a domain name management server and multiple intelligent network devices, where:

The domain name management server runs a domain name management program and is connected to each intelligent network device through that program, thereby implementing the configuration, modification, and distribution of specific instruction information, and sending the correspondence information of the production site and disaster recovery site devices to each device.

The domain name management server can include:

a configuration module, configured to configure a correspondence between the working device and the standby device, and to set a working state for each group of working device and standby device;

a lookup module, configured to find the faulty working device;

a modification module, configured to modify the working state corresponding to the faulty working device; and

a sending module, configured to send the correspondence and working state information to all devices in the disaster recovery system, so that each working device or standby device in the system, in the process of establishing a link, selects the target device for the link according to the working state information.

The domain name management server may further include:

a storage module, configured to store the correspondence and the working state information.

The intelligent network device is a server that runs intelligent network services and has an agent installed. The device supports current mainstream operating systems, including Linux, AIX, HP-UX, and Solaris. The agent is deployed on each intelligent network device and is responsible for interacting with the domain name management server and executing the commands it issues, so that the device correspondence information is saved on each device.

As can be seen from the structure of FIG. 4, the device correspondence information is first sent by the domain name management server to each device in the disaster recovery system. After receiving the message, the agent of each device updates the information in the shared memory that it maintains. The intelligent network application runs on the same device as the agent and has the right to access the specified shared memory, so it can query the information in the shared memory, determine which device is currently active, and establish the link automatically.
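The agent-to-application hand-off through shared memory might look like the following minimal sketch. It assumes Python's standard `multiprocessing.shared_memory` module, a 4-byte length prefix, and JSON encoding; the patent specifies none of these, and the names `agent_publish`, `application_read`, and `SEGMENT_SIZE` are hypothetical. For brevity both sides run in one process and no locking is shown.

```python
import json
from multiprocessing import shared_memory

SEGMENT_SIZE = 4096  # assumed capacity, large enough for this example


def agent_publish(shm: shared_memory.SharedMemory, table: dict) -> None:
    """Agent side: write the table as length-prefixed JSON (assumed layout)."""
    payload = json.dumps(table).encode("utf-8")
    if len(payload) + 4 > shm.size:
        raise ValueError("correspondence table does not fit in the segment")
    shm.buf[4:4 + len(payload)] = payload
    shm.buf[0:4] = len(payload).to_bytes(4, "big")


def application_read(segment_name: str) -> dict:
    """Application side: attach to the segment by name and read the table."""
    shm = shared_memory.SharedMemory(name=segment_name)
    try:
        length = int.from_bytes(bytes(shm.buf[0:4]), "big")
        return json.loads(bytes(shm.buf[4:4 + length]).decode("utf-8"))
    finally:
        shm.close()


segment = shared_memory.SharedMemory(create=True, size=SEGMENT_SIZE)
try:
    agent_publish(segment, {"SCP2": {"standby": "SCP2B", "state": 2}})
    seen_by_app = application_read(segment.name)
finally:
    segment.close()
    segment.unlink()
```

A real deployment would need some form of synchronization between the writer and its readers, which this sketch leaves out.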

The domain name management server is configured to manage the production site devices and the corresponding disaster recovery site devices in the disaster tolerant system, together with the state of each such group (active or standby) and other key information, namely the domain name information. The state can be represented by two values, 1 and 2: 1 means that the production site device in the group is active and must be used; 2 means that the disaster recovery site device in the group is active and must be used. The server sends this domain name information to all devices in real time for storage.
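One way to picture a single domain name record with its two-valued state is the sketch below. The class and field names (`ActiveSite`, `DomainRecord`, `production_addr`, `recovery_addr`) are illustrative assumptions; only the meaning of the values 1 and 2 comes from the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class ActiveSite(IntEnum):
    PRODUCTION = 1         # the production site device is active
    DISASTER_RECOVERY = 2  # the disaster recovery site device is active


@dataclass
class DomainRecord:
    production_addr: str   # production site device
    recovery_addr: str     # corresponding disaster recovery site device
    state: ActiveSite = ActiveSite.PRODUCTION

    def active_addr(self) -> str:
        """Return the address of whichever device in the group is active."""
        if self.state is ActiveSite.PRODUCTION:
            return self.production_addr
        return self.recovery_addr


record = DomainRecord("SCP2", "SCP2B")
```

Switching the group over is then a single field change: setting `record.state = ActiveSite.DISASTER_RECOVERY` makes `record.active_addr()` return the disaster recovery site device.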

If the status of any device changes, for example if device A fails and the system needs to switch to disaster recovery site device B, the change information must be sent to all other devices, so that the devices that use device A as their server can reconnect to device B, which is now the active device in the group (A and B).

The proxy module (agent) on the device is responsible for receiving the correspondence information between the working device and the standby device in the disaster recovery system delivered by the domain name management server, where the correspondence information includes status information indicating whether the working device or the standby device is in operation. The received information is saved to the shared memory of the device for use by the application module running on the device. The agent module may include the following modules:

a communication module, configured to communicate with the domain name management server and receive instructions from it;

an analysis module, configured to decompress the received message and parse it into usable information according to a specified format;

an execution module, configured to update and save the parsed information into the shared memory for access by other applications; and

a core module, configured to coordinate the work of the other modules.
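The analysis and execution modules could be sketched as follows. The wire format here (JSON compressed with zlib, with a `type` field and a `records` list) is purely an assumption for illustration, since the text only says the message is decompressed and parsed "according to a specified format". A plain dictionary stands in for the shared memory.

```python
import json
import zlib


def pack_message(msg_type: str, records: list) -> bytes:
    """Server side of the assumed format: JSON compressed with zlib."""
    body = json.dumps({"type": msg_type, "records": records})
    return zlib.compress(body.encode("utf-8"))


def parse_message(raw: bytes) -> dict:
    """Analysis module: decompress the bytes and validate the expected fields."""
    msg = json.loads(zlib.decompress(raw).decode("utf-8"))
    if msg.get("type") != "sync" or not isinstance(msg.get("records"), list):
        raise ValueError("not a well-formed synchronization message")
    return msg


def apply_message(store: dict, msg: dict) -> None:
    """Execution module: replace the stored table with the parsed records."""
    store.clear()
    for rec in msg["records"]:
        store[rec["production"]] = rec


store = {}  # stands in for the shared memory in this sketch
raw = pack_message("sync", [{"production": "SCP2", "recovery": "SCP2B", "state": 2}])
apply_message(store, parse_message(raw))
```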

The application module includes the processing functions of intelligent network services. During the establishment of a link, it selects, according to the status information corresponding to the specified target device, whether to establish the link with the target device or with the backup device corresponding to the target device. It can include the following modules:

a communication module, configured to communicate with other device servers or other devices; and

a database module, configured to log in to the database on this server or on another server and perform the related database operations.

The apparatus may further include a storage module configured to perform operations of storing and updating data, such as storing information of a correspondence between the working device and the standby device and an operational state in a database, or updating data in the database.

The various parts of the above systems work together to finally realize the function of fast and automatic switching of devices in the disaster recovery system.

The embodiment of the present invention provides a method in which, after one or a few devices at the production site fail, the corresponding backup device at the disaster recovery site can start working immediately, and the other devices connected to the faulty device automatically connect to that backup device, thus achieving fast and automatic switching.

FIG. 5 is a flowchart of a disaster tolerance method according to Embodiment 1 of the present invention. The method is applicable to the domain name management server, and includes the following steps:

S51. Configure a correspondence between the working device and the standby device, and set a working state for each group of working device and standby device.

S52. Search for a faulty working device, and modify the working state corresponding to that working device.

S53. Send the changed working state information to all the devices in the disaster recovery system, so that each working device or standby device in the disaster recovery system, in the process of establishing a link, selects the target device for the link according to the working state information.
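The three server-side steps can be sketched as plain functions over an in-memory table. The function names, the dictionary layout, and the `is_faulty` callback are assumptions made for illustration; the text does not say how a fault is detected (in the application example later on, the operator selects the faulty device by hand).

```python
from typing import Callable, Dict, List

# working device name -> {"working": ..., "standby": ..., "state": 1 or 2}
Table = Dict[str, dict]


def configure(pairs: Dict[str, str]) -> Table:
    """First step: pair each working device with its standby; state 1 = working device in use."""
    return {w: {"working": w, "standby": s, "state": 1} for w, s in pairs.items()}


def mark_failed(table: Table, is_faulty: Callable[[str], bool]) -> List[str]:
    """Second step: find faulty working devices and flip their group's state to 2."""
    failed = [w for w, rec in table.items() if rec["state"] == 1 and is_faulty(w)]
    for w in failed:
        table[w]["state"] = 2
    return failed


def distribute(table: Table, devices: Dict[str, Table]) -> None:
    """Third step: give every device in the system its own copy of the table."""
    for name in devices:
        devices[name] = {k: dict(v) for k, v in table.items()}


table = configure({"SCP1": "SCP1B", "SCP2": "SCP2B", "SMP": "SMPB"})
failed = mark_failed(table, lambda device: device == "SCP2")
inboxes = {"IMP1": {}, "SCP1": {}, "SCP2B": {}}
distribute(table, inboxes)
```

After the run, every device's copy shows state 2 for the SCP2 group and state 1 for the others, matching the partial switchover of Figure 2.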

FIG. 6 is a flowchart of a disaster tolerance method according to Embodiment 2 of the present invention. The method is applicable to the foregoing apparatus, and includes the following steps:

S61. Receive the correspondence information of the working device and the standby device in the disaster recovery system delivered by the server, where the correspondence information includes status information indicating whether the working device or the standby device is in operation;

S62. In the process of establishing a link, select, according to the status information corresponding to the target device, whether to establish the link with the target device or with the backup device corresponding to the target device.

The main purpose of this embodiment is to replace the existing manual operation with automatic recognition by the intelligent network application. The key is how to manage, transfer, and save the related device correspondence information.

The following is a flow of an application example of the present invention, which includes at least the following steps:

S10. In the domain name management program interface, enter the device correspondence information between the production site devices and the disaster recovery site devices in the disaster recovery system into the database system for storage, and distribute the domain name information of all devices to the agents of all devices for processing.

The specific information includes: production site device information, disaster recovery site device information, and device state (indicating which device is currently active and which is the backup). In the simplest case, the production site device information is the IP address of device A, the disaster recovery site device information is the IP address of device B, and the device state has two values: 1 indicates that the production site device is the active device, and 2 indicates that the disaster recovery site device is the active device.

Thus, when a client program wants to connect to a server, it first searches the domain name information for the server's IP address and finds that the address belongs to device A (using the information in the above example). It then checks the device state. If the state is 2, device A is currently the backup (possibly because of a fault) and device B is active, so the connection must actually be made to device B. For the client, link establishment is therefore completely automatic.

S20. When a production site device is faulty, select the faulty device's information in the domain name management program interface, change the state of the backup device to active, and distribute the updated device correspondence information to all devices.

S30. Each device receives the device correspondence information from the domain name management server, parses it, and saves the updated information to the shared memory.

S40. When the application on each device establishes a link with another device, it uses the information (that is, the IP address) of the device to be connected, read from the configuration file, to find the information of the corresponding active device in the shared memory, and establishes the link with the active device. When a device is faulty, the clients connected to it automatically reconnect, and the link to the standby device at the disaster recovery site is established according to the above steps.
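The client-side lookup described in these steps can be illustrated with a small function keyed by the configured IP address. The table layout, the name `resolve_active`, and the fallback of using the configured address when no record exists are assumptions; the addresses are documentation-range placeholders.

```python
from typing import Dict, Tuple

# configured (production site) IP -> (disaster recovery site IP, state)
DomainTable = Dict[str, Tuple[str, int]]


def resolve_active(table: DomainTable, configured_ip: str) -> str:
    """Return the IP address the client should actually connect to."""
    if configured_ip not in table:
        return configured_ip  # assumed fallback: no record, connect as configured
    recovery_ip, state = table[configured_ip]
    return configured_ip if state == 1 else recovery_ip


table = {"192.0.2.10": ("198.51.100.10", 1)}  # device A -> device B, A active
before = resolve_active(table, "192.0.2.10")

table["192.0.2.10"] = ("198.51.100.10", 2)    # switch the group to device B
after = resolve_active(table, "192.0.2.10")
```

The client's configuration file never changes: it keeps naming device A, and only the state value decides which address is returned.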

In the application example, before step S10, the correspondence between all the production site devices and the disaster recovery site devices in the disaster recovery system has been determined, and the agents have been deployed on all devices of the production site and the disaster recovery site and are running normally.

In the application example, step S10 includes: entering, in the domain name management server, the correspondence information of all devices, including the device state; saving the data to the server's own database when entry is complete; and performing distribution, that is, distributing all the device correspondence information to the selected devices for synchronous update.

In the application example, step S20 includes: obtaining, in the domain name management server, the latest device correspondence information from its own database; selecting the record of the faulty device and changing the state of the corresponding backup device to active, thereby enabling the disaster recovery site device; and performing distribution, that is, distributing all the updated device correspondence information to all devices for synchronous update.

In the application example, step S30 includes: the device server is in a listening state; it receives a message from the domain name management server; it parses the message and determines that it is a synchronization message; it updates the specified shared memory on the server, saving the latest device correspondence information to the shared memory; and it returns a success or failure result message to the domain name management server.

In the application example, step S40 includes: when a device fails, a client device linked to it attempts to reconnect because of the link problem; the client determines the IP address of the faulty device (if the client device has been restarted, it reads the IP address of the faulty device from the configuration file); with that IP address as a parameter, it calls the application programming interface (API) function provided by the agent to obtain the IP address of the device whose current state is active, that is, the address of the backup device at the disaster recovery site; and it calls the socket function to establish a link to the device at the disaster recovery site.
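The reconnection in step S40 might be sketched as below, using Python's standard `socket` module. The `get_active` callable is a hypothetical stand-in for the agent's API function, whose real name and signature the text does not give. The demo runs on the loopback interface, with a local listener playing the standby device.

```python
import socket
from typing import Callable, Tuple

Address = Tuple[str, int]


def connect_to_active(configured: Address,
                      get_active: Callable[[Address], Address],
                      timeout: float = 2.0) -> socket.socket:
    """Ask the agent-side lookup for the active device, then open a TCP link to it."""
    target = get_active(configured)
    return socket.create_connection(target, timeout=timeout)


# Loopback demo: a local listener stands in for the standby device.
standby = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
standby.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
standby.listen(1)
standby_addr = standby.getsockname()

faulty_addr = ("127.0.0.1", 1)   # placeholder for the failed device; never dialled here
lookup = {faulty_addr: standby_addr}

link = connect_to_active(faulty_addr, lambda addr: lookup.get(addr, addr))
peer = link.getpeername()
link.close()
standby.close()
```

The client still passes the address from its configuration file; the lookup redirects the connection to the standby device.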

In the above application example, the domain name management server, the proxy module on each device, and the intelligent network application module on each device together complete the collection, transmission, and use of the domain name information, forming a complete domain name service system. The system simplifies the process and operation steps of device switching in the disaster recovery system and greatly improves efficiency.

FIG. 7 is a flowchart of an operation of switching a device in an intelligent network disaster recovery system according to an embodiment of the present invention. The operation steps are as follows:

Step 701: Device A is faulty, and the state of the disaster recovery site device corresponding to device A needs to be modified so that the disaster recovery site device is used. The disaster recovery device switching operation starts.

Step 711: The user selects a query in the domain name management server to list the correspondence information of all devices in the current disaster recovery system.

Step 712: In the domain name management server, select the domain name information corresponding to device A and modify the states of the production site device and the disaster recovery site device, changing the state of the disaster recovery site device to active.

Step 713: In the domain name management server, when delivery is selected, the latest device correspondence information is automatically sent to all devices.

Step 721: After receiving the message from the domain name management server, the device A parses the message, and determines that it is a synchronous operation, and obtains the latest device correspondence information.

The device in this embodiment runs two programs: an agent and an intelligent network application. The agent interacts with the domain name management server, while the intelligent network application provides the service. A fault of device A therefore means that the intelligent network application is faulty and cannot provide the service, while the agent is still normal.

Of course, there is another situation in which device A shuts down because of the failure, and neither the agent nor the intelligent network application can be used. In this case, the domain name information cannot be delivered to device A. The other devices, however, still receive the message normally and learn that device A is currently unavailable. When they later need to connect to device A, they find by query that the device to connect to is device B.

Step 722: Device A updates the shared memory and saves the latest device correspondence information to it.

Step 723: Device A also saves the latest device correspondence information to the specified file of the server to ensure that the information can be restored when the agent restarts abnormally.

Step 724: Device A returns the result of the delivery operation. If any of steps 721, 722, and 723 fails, the domain name operation fails; if all of them succeed, the domain name operation succeeds.

Step 731: After receiving the message from the domain name management server, the device B parses the message, and determines that it is a synchronous operation, and obtains the latest device correspondence information.

Step 732: Device B updates the shared memory of the server where it resides, and saves the latest device correspondence information to it.

Step 733: Device B also saves the latest device correspondence information to the specified file of the server to ensure that the information can be restored when the agent restarts abnormally.

Step 734: Device B returns the result of the delivery operation. If any of steps 731, 732, and 733 fails, the domain name operation fails; if all of them succeed, the domain name operation succeeds.
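Steps 723 and 733 save the table to a file so that the agent can restore it after an abnormal restart. A minimal sketch of that persistence follows, assuming a JSON file and a write-then-rename pattern; the file name `domain_table.json` and both function names are hypothetical.

```python
import json
import os
import tempfile


def save_table(path: str, table: dict) -> None:
    """Write to a temporary file, then rename it over the target path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(table, handle)
    os.replace(tmp_path, path)


def load_table(path: str) -> dict:
    """Called when the agent starts: restore the last saved table, or start empty."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


workdir = tempfile.mkdtemp()
backup = os.path.join(workdir, "domain_table.json")
empty = load_table(backup)                      # first start: nothing saved yet
save_table(backup, {"SCP2": {"recovery": "SCP2B", "state": 2}})
restored = load_table(backup)                   # after a restart
```

Renaming a finished temporary file over the target is one common way to avoid leaving a partially written file behind if the agent stops mid-write.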

According to the number of devices in the disaster recovery system, the domain name service system automatically repeats the operations of step 713, step 721, step 722, step 723, and step 724 for each device.

Step 702: After delivery to all the devices is completed, the switching of device A is complete.
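The per-device result rule (a delivery succeeds only if parsing, the shared memory update, and the file save all succeed) and the repetition over all devices can be sketched as follows. Modelling each step as a callable that returns a boolean, and treating an unreachable device as a failed delivery, are assumptions made for this illustration.

```python
from typing import Callable, Dict, List

# A step receives the new table and returns True on success.
Step = Callable[[dict], bool]


def deliver_to_device(steps: List[Step], table: dict) -> bool:
    """One device succeeds only if every one of its steps succeeds."""
    try:
        return all(step(table) for step in steps)
    except Exception:
        return False  # for example, the device is shut down and unreachable


def deliver_to_all(devices: Dict[str, List[Step]], table: dict) -> Dict[str, bool]:
    """Repeat the delivery for each device and collect the per-device results."""
    return {name: deliver_to_device(steps, table) for name, steps in devices.items()}


def always_ok(_table: dict) -> bool:
    return True


def unreachable(_table: dict) -> bool:
    raise ConnectionError("device is shut down")


results = deliver_to_all(
    {"A": [unreachable],
     "B": [always_ok, always_ok, always_ok],
     "IMP1": [always_ok, always_ok, always_ok]},
    {"A": {"recovery": "B", "state": 2}},
)
```

In this run device A cannot be reached, yet B and IMP1 still receive the update, which corresponds to the case where device A has shut down completely.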

One of ordinary skill in the art will appreciate that all or part of the above steps may be performed by a program instructing the associated hardware, and that the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk. Alternatively, all or part of the steps of the above embodiments may also be implemented using one or more integrated circuits. Correspondingly, each module/unit in the foregoing embodiment may be implemented in the form of hardware, or may be implemented in the form of a software function module. The invention is not limited to any specific combination of hardware and software.

The above is only a preferred embodiment of the present invention. The present invention may of course be embodied in various other embodiments, and various changes and modifications may be made without departing from the spirit of the invention and the scope of the appended claims.

Industrial applicability

The method, device, and system for disaster tolerance of the embodiments of the present invention enable an operator to perform disaster recovery switching by executing a simple command. Switching is fast and automatic, which avoids complicated manual operations, reduces possible operation errors, and improves the efficiency of disaster recovery.

Claims

1. A method of disaster recovery, including:

configuring a correspondence between the working device and the standby device, and setting, for each group of working device and standby device, status information indicating whether the working device or the standby device is in operation;

2. The method according to claim 1, wherein, after the configuring of the correspondence between the working device and the standby device and the setting, for each group of working device and standby device, of the status information indicating whether the working device or the standby device is in operation, the method further includes:

storing the correspondence and the status information.

3. A server, including:

a lookup module, configured to: find a faulty working device; a modification module, configured to: modify the status information corresponding to the faulty working device; and

4. The server of claim 3, further comprising

a storage module, configured to: store the correspondence and the status information.

5. A method of disaster tolerance, including:

receiving correspondence information between the working device and the standby device in the disaster recovery system delivered by the server, where the correspondence information includes status information indicating whether the working device or the standby device is in operation; and

in the process of establishing a link with a target device, selecting, according to the correspondence information, whichever of the target device and the standby device corresponding to the target device is in operation, and establishing the link with it.

6. The method according to claim 5, wherein, after receiving the correspondence information between the working device and the standby device in the disaster recovery system delivered by the server, the method further includes:

storing or updating the correspondence information.

7. A device comprising:

8. The device of claim 7, further comprising:

a storage module, configured to: store or update the correspondence information.

9. A disaster tolerant system, comprising: the server of claim 3 and the plurality of devices of claim 7.