WO2012155630A1 - Method, device, and system for disaster recovery - Google Patents

Info

Publication number
WO2012155630A1
Authority
WO
WIPO (PCT)
Prior art keywords
working
standby
information
disaster recovery
correspondence
Prior art date
Application number
PCT/CN2012/072357
Other languages
French (fr)
Chinese (zh)
Inventor
邵金龙
景伟东
卢勤元
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2012155630A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668 Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure

Definitions

  • the invention relates to a disaster recovery method, device and system in an intelligent network disaster tolerance system, and more specifically to a method for quickly and automatically switching to the disaster recovery site and restoring system operation in time when some devices in the disaster recovery system are abnormal.
  • this method can be extended to other application scenarios in a disaster-tolerant environment and is not limited to intelligent network applications.
  • the disaster recovery system is designed to avoid fatal losses caused by severe disasters such as earthquakes and power outages. Therefore, two identical systems are established in two cities or in distant places. When a disaster such as an earthquake causes the production system to be completely unavailable, the disaster recovery system can be enabled to restore the business in time and minimize the damage caused by the disaster.
  • an overall switchover of the disaster recovery site is costly, and there are real situations in which it is not appropriate. For example, a small fire may damage some of the equipment in the equipment room while most of the remaining equipment is normal. An overall switchover in this case would cause greater loss, so a partial switchover is required. As shown in Figure 2, if Service Control Point (SCP) 2 is damaged and cannot be repaired in time, SCP2 is replaced by the corresponding device SCP2B at the disaster recovery site. The system then consists of the production site's Interface Machine Point (IMP) 1, SCP1, and Service Management Point (SMP), together with SCP2B at the disaster recovery site.
  • SCP Service control point
  • SMP Service Management Point
  • Figure 1 shows a complete disaster recovery system including a production site and a disaster recovery site.
  • the production sites include: devices IMP1, SCP1, SCP2, and SMP.
  • the disaster recovery site devices include: IMP1B, SCP1B, SCP2B, and SMPB. The devices at the production site and the disaster recovery site correspond one to one.
  • if SCP1 in the production system fails, SCP1B at the disaster recovery site can replace it. When it is replaced, the other devices related to it must be updated at the same time. This is the process of switching a disaster recovery device.
  • the device IMP1 acts as the client and needs to establish a link with the SCP1, SCP2, and SMP devices to communicate.
  • SCP1 and SCP2 each act as clients and must actively establish links with the SMP in order to communicate.
  • when SCP2 fails and SCP2B is enabled in its place, the final network connection is as shown in Figure 2.
  • the corresponding action is to ensure that the affected client device IMP1 can establish a link with SCP2B.
  • SCP2B itself must also establish a link with the SMP. Therefore, the configuration file on IMP1 must be modified so that the information that pointed to SCP2 points to SCP2B, and the program must be restarted for the change to take effect.
  • on SCP2B, the configuration must be modified to point to the SMP's address, and the application must be restarted.
  • the existing disaster recovery system requires manually editing the configuration of each affected device and restarting its application. The operation is complicated, and the many possible switching scenarios place high demands on the operator. The technical problem to be solved by the embodiments of the present invention is to provide a disaster recovery method, device and system that switch to a standby device automatically, through a simple operation, in the event of a failure.
  • the embodiment of the present invention provides a disaster tolerance method, including: configuring a correspondence between the working device and the standby device, and setting, for each pair of working device and standby device, status information indicating whether the working device or the standby device is in operation;
  • finding a faulty working device and modifying the status information corresponding to it; the changed status information is sent to all the devices in the disaster recovery system,
  • so that, in the process of establishing a link, each working device or standby device in the disaster recovery system selects the target device of the link according to the status information.
  • the method further includes: storing the corresponding relationship and the status information.
  • the embodiment of the invention further provides a server, including:
  • the configuration module is configured to: configure a correspondence between the working device and the standby device, and set, for each pair of working device and standby device, status information indicating whether the working device or the standby device is in operation;
  • a lookup module, which is configured to: find a faulty working device;
  • a modification module, which is configured to: modify the status information corresponding to the faulty working device;
  • the delivery module is configured to: send the latest correspondence and status information to all devices in the disaster recovery system, so that, in the process of establishing a link, each working device or standby device in the disaster recovery system selects the target device of the link according to the status information.
  • the above server may also include:
  • a storage module configured to: store the correspondence and the status information.
  • the embodiment of the invention further provides a disaster tolerance method, including:
  • receiving correspondence information, delivered by the server, for the working devices and standby devices in the disaster recovery system, where the correspondence information includes status information indicating whether the working device or the standby device is in operation;
  • in the process of establishing a link with a target device, selecting, according to the correspondence information, whichever of the target device and its corresponding standby device is in operation, and establishing the link with it.
  • the above method can also have the following characteristics:
  • the method further includes:
  • An embodiment of the present invention further provides an apparatus, including:
  • the proxy module is configured to: receive correspondence information, delivered by the server, for the working devices and standby devices in the disaster recovery system, where the correspondence information includes status information indicating whether the working device or the standby device is in operation;
  • the application module is configured to: in the process of establishing a link with a target device, select, according to the correspondence information, whichever of the target device and its corresponding standby device is in operation, and establish the link with it.
  • the above device may further include:
  • a storage module configured to: store or update the correspondence information.
  • the embodiment of the invention further provides a disaster tolerance system, including the foregoing server and multiple devices.
  • the method, device, and system for disaster tolerance of the embodiment of the present invention let the operator perform a disaster recovery switchover by executing a simple command. Switching is fast and automatic, which avoids complicated manual operations, reduces possible operation errors, and improves the efficiency of disaster recovery.
  • FIG. 1 is a connection diagram of an intelligent network disaster tolerance system in the prior art
  • Figure 2 is a connection diagram of the device SCP2 after failover in the intelligent network disaster tolerance system
  • FIG. 3 is a connection diagram formed after a device SCP2 and an SMP failover in an intelligent network disaster tolerance system;
  • FIG. 4 is a schematic structural diagram of a disaster tolerant system according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of a disaster tolerance method according to Embodiment 1 of the present invention.
  • FIG. 6 is a flowchart of a disaster tolerance method according to Embodiment 2 of the present invention
  • FIG. 7 is a flowchart of the operation of switching a faulty device in an intelligent network disaster tolerance system according to an embodiment of the present invention.
  • the embodiment of the present invention mainly uses the correspondence between production site devices and disaster recovery site devices in the disaster tolerant system. A system is built so that, before establishing a link, an application first uses the device correspondence information to find the device that is actually usable and then establishes the link with it. Switching to a disaster recovery device therefore needs no manual reconfiguration, only a simple operation.
  • the information about the production site device and the corresponding disaster recovery site device is referred to as a domain name
  • a system which is called a domain name service system
  • the system includes a disaster recovery system composed of a production site and a disaster recovery site.
  • a domain name management server running a domain name management program;
  • the production site and the disaster recovery site of the disaster recovery system are each composed of one or more intelligent network devices running an agent and an intelligent network application. According to their functions, the devices can be divided into SMP, SCP, IMP, and so on.
  • FIG. 4 is a schematic structural diagram of a disaster tolerance system according to an embodiment of the present invention. As shown in the figure, the system includes a domain name management server and multiple intelligent network devices, where:
  • the domain name management server runs a domain name management program and is connected to each intelligent network device through that program. It handles the configuration, modification, and distribution of the specific instruction information, and transmits the correspondence information for the production site and disaster recovery site devices to each device.
  • the domain name management server can include:
  • the configuration module is configured to configure a correspondence between the working device and the standby device, and to set a working state for each pair of working device and standby device;
  • the lookup module is configured to find a faulty working device;
  • the modification module is configured to modify the working state corresponding to the faulty working device; the sending module is configured to send the correspondence and working state information to all devices in the disaster recovery system, so that, in the process of establishing a link, each working device or standby device in the system selects the target device of the link according to the working state information.
  • the domain name management server may further include:
  • the storage module is configured to store the correspondence and the information of the working state.
  • the intelligent network device is a server that runs intelligent network services and has an agent installed. The device supports current mainstream operating systems, including Linux, AIX, HP-UX, and Solaris. The agent is deployed on each intelligent network device, where it is responsible for interacting with the domain name management server and executing the commands the server issues, so that the device correspondence information is saved on each device.
  • the device correspondence information is first sent by the domain name management server to each device in the disaster recovery system; after receiving the message, the agent on each device updates the information in the shared memory that it maintains;
  • the intelligent network application runs on the same device as the agent and has permission to access the specified shared memory, so it can query the information in the shared memory, determine from it which device is currently active, and establish the link automatically.
  • the domain name management server manages the state of each production site device and its corresponding disaster recovery site device, that is, which of the two is active and which is standby. The state can be represented by two values, 1 and 2: 1 means that the production site device of the pair is active and must be used; 2 means that the disaster recovery site device of the pair is active and must be used. This state, together with other key information, forms the domain name information, which the server sends to all devices in real time for storage.
  • whenever the state of any device changes, for example when device A fails and must be switched to disaster recovery site device B, the change must be sent to all other devices, so that the devices that use device A as their server can reconnect to device B, the currently active device of the pair (A and B).
  • the proxy module on the device is responsible for receiving the correspondence information, delivered by the domain name management server, for the working devices and standby devices in the disaster recovery system, where the correspondence information includes status information indicating whether the working device or the standby device is in operation. The received information can be saved to the device's shared memory for use by the application module running on the device.
  • the agent module may include the following modules:
  • a communication module configured to communicate with a domain name management server, to receive an instruction from the domain name management server;
  • an analysis module configured to decompress the received message and parse it into usable information according to a specified format;
  • the execution module is configured to update and save the parsed information into the shared memory for access by other applications;
  • the core module is configured to coordinate the work of the various modules.
  • the application module, which includes the processing functions of intelligent network services, may be configured, when establishing a link, to choose according to the status information corresponding to the specified target device whether to establish the link with the target device or with the standby device corresponding to the target device. It can include the following modules:
  • a communication module configured to communicate with other device servers or other devices;
  • the database module is configured to log in to the database on this server or another server and perform related database operations.
  • the apparatus may further include a storage module configured to perform operations of storing and updating data, such as storing information of a correspondence between the working device and the standby device and an operational state in a database, or updating data in the database.
  • the embodiment of the present invention provides a method in which, after one or a few devices at the production site fail, the corresponding standby devices at the disaster recovery site can start working immediately, and the other devices connected to a faulty device automatically connect to its corresponding standby device at the disaster recovery site, thus achieving fast and automatic switching.
  • FIG. 5 is a flowchart of a disaster tolerance method according to Embodiment 1 of the present invention. The method is applicable to the domain name management server, and includes the following steps:
  • S51. Configure a correspondence between the working device and the standby device, and set a working state for each pair of working device and standby device.
  • S52. Search for a faulty working device, and modify the working state corresponding to that working device.
  • the modified working state is sent to all devices in the disaster recovery system, so that, in the process of establishing a link, each working device or standby device in the disaster recovery system selects the target device of the link according to the working state information.
  • FIG. 6 is a flowchart of a disaster tolerance method according to Embodiment 2 of the present invention. The method is applicable to the foregoing apparatus, and includes the following steps:
  • the main purpose of this embodiment is to replace the existing manual operation with automatic recognition by the intelligent network application.
  • the key is how to manage, transfer, and save related device correspondence information.
  • the device correspondence information between the production site device and the disaster recovery site device in the disaster recovery system is entered into the database system for storage;
  • Specific information includes: production site device information, disaster recovery site device information, and device state (indicating which device is currently active and which is the standby). The domain name information of all devices is then distributed to the agent processes on all devices.
  • the simplest one is used as an example.
  • the production site device information is the IP address of device A.
  • the disaster recovery site device information is the IP address of device B.
  • the device state has two values: 1 indicates that the production site device is currently the active device, and 2 indicates that the disaster recovery site device is the active device.
  • when a client program wants to connect to a server, it first looks up the server's IP address in the domain name information and finds that it belongs to device A (using the example above). It then checks the device state. If the state is 2, device A is currently the standby (possibly because of a fault) and device B is active, so the connection must actually be made to device B. For the client, link establishment is thus fully automatic.
  • S20. When a production site device fails, select the faulty device's information in the domain name management program interface, change the state of its standby device to active, and distribute the updated device correspondence information to all devices.
  • Each device receives the device correspondence information from the domain name management server, parses it, and saves the updated information to the shared memory.
  • when establishing a link with another device, each device uses the information (that is, the IP address) of the device to link with, read from its configuration file, to look up the corresponding active device in the shared memory, and then establishes the link with the active device.
  • a client that was connected to the faulty device reconnects automatically.
  • following the steps above, it establishes a link to the standby device at the disaster recovery site.
  • the agents are deployed on all devices at the production site and the disaster recovery site, and are running normally.
  • step S10 includes: entering the correspondence information for all devices, including device state, in the domain name management server; saving the data to the server's own database when entry is complete; and distributing all the device correspondence information to the selected devices for synchronous update.
  • step S20 includes: in the domain name management server, obtaining the latest device correspondence information from its own database; selecting the record of the faulty device and changing the state of the corresponding standby device to active, which enables the disaster recovery site device; and distributing all updated device correspondence information to all devices for synchronous update.
  • step S30 includes: the device server is in a listening state; it receives a message from the domain name management server; parses the message and determines that it is a synchronization message; updates the specified shared memory on the server, saving the latest device correspondence information to it; and returns a success or failure result message to the domain name management server.
  • step S40 includes: when a device fails, a client device linked with it attempts to reconnect because of the link problem; the client determines the IP address of the faulty device (if the client device is linking with the faulty device after a restart, it reads the IP address of the faulty device from the configuration file);
  • with that IP address as the parameter, it calls the application programming interface (API) function provided by the agent to obtain the IP address of the device whose current state is active, that is, the address of the standby device at the disaster recovery site;
  • it calls the socket function to establish a link to the device at the disaster recovery site.
  • API application programming interface
  • the domain name management server, the proxy module on each device, and the intelligent network application module on each device complete the collection, transmission, and use of the domain name information, and truly construct a complete domain name service.
  • FIG. 7 is a flowchart of an operation of switching a device in an intelligent network disaster recovery system according to an embodiment of the present invention.
  • the operation steps are as follows: Step 701: Device A is faulty, so the state of the disaster recovery site device corresponding to device A must be modified to enable that device; the disaster recovery device switching operation begins.
  • Step 711 The user selects a query in the domain name management server to list the correspondence information of all devices in the current disaster recovery system.
  • Step 712 In the domain name management server, select the domain name information corresponding to device A, modify the states of the production site device and the disaster recovery site device, and change the state of the disaster recovery site device to active.
  • Step 713 In the domain name management server, when the delivery is selected, the latest device correspondence information is automatically sent to all devices.
  • Step 721 After receiving the message from the domain name management server, the device A parses the message, and determines that it is a synchronous operation, and obtains the latest device correspondence information.
  • the device in this embodiment can run two programs: an agent and an intelligent network application. The agent interacts with the domain name management server, while the intelligent network application is what provides the service. A fault of device A therefore means that the intelligent network application is faulty and cannot provide service, while the agent remains normal.
  • Step 722 Device A updates the shared memory and saves the latest device correspondence information to it.
  • Step 723 Device A also saves the latest device correspondence information to the specified file of the server to ensure that the information can be restored when the agent restarts abnormally.
  • Step 724 The result of the delivery operation is returned. If any of the steps 721, 722, and 723 fails, the domain name operation fails. If all the steps are successful, the domain name operation succeeds.
  • Step 731 After receiving the message from the domain name management server, the device B parses the message, and determines that it is a synchronous operation, and obtains the latest device correspondence information.
  • Step 732 Device B updates the shared memory of the server where it resides, and saves the latest device correspondence information to it.
  • Step 733 Device B also saves the latest device correspondence information to the specified file of the server to ensure that the information can be restored when the agent restarts abnormally.
  • Step 734 The device B returns the result of the delivery operation. If any of the steps 731, 732, and 733 fails, the domain name operation fails. If all the steps are successful, the domain name operation succeeds.
  • the domain name service system automatically repeats the operations of step 713, step 721, step 722, step 723, and step 724.
  • Step 702 After the sending operation of all the devices is completed, the user completes the switching action of the device A.
  • the method, device, and system for disaster tolerance of the embodiments of the present invention let an operator perform a disaster recovery switchover by executing a simple command. Switching is fast and automatic, which avoids complicated manual operations, reduces possible operation errors, and improves the efficiency of disaster recovery.
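
The device-side behaviour described in steps 721 to 724 and in step S40 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the patent does not specify a message format, so JSON is assumed here; a plain dictionary stands in for the shared memory; and the names `Agent`, `on_sync_message`, `active_ip`, and `connect_to_active` are hypothetical and do not come from the patent.

```python
import json
import os
import socket
import tempfile

PRODUCTION_ACTIVE = 1  # state 1 in the description: production site device is active
STANDBY_ACTIVE = 2     # state 2: disaster recovery site device is active


class Agent:
    """Hypothetical per-device agent; a dict stands in for the shared memory."""

    def __init__(self, backup_path):
        self.backup_path = backup_path
        self.table = {}  # production IP -> (disaster recovery IP, state)
        # Restore the last saved table so an abnormal agent restart loses nothing.
        if os.path.exists(backup_path):
            with open(backup_path, encoding="utf-8") as f:
                self.table = {k: tuple(v) for k, v in json.load(f).items()}

    def on_sync_message(self, raw):
        """Parse, update the table, persist it; return True only if every step worked."""
        try:
            records = json.loads(raw)  # assumed JSON wire format
            self.table = {
                r["working"]: (r["standby"], r["state"]) for r in records
            }
            with open(self.backup_path, "w", encoding="utf-8") as f:
                json.dump(self.table, f)
        except (ValueError, KeyError, TypeError, OSError):
            return False
        return True

    def active_ip(self, configured_ip):
        """Map the IP read from the configuration file to the currently active IP."""
        entry = self.table.get(configured_ip)
        if entry is None:
            return configured_ip  # no correspondence known: use the address as configured
        standby_ip, state = entry
        return standby_ip if state == STANDBY_ACTIVE else configured_ip


def connect_to_active(agent, configured_ip, port, timeout=5.0):
    """Open a TCP link to whichever device of the pair is active."""
    return socket.create_connection((agent.active_ip(configured_ip), port), timeout=timeout)
```

With this shape, the application keeps the production address in its configuration file and never edits it. A client that reconnects after a switchover calls `connect_to_active` and reaches the disaster recovery device once the agent has processed the synchronization message.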

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Hardware Redundancy (AREA)
  • Computer And Data Communications (AREA)

Abstract

A method and system for disaster recovery, the method comprising: configuring correspondences between operation devices and back-up devices, and configuring for each group of operating device and back-up device status message identifying whether an operation device or a back-up device is in operation; searching for the operation device in malfunction, modifying the status message corresponding to the operation device; transmitting the modified status message to all devices in the disaster recovery system, so that in the process of establishing links, each operation device or back-up device in the recovery system selects a target device for link establishment according to the operation status message.
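
The server-side half of the method in the abstract (configure pairs, flip the status of a faulty device, send the result to every device) can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: the class and method names are hypothetical, delivery is modelled as a plain callback rather than a network message, and the status values 1 and 2 follow the convention given in the description.

```python
from dataclasses import dataclass

PRODUCTION_ACTIVE = 1  # the working (production site) device is in operation
STANDBY_ACTIVE = 2     # the back-up (disaster recovery site) device is in operation


@dataclass
class DevicePair:
    working_ip: str
    standby_ip: str
    state: int = PRODUCTION_ACTIVE


class CorrespondenceServer:
    """Hypothetical stand-in for the management server."""

    def __init__(self):
        self.pairs = {}     # working IP -> DevicePair
        self.devices = []   # one delivery callback per registered device

    def configure(self, working_ip, standby_ip):
        """Record a working/back-up correspondence; the working device starts active."""
        self.pairs[working_ip] = DevicePair(working_ip, standby_ip)

    def register_device(self, deliver):
        """Register a callable that delivers a snapshot to one device."""
        self.devices.append(deliver)

    def mark_failed(self, working_ip):
        """Flip the pair to its back-up device and push the change to every device."""
        self.pairs[working_ip].state = STANDBY_ACTIVE
        self.distribute()

    def distribute(self):
        """Send the full correspondence table, with status, to all devices."""
        snapshot = [(p.working_ip, p.standby_ip, p.state) for p in self.pairs.values()]
        for deliver in self.devices:
            deliver(snapshot)
```

Sending the whole table on every change, rather than only the changed pair, keeps each device's copy self-contained; the description likewise distributes all updated correspondence information to all devices.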

Description

Method, device and system for disaster recovery
Technical field
The present invention relates to a disaster recovery method, device and system in an intelligent network disaster recovery system, and more specifically to a method for quickly and automatically switching to the disaster recovery site and restoring system operation in time when some devices in the disaster recovery system are abnormal. The method can be extended to other application scenarios in a disaster recovery environment and is not limited to intelligent network applications.
Background
A disaster recovery system is built to avoid fatal losses from severe disasters such as earthquakes and power outages, so two identical systems are set up in two cities or at two sites far apart. When a disaster such as an earthquake makes the production system completely unavailable, the disaster recovery system can be enabled to restore service in time and minimize the loss caused by the disaster.
In theory, when a severe natural disaster such as a major earthquake occurs, the production system is completely destroyed, so enabling the disaster recovery site to restore service means an overall switchover in which only the disaster recovery site's devices are used. In the disaster recovery environment shown in Figure 1, the production site is shut down entirely and service is enabled wholly on the disaster recovery site.
However, an overall switchover of the disaster recovery site is costly, and there are real situations in which it is not appropriate. For example, a small fire may damage some of the equipment in the equipment room while most of the remaining equipment is normal; an overall switchover in that case would cause greater loss, so a partial switchover is required. As shown in Figure 2, suppose the device Service Control Point (SCP) 2 is damaged and cannot be repaired in time. SCP2 is then replaced by the corresponding device SCP2B at the disaster recovery site, and the system actually consists of the production site's Interface Machine Point (IMP) 1, SCP1, and Service Management Point (SMP), together with SCP2B at the disaster recovery site. This provides disaster recovery without an overall switchover affecting devices that are working normally, and keeps the system running as stably as possible. However, this design flexibility also brings configuration complexity. Take the disaster recovery system shown in Figure 1 as an example.
Figure 1 shows a complete disaster recovery system comprising a production site and a disaster recovery site. The production site includes the devices IMP1, SCP1, SCP2, and SMP; the disaster recovery site includes IMP1B, SCP1B, SCP2B, and SMPB. The devices of the two sites correspond one to one: if SCP1 in the production system fails, SCP1B at the disaster recovery site can replace it, and the other devices related to that device must be updated at the same time. This is the process of switching a disaster recovery device.
In the system shown in Figure 1, the device IMP1 acts as a client and must actively establish links with the three devices SCP1, SCP2, and SMP in order to communicate. SCP1 and SCP2 each act as a client and must actively establish a link with the SMP. When SCP2 fails, the corresponding device SCP2B at the disaster recovery site can be enabled to replace it, and the resulting network connections are as shown in Figure 2. The corresponding actions are to ensure that the affected client device IMP1 can establish a link with SCP2B and that SCP2B itself can establish a link with the SMP. Therefore, the configuration file on IMP1 must be modified so that the information that originally pointed to SCP2 points to SCP2B, and the program must be restarted for the change to take effect; on SCP2B, the configured address for connecting to the SMP must be modified and the application restarted.
The example above is a relatively simple environment. If two devices fail (SCP2 and SMP), the resulting network connections are as shown in Figure 3, and all client programs involved, namely those on IMP1, SCP1, and SCP2B, must be modified and restarted together.
In practice, a production site has more devices and more complex network connections. When one or more devices fail and must be switched over, one consequence is that the operation is complicated: the operator must log in to each related device remotely (via telnet), edit its configuration file by hand, and restart the device program. Another is that the operator must be able to make the correct modifications for every switchover scenario, that is, must know exactly what to do when any given device fails, which is an extremely demanding requirement.
Summary of the invention
In the prior art, switching over a disaster recovery system requires manually editing the relevant configuration of the affected devices and restarting the applications. The operation is complicated, and the many switchover scenarios place high demands on operators. In view of this, the technical problem to be solved by the embodiments of the present invention is to provide a disaster recovery method, device, and system that switch over to a standby device automatically, through a simple operation, when a failure occurs.
To solve the above technical problem, an embodiment of the present invention provides a disaster recovery method, including:
configuring a correspondence between working devices and standby devices, and setting, for each group of a working device and a standby device, state information that marks whether the working device or the standby device is performing the work;
finding a working device that has failed, and modifying the state information corresponding to that working device; and
sending the changed state information to all devices in the disaster recovery system, so that each working device or standby device in the disaster recovery system, in the process of establishing a link, selects the target device for the link according to the state information.
The above method may also have the following feature:
after configuring the correspondence between working devices and standby devices and setting, for each pair of a working device and a standby device, the state information that marks whether the working device or the standby device is performing the work, the method further includes: storing the correspondence and the state information.
An embodiment of the present invention further provides a server, including:
a configuration module, configured to: configure a correspondence between working devices and standby devices, and set, for each group of a working device and a standby device, state information that marks whether the working device or the standby device is performing the work;
a search module, configured to: find a working device that has failed;
a modification module, configured to: modify the state information corresponding to the failed working device; and
a delivery module, configured to: send the latest correspondence and state information to all devices in the disaster recovery system, so that each working device or standby device in the disaster recovery system, in the process of establishing a link, selects the target device for the link according to the state information.
The above server may further include:
a storage module, configured to: store the correspondence and the state information.
An embodiment of the present invention further provides a disaster recovery method, including:
receiving correspondence information, delivered by a server, of the working devices and standby devices in a disaster recovery system, where the correspondence information includes state information that marks whether the working device or the standby device is performing the work; and
in the process of establishing a link with a target device, selecting, according to the correspondence information, whichever of the target device and the standby device corresponding to the target device is performing the work, and establishing the link with it.
The above method may also have the following feature:
after receiving the correspondence information, delivered by the server, of the working devices and standby devices in the disaster recovery system, the method further includes:
storing or updating the correspondence information.
An embodiment of the present invention further provides a device, including:
an agent module, configured to: receive correspondence information, delivered by a server, of the working devices and standby devices in a disaster recovery system, where the correspondence information includes state information that marks whether the working device or the standby device is performing the work; and
an application module, configured to: in the process of establishing a link with a target device, select, according to the correspondence information, whichever of the target device and the standby device corresponding to the target device is performing the work, and establish the link with it.
The above device may further include:
a storage module, configured to: store or update the correspondence information.
An embodiment of the present invention further provides a disaster recovery system, including the above server and a plurality of the above devices.
In summary, with the disaster recovery method, device, and system of the embodiments of the present invention, an operator only needs to execute a simple command to complete a disaster recovery switchover. The switchover is fast and automatic, which avoids complicated manual operations, reduces the chance of operator error, and improves switchover efficiency.
Brief description of the drawings
Figure 1 is a connection diagram of an intelligent network disaster recovery system in the prior art;
Figure 2 is a connection diagram of the intelligent network disaster recovery system after the failed device SCP2 has been switched over;
Figure 3 is a connection diagram of the intelligent network disaster recovery system after the failed devices SCP2 and SMP have been switched over;
Figure 4 is a schematic structural diagram of a disaster recovery system according to an embodiment of the present invention;
Figure 5 is a flowchart of a disaster recovery method according to Embodiment 1 of the present invention;
Figure 6 is a flowchart of a disaster recovery method according to Embodiment 2 of the present invention;
Figure 7 is an operation flowchart of a switchover after a device failure in the intelligent network disaster recovery system according to an embodiment of the present invention.
Preferred embodiments of the invention
Embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, where no conflict arises, the embodiments in this application and the features in those embodiments may be combined with one another arbitrarily.
The embodiments of the present invention mainly make use of the correspondence that exists between the devices at the production site and the devices at the disaster recovery site in a disaster recovery system. A system is constructed in which an application, before establishing a link, first uses this group of device correspondence information to find the device that is actually available and then establishes the link with that device. A disaster recovery device switchover therefore requires no manual editing, only a simple operation.
This embodiment refers to the information about a production site device and its corresponding disaster recovery site device as a domain name, and provides a system called the domain name service system. The system includes a disaster recovery system, composed of a production site and a disaster recovery site, and a domain name management server that runs a domain name management program. The production site and the disaster recovery site each consist of one or more intelligent network devices that run an agent and an intelligent network application. Depending on their function, these devices can be classified as SMP, SCP, IMP, and so on.
Figure 4 is a schematic structural diagram of a disaster recovery system according to an embodiment of the present invention. As shown in the figure, the system includes a domain name management server and a plurality of intelligent network devices, where:
The domain name management server runs a domain name management program and is connected to each intelligent network device through that program. It configures, modifies, and distributes the specific instruction information, and passes the correspondence information of the production site and disaster recovery site devices to every device.
The domain name management server may include:
a configuration module, configured to configure a correspondence between working devices and standby devices, and to set a working state for each group of a working device and a standby device;
a search module, configured to find a working device that has failed;
a modification module, configured to modify the working state corresponding to the failed working device; and
a delivery module, configured to send the correspondence and the working state information to all devices in the disaster recovery system, so that each working device or standby device in the disaster recovery system, in the process of establishing a link, selects the target device for the link according to the working state information.
The domain name management server may further include:
a storage module, configured to store the correspondence and the working state information.
The intelligent network device is a server on which the intelligent network service runs and on which an agent is installed. The device supports current mainstream operating systems, including Linux, AIX, HP-UX, and Solaris. The agent is deployed on every intelligent network device; it is responsible for interacting with the domain name management server and executing the commands that the server delivers, so that the device correspondence information is synchronized and stored on every device.
As the structure in Figure 4 shows, the domain name management server first sends the device correspondence information to each device in the disaster recovery system. After the agent on a device receives the message, it updates the information in the shared memory that it maintains. The intelligent network application runs on the same device as the agent and has permission to access the designated shared memory, so it can query the information in shared memory, determine from it which device is ultimately active, and establish the link automatically.
The domain name management server is configured to configure and manage the key information referred to as the domain name: a production site device in the disaster recovery system, its corresponding disaster recovery site device, and the state of that group of devices (which one is active and which is standby). The state can take two values, 1 and 2: 1 indicates that the production site device of the group is currently active and must be used; 2 indicates that the disaster recovery site device of the group is active and must be used. The server sends this domain name information to all devices in real time for storage.
Any change in the state of a device must be propagated. For example, if device A fails and a switchover to disaster recovery site device B is required, the change information should be sent to all other devices, so that the other devices that use device A as their server side can reconnect to device B, the device in this group (A and B) whose current state is active.
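The group record and its two-valued state described above can be sketched as a small data structure. This is a minimal illustration rather than the implementation of the embodiment: the names `DomainRecord`, `PRODUCTION_ACTIVE`, `DR_ACTIVE`, and `active_ip` are assumptions introduced here, and only the meaning of states 1 and 2 comes from the description.

```python
from dataclasses import dataclass

# State values as described in the text.
PRODUCTION_ACTIVE = 1  # the production site device of the group is active
DR_ACTIVE = 2          # the disaster recovery site device of the group is active


@dataclass
class DomainRecord:
    """One group: a production site device and its standby (hypothetical layout)."""
    production_ip: str
    dr_ip: str
    state: int = PRODUCTION_ACTIVE

    def active_ip(self) -> str:
        """Return the address of whichever device in the group is active."""
        if self.state == PRODUCTION_ACTIVE:
            return self.production_ip
        if self.state == DR_ACTIVE:
            return self.dr_ip
        raise ValueError(f"unknown state: {self.state}")
```

Keeping both addresses and the state in one record means a switchover changes a single field, and every reader of the record derives the active device in the same way.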
The agent module on the device is responsible for receiving, from the domain name management server, the correspondence information of the working devices and standby devices in the disaster recovery system, where the correspondence information includes state information that marks whether the working device or the standby device is performing the work. It can save the received information to the shared memory of the device, for the application module running on the device to obtain.
The agent module may include the following modules:
a communication module, configured to communicate with the domain name management server and receive instructions from it;
an analysis module, configured to decompress a received message and parse it, according to a specified format, into usable information;
an execution module, configured to update and save the parsed information in shared memory, for other applications to access; and
a core module, configured to coordinate the work of the other modules.
The application module includes the processing functions of the intelligent network services. It may be configured to, in the process of establishing a link, choose according to the state information corresponding to the specified target device either to establish the link with that target device or to establish the link with the standby device corresponding to that target device. It may include the following modules:
a communication module, configured to communicate with other device servers or other devices; and
a database module, configured to log in to a database on this server or on another server and perform the related database operations.
The device may further include a storage module, configured to store and update data, for example to store the correspondence between working devices and standby devices and the working state information in a database, or to update the data in the database.
The parts of the above system work together to achieve fast, automatic switchover of the devices in the disaster recovery system.
An embodiment of the present invention provides a method whereby, after one or more devices at the production site fail, the corresponding backup devices at the disaster recovery site can start working immediately, and the other devices that were connected to a failed device automatically connect to the backup device at the disaster recovery site that corresponds to that failed device, thereby achieving fast, automatic switchover.
Figure 5 is a flowchart of a disaster recovery method according to Embodiment 1 of the present invention. The method applies to the domain name management server described above and includes the following steps:
S51. Configure a correspondence between working devices and standby devices, and set a working state for each group of a working device and a standby device.
S52. Find a working device that has failed, and modify the working state corresponding to that working device.
S53. Send the changed working state information to all devices in the disaster recovery system, so that each working device or standby device in the disaster recovery system, in the process of establishing a link, selects the target device for the link according to the working state information.
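Steps S51 to S53 can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the embodiment's server: the class `DomainManager`, its method names, the record layout keyed by the production device's IP address, and the modelling of each device agent as a callable that returns `True` on success are all hypothetical.

```python
import copy
from typing import Callable, Dict, List

# Hypothetical record layout: production IP -> {"dr_ip": ..., "state": 1 or 2}.
Records = Dict[str, Dict[str, object]]


class DomainManager:
    """Hypothetical stand-in for the domain name management server."""

    def __init__(self) -> None:
        self.records: Records = {}
        # Assumption: each agent accepts the records and reports success.
        self.agents: List[Callable[[Records], bool]] = []

    def configure(self, production_ip: str, dr_ip: str) -> None:
        """S51: record the correspondence; the production device starts active."""
        self.records[production_ip] = {"dr_ip": dr_ip, "state": 1}

    def mark_failed(self, production_ip: str) -> None:
        """S52: change the group's working state so the standby is active."""
        self.records[production_ip]["state"] = 2

    def distribute(self) -> bool:
        """S53: send the current records to every agent; True if all succeed."""
        # Every agent is called, and each receives its own copy of the records.
        results = [agent(copy.deepcopy(self.records)) for agent in self.agents]
        return all(results)
```

Collecting every result before combining them means one unreachable device does not stop the remaining devices from receiving the update.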
Figure 6 is a flowchart of a disaster recovery method according to Embodiment 2 of the present invention. The method applies to the device described above and includes the following steps:
S61. Receive correspondence information, delivered by the server, of the working devices and standby devices in the disaster recovery system, where the correspondence information includes state information that marks whether the working device or the standby device is performing the work.
S62. In the process of establishing a link, choose, according to the state information corresponding to the specified target device, either to establish the link with that target device or to establish the link with the standby device corresponding to that target device.
The main purpose of this embodiment is to replace the existing manual operation with automatic identification by the intelligent network application. The key lies in how the device correspondence information is managed, transmitted, and stored.
The following is the flow of an application example of the present invention, which includes at least the following steps:
S10. In the interface of the domain name management program, enter the device correspondence information between the production site devices and the disaster recovery site devices of the disaster recovery system into the database system for storage.
The specific information includes: the production site device information, the disaster recovery site device information, and the device state (indicating which device is currently the active device and which is the backup device). The domain name information of all devices is distributed to the agents on all devices for processing. Taking the simplest case as an example, the production site device information is the IP address of device A, the disaster recovery site device information is the IP address of device B, and the device state takes one of two values: 1 indicates that the production site device is currently the active device, and 2 indicates that the disaster recovery site device is the active device.
In this way, when a client program needs to connect to a server, it first searches the domain name information for the IP address and determines that it belongs to device A (assuming the information from the example above). It then checks the device state. If the state is 2, device A is currently the standby (it may have failed) and device B is the active device, so the actual connection must be made to device B. For the client, this fully achieves automatic link establishment.
S20. When a production site device fails, select the information of the failed device in the interface of the domain name management program, change the state of its backup device to active, and distribute the updated device correspondence information to all devices.
S30. Each device receives the device correspondence information from the domain name management server, parses it, and then updates and saves it in shared memory.
S40. When the application on a device establishes a link with another device, it takes the information (that is, the IP address) of the device to be linked, as read from the configuration file, looks up the information of the corresponding active device in shared memory, and establishes the link with that active device. When a device fails, the clients linked to that device reconnect automatically and, by following the above steps, establish a link to the standby device at the disaster recovery site, which completes the automatic switchover.
In this application example, the following holds before step S10: the correspondence between all production site devices and disaster recovery site devices in the disaster recovery system has been worked out, and the agent has been deployed and is running normally on all devices at the production site and the disaster recovery site.
In this application example, step S10 includes: in the domain name management server, entering the correspondence information of all devices, including the device states; when finished, saving the data to the server's own database; and performing distribution, which sends all the device correspondence information to the selected devices for synchronous update.
In this application example, step S20 includes: in the domain name management server, obtaining the latest device correspondence information from the server's own database; selecting the record of the failed device and changing the state of its corresponding backup device to active, which enables the disaster recovery site device; and performing distribution, which sends all the updated device correspondence information to all devices for synchronous update.
In this application example, step S30 includes: the device server is in the listening state; it receives a message from the domain name management server; it parses the message and determines that it is a synchronization message; it updates the designated shared memory on its server, saving the latest device correspondence information there; and it returns a success or failure result message to the domain name management server.
In this application example, step S40 includes: when a device fails, a client device linked to it attempts to reconnect because of the link problem; the client determines the IP address of the failed device (if the client device has been restarted, it reads the IP address of the failed device from the configuration file); with the IP address of the failed device as the parameter, it calls the application programming interface (Application Programming Interface, API) function provided by the agent to obtain the IP address of the device whose current state is active, that is, the address of the backup device at the disaster recovery site; and it establishes a link to the disaster recovery site device through the socket functions.
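The lookup and connection in step S40 can be sketched as follows. This is a hedged illustration, not the agent's actual API: the function names `get_active_ip` and `connect_active`, the table layout, and the fallback for an address with no record are assumptions introduced here. The only library call used for the link itself is Python's standard `socket.create_connection`.

```python
import socket
from typing import Dict, Tuple

# Hypothetical table: configured (production) IP -> (standby IP, state).
DomainTable = Dict[str, Tuple[str, int]]


def get_active_ip(table: DomainTable, configured_ip: str) -> str:
    """Return the IP of the device that is currently active for this group."""
    entry = table.get(configured_ip)
    if entry is None:
        # Assumption: an address with no record is used unchanged.
        return configured_ip
    standby_ip, state = entry
    # State 2 means the disaster recovery site device is active.
    return standby_ip if state == 2 else configured_ip


def connect_active(table: DomainTable, configured_ip: str, port: int,
                   timeout: float = 5.0) -> socket.socket:
    """Open a TCP link to whichever device of the group is active."""
    active_ip = get_active_ip(table, configured_ip)
    return socket.create_connection((active_ip, port), timeout=timeout)
```

Because the client keeps the production address in its configuration file and resolves the active device on each connection attempt, a reconnect after a failure reaches the standby device without any edit to that file.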
In the above application example, the domain name management server, the agent module on each device, and the intelligent network application module on each device together collect, transmit, and use the domain name information. They form a complete domain name service system that simplifies the process and operating steps of device switchover in the disaster recovery system and greatly improves efficiency.
图 7为本发明实施例中智能网容灾系统某设备故障进行切换的运行流程 图, 其操作步骤如下: 步骤 701:设备 A故障,需要修改与设备 A对应的容灾站点设备的状态, 启用容灾站点设备, 也即开始容灾设备倒换操作; FIG. 7 is a flowchart of an operation of switching a device in an intelligent network disaster recovery system according to an embodiment of the present invention. The operation steps are as follows: Step 701: If the device A is faulty, the state of the disaster recovery site device corresponding to the device A needs to be modified. The disaster recovery site device starts the disaster recovery device switching operation.
步骤 711 : 用户在域名管理服务器中, 选择查询, 列出当前容灾系统中 所有设备的对应关系信息。  Step 711: The user selects a query in the domain name management server to list the correspondence information of all devices in the current disaster recovery system.
步骤 712: 在域名管理服务器中, 选择设备 A对应的域名信息, 并修改 其中生产站点设备、 容灾站点设备的状态, 将容灾站点设备的状态改为活动。  Step 712: In the domain name management server, select the domain name information corresponding to the device A, and modify the state of the production site device and the disaster recovery site device, and change the state of the disaster recovery site device to the activity.
步骤 713 : 在域名管理服务器中, 选择下发, 会自动向所有设备下发最 新设备对应关系信息。  Step 713: In the domain name management server, when the delivery is selected, the latest device correspondence information is automatically sent to all devices.
步骤 721 : 设备 A接收到来自域名管理服务器的消息后, 解析该消息, 并确定是同步操作, 取得最新的设备对应关系信息。  Step 721: After receiving the message from the domain name management server, the device A parses the message, and determines that it is a synchronous operation, and obtains the latest device correspondence information.
本实施例中的设备上可以运行两套程序, 一个是代理程序, 一个是智能 网应用程序; 代理程序与域名管理服务器交互, 而真正提供业务的是智能网 应用程序, 所以说设备 A故障, 指的就是智能网应用程序故障了, 不能提供 业务了, 而代理程序正常。  The device in this embodiment can run two sets of programs, one is an agent program, and the other is an intelligent network application; the agent program interacts with the domain name management server, and the service provider is an intelligent network application, so that device A is faulty. It means that the intelligent network application is faulty, the service cannot be provided, and the agent is normal.
当然, 还有一种情况, 就是设备 A因为故障而关机了, 那么代理程序和 智能网应用程序都不能用了。 这种情况下, 域名信息是无法正常下发到该设 备了。 但是, 对于其他设备来说, 都是可以正常接收消息的, 知道设备 A当 前是不可用, 以后再连接到设备 A时, 通过查询, 就知道实际需要连的设备 就是设备 B了。 Of course, there is another situation where device A shuts down due to a failure, and neither the agent nor the intelligent network application can be used. In this case, the domain name information cannot be delivered to the device. However, for other devices, it is possible to receive messages normally, knowing that device A is The former is not available. When you connect to device A later, you can know that the device you need to connect is device B by query.
Step 722: Device A updates the shared memory and saves the latest device correspondence information in it.
Step 723: Device A also saves the latest device correspondence information to a designated file on its server, so that the information can be restored if the agent program restarts abnormally.
Step 724: Device A returns the result of the delivery operation. If any of steps 721, 722, and 723 fails, the domain name delivery operation fails; if all of them succeed, the domain name delivery operation succeeds.
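Steps 721 to 724 can be sketched as a single handler in the agent program. The sketch rests on several assumptions that do not come from the text: the message is JSON with hypothetical `op` and `table` fields, a module-level dictionary stands in for the shared memory, and the designated file is a JSON file whose path is passed in by the caller.

```python
import json
import os
import tempfile

# Stand-in for the shared memory that the application program reads (assumption).
shared_table: dict = {}


def handle_server_message(raw: bytes, backup_path: str) -> bool:
    """Process one message from the server; True means the delivery succeeded."""
    try:
        # Step 721: parse the message and confirm it is a synchronization operation.
        message = json.loads(raw)
        if not isinstance(message, dict) or message.get("op") != "sync":
            return False
        table = message["table"]
        if not isinstance(table, dict):
            return False

        # Step 722: update the in-memory copy.
        shared_table.clear()
        shared_table.update(table)

        # Step 723: persist to the designated file. Writing a temporary file
        # and renaming it avoids leaving a half-written backup behind.
        directory = os.path.dirname(os.path.abspath(backup_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(table, handle)
        os.replace(tmp_path, backup_path)
    except (ValueError, KeyError, OSError):
        # Step 724: any failed step makes the whole delivery fail.
        return False
    # Step 724: every step succeeded.
    return True


# Demo: one valid synchronization message and one malformed message.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "pairs.json")
    good = json.dumps({"op": "sync", "table": {"A": "standby"}}).encode()
    ok = handle_server_message(good, path)
    with open(path, encoding="utf-8") as handle:
        saved = json.load(handle)
    rejected = handle_server_message(b"not json", path)
print(ok, saved, rejected)
```

Returning a single boolean mirrors step 724, where the failure of any one step is reported as a failed delivery.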
Step 731: After receiving the message from the domain name management server, device B parses the message, determines that it is a synchronization operation, and obtains the latest device correspondence information.
Step 732: Device B updates the shared memory of its server and saves the latest device correspondence information in it.
Step 733: Device B also saves the latest device correspondence information to a designated file on its server, so that the information can be restored if the agent program restarts abnormally.
Step 734: Device B returns the result of the delivery operation. If any of steps 731, 732, and 733 fails, the domain name delivery operation fails; if all of them succeed, the domain name delivery operation succeeds.
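The purpose of the designated file in steps 723 and 733 is recovery after an abnormal restart of the agent program. A minimal sketch of that recovery path is shown below. The JSON file format, the `restore_table` name, and the choice to start with an empty table when no usable file exists are all assumptions made for illustration; the text does not specify them.

```python
import json
import os
import tempfile


def restore_table(backup_path: str) -> dict:
    """Reload the last saved correspondence table after an agent restart."""
    try:
        with open(backup_path, encoding="utf-8") as handle:
            table = json.load(handle)
    except (OSError, ValueError):
        # No usable backup: start empty and wait for the next delivery
        # from the server (an assumption; the source does not say).
        return {}
    return table if isinstance(table, dict) else {}


# Demo: save a table, then reload it as if the agent had just restarted.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "pairs.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"A": {"standby": "B", "active": "standby"}}, handle)
    restored = restore_table(path)
    missing = restore_table(os.path.join(directory, "absent.json"))
print(restored, missing)
```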
Depending on the number of devices in the disaster recovery system, the domain name service system automatically repeats the operations of steps 713, 721, 722, 723, and 724.
Step 702: After the delivery operation has completed for all devices, the user's switchover of device A is complete.
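The server side of this loop can be sketched as follows. The `deliver_to_all` function and the `send` callback are hypothetical. The callback stands in for whatever transport carries the message to a device's agent program and is assumed to return the result reported in step 724 or 734. Treating an unreachable device as a failed delivery, rather than aborting the loop, is also an assumption, made so that the powered-off case described earlier does not block the other devices.

```python
from typing import Callable


def deliver_to_all(
    devices: list[str],
    table: dict,
    send: Callable[[str, dict], bool],
) -> dict[str, bool]:
    """Send the latest table to every device and record each device's result."""
    results = {}
    for device in devices:
        try:
            results[device] = send(device, table)
        except OSError:
            # A powered-off device cannot be reached; record the failure and
            # keep going so the remaining devices still get the update.
            results[device] = False
    return results


def fake_send(device: str, table: dict) -> bool:
    """Stand-in transport: device A is powered off, all others acknowledge."""
    if device == "A":
        raise ConnectionError("device A is unreachable")
    return True


results = deliver_to_all(["A", "B", "C"], {"A": "standby"}, fake_send)
print(results)
```

Keeping a per-device result lets the operator see which devices did not receive the update.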
A person of ordinary skill in the art will understand that all or some of the steps of the above method may be carried out by a program instructing the relevant hardware, and that the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Optionally, all or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in hardware or as a software functional module. The invention is not limited to any particular combination of hardware and software.
The above are only preferred embodiments of the present invention. The invention may of course have various other embodiments, and those skilled in the art may make corresponding changes and variations based on the embodiments of the invention without departing from its spirit and essence; all such changes and variations shall fall within the scope of protection of the claims appended to the present invention.
Industrial Applicability
With the disaster recovery method, device, and system of the embodiments of the present invention, an operator only needs to execute a simple command to complete a disaster recovery switchover. The switchover is fast and automatic, which avoids complicated manual operations, reduces possible operating errors, and improves the efficiency of disaster recovery switchover.

Claims

1. A disaster recovery method, comprising:
configuring a correspondence between working devices and standby devices, and setting, for each group of a working device and a standby device, status information indicating whether the working device or the standby device is in service;
finding a working device that has failed, and modifying the status information corresponding to that working device;
sending the modified status information to all devices in the disaster recovery system, so that each working device or standby device in the disaster recovery system, when establishing a link, selects the target device of the link according to the status information.
2. The method of claim 1, further comprising, after configuring the correspondence between working devices and standby devices and setting, for each pair of a working device and a standby device, status information indicating whether the working device or the standby device is in service:
storing the correspondence and the status information.
3. A server, comprising:
a configuration module, configured to: configure a correspondence between working devices and standby devices, and set, for each group of a working device and a standby device, status information indicating whether the working device or the standby device is in service;
a lookup module, configured to: find a working device that has failed;
a modification module, configured to: modify the status information corresponding to the failed working device; and
a delivery module, configured to: send the latest correspondence and status information to all devices in the disaster recovery system, so that each working device or standby device in the disaster recovery system, when establishing a link, selects the target device of the link according to the status information.
4. The server of claim 3, further comprising:
a storage module, configured to: store the correspondence and the status information.
5. A disaster recovery method, comprising:
receiving correspondence information, delivered by a server, for the working devices and standby devices in a disaster recovery system, the correspondence information including status information indicating whether the working device or the standby device is in service;
when establishing a link with a target device, selecting, according to the correspondence information, whichever of the target device and its corresponding standby device is in service, and establishing the link with that device.
6. The method of claim 5, further comprising, after receiving the correspondence information, delivered by the server, for the working devices and standby devices in the disaster recovery system:
storing or updating the correspondence information.
7. A device, comprising:
an agent module, configured to: receive correspondence information, delivered by a server, for the working devices and standby devices in a disaster recovery system, the correspondence information including status information indicating whether the working device or the standby device is in service; and
an application module, configured to: when establishing a link with a target device, select, according to the correspondence information, whichever of the target device and its corresponding standby device is in service, and establish the link with that device.
8. The device of claim 7, further comprising:
a storage module, configured to: store or update the correspondence information.
9. A disaster recovery system, comprising: the server of claim 3 and a plurality of devices of claim 7.
PCT/CN2012/072357 2011-09-01 2012-03-15 Method, device, and system for disaster recovery WO2012155630A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110257015.2 2011-09-01
CN201110257015.2A CN102291262B (en) 2011-09-01 2011-09-01 The method, apparatus and system of a kind of disaster tolerance

Publications (1)

Publication Number Publication Date
WO2012155630A1 true WO2012155630A1 (en) 2012-11-22

Family

ID=45337385

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/072357 WO2012155630A1 (en) 2011-09-01 2012-03-15 Method, device, and system for disaster recovery

Country Status (2)

Country Link
CN (1) CN102291262B (en)
WO (1) WO2012155630A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109936462A (en) * 2017-12-15 2019-06-25 华为技术有限公司 Disaster recovery method and device

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102291262B (en) * 2011-09-01 2018-03-23 中兴通讯股份有限公司 The method, apparatus and system of a kind of disaster tolerance
CN103166797B (en) * 2013-03-18 2016-04-06 京信通信系统(中国)有限公司 A kind of management method of application module and device
CN104426693B (en) * 2013-08-27 2018-02-09 新华三技术有限公司 A kind of method and apparatus that forwarding unit port status management is exchanged in software defined network
CN104539462B (en) * 2015-01-09 2017-12-19 北京京东尚科信息技术有限公司 It is a kind of to switch to method and device of the calamity for application example
CN105429799B (en) * 2015-11-30 2019-06-11 浙江宇视科技有限公司 Server backup method and device
CN105528259B (en) * 2016-03-01 2018-08-21 浪潮天元通信信息系统有限公司 A kind of application redundancy automation switching control design method
CN109698769B (en) * 2019-02-18 2022-03-22 深信服科技股份有限公司 Application disaster tolerance device and method, terminal device and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101047487A (en) * 2007-04-24 2007-10-03 中控科技集团有限公司 Method and system for solving equipment redundant in industrial control network
CN101247213A (en) * 2007-02-16 2008-08-20 华为技术有限公司 Method and system for master/standby rearrangement
CN101426306A (en) * 2008-10-24 2009-05-06 中国移动通信集团山东有限公司 A disaster tolerance switching method, system and apparatus
CN102291262A (en) * 2011-09-01 2011-12-21 中兴通讯股份有限公司 Disaster recovery method, device and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101022363B (en) * 2007-03-23 2010-08-11 杭州华三通信技术有限公司 Network storage equipment fault protecting method and device
CN101330400A (en) * 2007-06-19 2008-12-24 中兴通讯股份有限公司 Share backup method for baseband collocation resource
CN101902361B (en) * 2010-07-26 2014-09-10 中兴通讯股份有限公司 Disaster tolerance business system and disaster tolerance method

Also Published As

Publication number Publication date
CN102291262B (en) 2018-03-23
CN102291262A (en) 2011-12-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12785783

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 12785783

Country of ref document: EP

Kind code of ref document: A1