WO2021136370A1

WO2021136370A1 - Service restoration method and system for distributed system

Info

Publication number: WO2021136370A1
Application number: PCT/CN2020/141371
Authority: WO
Inventors: 董友球; 杜铁军
Original assignee: 威创集团股份有限公司
Priority date: 2019-12-30
Filing date: 2020-12-30
Publication date: 2021-07-08
Also published as: CN111130899A

Abstract

A service restoration method and system for a distributed system. The method comprises: when it is detected that network connection of a node is in a disconnected state, determining that the node in the disconnected state is a faulty node, and recording an IP address of the faulty node; performing replacement processing on the faulty node, the replacement processing comprising: sending prompt information indicating that the faulty node is to be replaced with a backup node by means of physical replacement; sending a broadcast instruction for device discovery; receiving status information reported by the backup node after the backup node receives the broadcast instruction for device discovery; determining, according to the status information reported by the backup node, that the backup node is an available backup node; sending to the backup node a modification instruction to modify an IP address to the IP address of the faulty node; and re-establishing network connection to the backup node, and re-allocating a service allocated to the faulty node to the backup node. In the present invention, when a node of a distributed system is faulty, plug-and-play of the backup node is implemented, and a service of the distributed system can be quickly restored.

Description

Method and system for business recovery of distributed system

This application claims priority to Chinese patent application No. 201911396738.3, titled "A method and system for business recovery of a distributed system", filed with the Chinese Patent Office on December 30, 2019, the entire content of which is incorporated herein by reference.

Technical field

The present invention relates to the technical field of distributed systems, and more specifically, to a method and system for business recovery of distributed systems.

Background art

Distributed splicing display systems and distributed network seat systems are commonly used equipment in control rooms. A distributed splicing system usually includes multiple distributed splicing nodes; each node is responsible for displaying a certain part of the picture on the splicing wall, and the nodes together form a splicing display system. A distributed network seat system usually has a sending box and a receiving box: the sending box is connected to the business computer, the receiving box is connected to the keyboard, mouse, and monitor, and the audio, video, keyboard, and mouse information is transmitted over the network, achieving human-machine separation and allowing one machine to drive multiple screens.

A distributed system is composed of individual nodes that communicate with one another through IP addresses. Once a node fails, existing approaches generally first find the IP address of the faulty node, then connect the backup node to a computer, set the IP address of the backup node to the IP address of the failed node, modify the corresponding configuration according to business needs, and only then swap the backup node in for the failed node. This series of operations takes at least a few minutes. If the control room is used in an emergency command system, where conditions can change rapidly, key information may be missed within those few minutes, which can lead to decision-making errors.

Summary of the invention

The present invention aims to overcome at least one defect (deficiency) of the above-mentioned prior art and provide a distributed system service recovery method and system, which are used to achieve the effect of quickly recovering system services.

The technical solution adopted by the present invention is a business recovery method of a distributed system, and the method includes:

When detecting that the network connection of a node is in a disconnected state, determine that the disconnected node is a faulty node, and record the IP address of the faulty node;

Perform replacement processing on the failed node, and the replacement processing includes:

Sending a prompt message indicating that the failed node is to be replaced with a standby node by means of physical replacement;

Sending a broadcast instruction for device discovery;

Receiving the status information reported by the standby node after the standby node receives the broadcast instruction for device discovery;

Determining, according to the status information reported by the standby node, that the standby node is an available standby node;

Sending to the standby node a modification instruction to modify its IP address to the IP address of the failed node;

Re-establishing the network connection with the standby node and re-allocating the services allocated to the failed node to the standby node.

When a faulty node is detected, the method of the present invention records the IP address of the faulty node and prompts that the faulty node be physically replaced with a standby node. It then sends a broadcast instruction for device discovery, so that the standby node, which is in the standby state, receives the instruction and reports its own status information. Based on the reported status information, the standby node can be determined to be an available standby node. A modification instruction is then sent to change the IP address of the standby node to the IP address of the failed node, and a network connection with the standby node is established, so that the standby node takes over the original services of the failed node. With the method of the present invention, after a node in the distributed system fails and is replaced by a new node, the configuration modification is completed automatically online without any manual configuration, and the distributed system returns to its original working state. The standby node is plug and play, which simplifies the operation steps, reduces the operation difficulty, and greatly shortens the fault repair time.

The present invention also provides a business recovery system of a distributed system, and the system includes:

The monitoring module is used to monitor the network connection with all nodes. When it is detected that the network connection of a node is in a disconnected state, determine that the disconnected node is a faulty node, and record the IP address of the faulty node;

A prompt message issuing module, which is used to send out prompt information for replacing the faulty node with the standby node through a physical replacement method;

Broadcast module, used to send broadcast instructions for device discovery;

A receiving module, used to receive the status information reported by the standby node after the standby node receives the broadcast instruction for device discovery;

A standby node determining module, configured to determine that the standby node is an available standby node according to the status information reported by the standby node;

An IP address modification module, configured to send to the standby node a modification instruction to modify its IP address to the IP address of the failed node;

The connection establishment module is used for re-establishing the network connection with the standby node and redistributing the services allocated to the failed node to the standby node.

The system of the present invention uses the monitoring module to detect faulty nodes of the distributed system online. When a faulty node is detected, the IP address of the faulty node is recorded, and the prompt message issuing module sends a prompt message indicating that the failed node is to be replaced with a standby node on the physical connection. The broadcast module then sends a broadcast instruction for device discovery, so that the standby node that has been physically connected in place of the failed node can receive it. The receiving module receives the status information reported by the standby node, and the standby node determining module determines, based on that status information, that the standby node is an available standby node. The IP address modification module then sends a modification instruction to change the IP address of the standby node to the IP address of the failed node. Finally, the connection establishment module establishes a network connection with the standby node, and the standby node takes over the original services of the failed node. With the system of the present invention, after a node in the distributed system fails and is replaced by a new node, the configuration modification is completed automatically online without any manual configuration, and the distributed system returns to its original working state. The standby node is plug and play, which simplifies the operation steps, reduces the operation difficulty, and greatly shortens the fault repair time.

Compared with the prior art, the beneficial effects of the present invention are as follows: the method and system of the present invention can be applied to a distributed system; when a node of the distributed system fails, the standby node is plug and play and requires no manual configuration, and the cost of redundant equipment is not increased, so that the services of the distributed system can be restored quickly.

Description of the drawings

Fig. 1 is a flowchart of a method for restoring a service in a distributed system according to Embodiment 1 of the present invention.

FIG. 2 is a flowchart of a method for restoring a service in a distributed system according to Embodiment 2 of the present invention.

FIG. 3 is a framework diagram of a service recovery system of a distributed system according to Embodiment 3 of the present invention.

Detailed description

The drawings of the present invention are only used for exemplary description, and should not be construed as limiting the present invention. In order to better illustrate the following embodiments, some components in the drawings may be omitted, enlarged or reduced, and do not represent the size of the actual product; for those skilled in the art, it is understandable that some well-known structures in the drawings and their descriptions may be omitted.

Embodiment 1

As shown in FIG. 1, the service recovery method of the distributed system in this embodiment includes the following specific steps:

S101: When it is detected that the network connection of a node is in a disconnected state, determine that the disconnected node is a faulty node, and record the IP address of the faulty node;

S102: Send a prompt message indicating that the failed node is to be replaced with a standby node by means of physical replacement;

S103: Send a broadcast instruction for device discovery;

S104: Receive the status information reported by the standby node after the standby node receives the broadcast instruction for device discovery;

S105: Determine that the standby node is an available standby node according to the status information reported by the standby node;

S106: Send to the standby node a modification instruction to modify its IP address to the IP address of the failed node;

S107: Re-establish a network connection with the standby node and redistribute the services allocated to the failed node to the standby node.

In a specific implementation process, the method of this embodiment can be applied to a control module of a distributed system, and the method is implemented through the control module. The control module can be arranged on a server or on a certain node of the distributed system, and its purpose is to manage the logical relationship between the nodes and the state between the nodes. The control module pre-establishes a network connection with all nodes in the main state in the distributed system. After the network connection is established, the control module executes steps S101-S107 above, so that when a node in the main state fails, a standby node in the standby state can be quickly inserted and configured online, thereby quickly restoring the business of the distributed system.

In an optional implementation manner, in step S101 of this embodiment, when it is detected that the network connection with a node is in a disconnected state, alarm information is also sent to the control end of the distributed system. By sending out alarm information, the control end of the distributed system can be reminded so that it can make corresponding decisions.
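Step S101 and the optional alarm could be sketched as follows. This is a minimal illustration only: the `ControlModule` class, its `connected` table, and the alarm text are hypothetical names introduced here, and the source does not specify how the connection state is actually obtained.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ControlModule:
    """Hypothetical control module tracking connections to main-state nodes."""
    connected: Dict[str, bool]                       # IP address -> connection is up
    failed_ips: List[str] = field(default_factory=list)
    alarms: List[str] = field(default_factory=list)  # stands in for the control end

    def check_connections(self) -> List[str]:
        """S101: treat disconnected nodes as faulty and record their IP addresses."""
        newly_failed = []
        for ip, is_up in self.connected.items():
            if not is_up and ip not in self.failed_ips:
                self.failed_ips.append(ip)
                # Optional: send alarm information to the control end.
                self.alarms.append(f"node {ip} disconnected")
                newly_failed.append(ip)
        return newly_failed
```

Calling `check_connections()` repeatedly reports each failed node only once, so the recorded IP address remains available for the later modification instruction.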

In an optional implementation manner, the method of this embodiment indicates that a node has failed by sending the prompt message in S102, so that the faulty node can be physically replaced with the standby node. Specifically, replacing the faulty node with the standby node by means of physical replacement refers to plugging the external physical wiring of the faulty node into the standby node; the physical wiring may include one or more of the power cord, network cable, video cable, and USB cable of the faulty node. The prompt information may include the IP address of the failed node, so that the failed node can be quickly located in actual operation and the standby node can quickly replace it on the physical connection. However, the replacement in this step is only on the physical connection. At this time, the IP address and other related configuration information of the standby node have not been modified to match the failed node, so the standby node still cannot take over the original services of the failed node. The subsequent steps still need to be performed to make the configuration information of the standby node consistent with that of the failed node.

In an optional implementation manner, the status information may include the IP address and MAC address of the standby node, and status information about whether an end-to-end network connection has been established. The IP address of the standby node is a preset initial value. This preset IP address enables the standby node to receive broadcast instructions after it is physically connected and to report its status information. However, even though the standby node has a preset IP address, it has not established an end-to-end connection with the control module of the distributed system or with any other control module. Therefore, the control module can determine whether a node reporting status information is an available standby node based on whether that node has established an end-to-end connection.
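For illustration, the exchange of discovery broadcast, status report, and modification instruction could be carried as small JSON messages handled on the standby node. The message types (`discover`, `status`, `modify_ip`, `ack`), the field names, and the use of the MAC address to address the instruction are all assumptions made for this sketch; the source does not define a wire format.

```python
import ipaddress
import json
from typing import Optional


def build_modify_instruction(target_mac: str, new_ip: str) -> bytes:
    """Control-module side: encode a (hypothetical) modification instruction."""
    ipaddress.ip_address(new_ip)  # raises ValueError for a malformed address
    message = {"type": "modify_ip", "mac": target_mac, "new_ip": new_ip}
    return json.dumps(message).encode("utf-8")


def handle_message(raw: bytes, state: dict) -> Optional[dict]:
    """Standby-node side: react to a discovery broadcast or a modification instruction."""
    msg = json.loads(raw.decode("utf-8"))
    if msg["type"] == "discover":
        # Report IP address, MAC address, and end-to-end connection state.
        return {"type": "status", "ip": state["ip"], "mac": state["mac"],
                "connected": state["connected"]}
    if msg["type"] == "modify_ip" and msg["mac"] == state["mac"]:
        state["ip"] = msg["new_ip"]  # adopt the failed node's IP address
        return {"type": "ack", "ip": state["ip"]}
    return None  # not addressed to this node, or unknown message type
```

Addressing the instruction by MAC address is one way to make sure only the intended standby node changes its address while several nodes share the same preset IP; other schemes are equally possible.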

Further, the specific steps of step S105 include:

According to the status information reported by the standby node, determine whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection; if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection, the standby node is determined to be an available standby node.

The control module judges the received status information of the standby node, and if the IP address of the standby node is a preset initial value and the standby node has not established an end-to-end network connection, the standby node is determined to be an available standby node.
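The two-part test in step S105 can be written as a single predicate. The sketch below assumes a `StatusReport` record and a `PRESET_INITIAL_IP` value of `192.168.0.100`; both are illustrative, since the source only refers to "a preset initial value".

```python
from dataclasses import dataclass

# Assumed factory-default address; the actual preset value is not given in the source.
PRESET_INITIAL_IP = "192.168.0.100"


@dataclass(frozen=True)
class StatusReport:
    """Hypothetical status information reported by a standby node."""
    ip: str
    mac: str
    has_end_to_end_connection: bool


def is_available_standby(report: StatusReport,
                         preset_ip: str = PRESET_INITIAL_IP) -> bool:
    """S105: available only if the IP is still the preset value and
    no end-to-end network connection has been established."""
    return report.ip == preset_ip and not report.has_end_to_end_connection
```

Both conditions must hold: a node that still has the preset address but is already connected to some control module is rejected, as is a node whose address has already been changed.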

Based on the method of this embodiment, the distributed system executes all the above steps S101-S107 by setting the control module to complete the interaction with the standby node, so that the service can be quickly restored when the node of the distributed system fails. In the following, the method of this embodiment is further described in conjunction with the interaction mode between the control module and the standby node.

First, the control module pre-establishes a network connection with all nodes in the primary state in the distributed system, and the standby node in the standby state is preset with an initial IP address;

Then, the control module starts to monitor all nodes in the main state. When it detects that the network connection between a node in the main state and the control module is disconnected, it determines that the disconnected node is a faulty node, records the IP address of the failed node, and actively sends alarm information to the control end of the distributed system;

Then, the control module sends out a prompt message, prompting that the failed node needs to be replaced with a standby node by means of physical replacement;

After the prompt information is noticed, the power cord, network cable, video cable, and USB cable of the faulty node are manually plugged into the standby node, so that the standby node replaces the faulty node on the physical connection;

After the control module sends out the prompt information, it also sends the broadcast instruction of the device discovery;

After the backup node replaces the failed node on the physical connection, the backup node can receive the broadcast instruction by using the preset initial IP address. When the backup node receives the broadcast instruction, it actively reports its status information to the control module.

The control module receives the status information reported by the standby node, and then determines, according to the status information, whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection with any control module. If so, the standby node is determined to be an available standby node;

Then, the control module sends a modification instruction to the available backup node, the instruction instructs the available backup node to modify its own IP address to the IP address of the failed node;

After the available backup node receives the modification instruction, it immediately sets its own IP address according to the relevant parameters in the modification instruction.

After the available backup node modifies the IP address, the control module re-establishes the network connection with the available backup node and redistributes the services allocated to the failed node to the backup node.
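The interaction described above can be simulated in memory to show the order of operations. In this sketch the `StandbyNode` class, the `recover` function, the `PRESET_IP` value, and the way services are represented (a list of names) are all hypothetical; the network transport and the manual cable swap are outside its scope.

```python
from typing import Dict, List, Optional

PRESET_IP = "192.168.0.100"  # assumed preset initial IP address


class StandbyNode:
    """Hypothetical standby node that has just been physically connected."""

    def __init__(self, mac: str) -> None:
        self.ip = PRESET_IP
        self.mac = mac
        self.connected = False        # end-to-end connection to a control module
        self.services: List[str] = []

    def on_discovery_broadcast(self) -> Dict[str, object]:
        # The node actively reports its status when the broadcast arrives.
        return {"ip": self.ip, "mac": self.mac, "connected": self.connected}

    def on_modify_ip(self, new_ip: str) -> None:
        self.ip = new_ip


def recover(failed_ip: str, failed_services: List[str],
            reachable: List[StandbyNode]) -> Optional[StandbyNode]:
    """Run S103-S107 against the nodes that can hear the broadcast."""
    for node in reachable:
        status = node.on_discovery_broadcast()               # S103, S104
        if status["ip"] == PRESET_IP and not status["connected"]:  # S105
            node.on_modify_ip(failed_ip)                      # S106
            node.connected = True                             # S107: reconnect
            node.services = list(failed_services)             # S107: reallocate
            return node
    return None  # no available standby node answered
```

Returning `None` when no node qualifies leaves the decision about retrying the broadcast to the caller, which the source does not specify.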

The method of the present invention realizes the effect of rapid service recovery of the distributed system through the plug and play of the standby node, simplifies the process of configuring the standby node, and greatly shortens the repair time of the fault.

Embodiment 2

The difference from Embodiment 1 is that the service recovery method of this embodiment addresses how to repair multiple failed nodes when more than one faulty node is detected in step S101 of Embodiment 1.

As shown in FIG. 2, the specific steps of a method for business recovery of a distributed system in this embodiment include:

S201: When it is detected that the network connections of multiple nodes are in a disconnected state, determine that the multiple disconnected nodes are all faulty nodes, and record the IP addresses of the multiple faulty nodes;

S202: Perform numbering and sorting of multiple faulty nodes according to a preset numbering rule;

S203: Perform the following steps S204-S209 cyclically according to the numbering of the failed nodes so that all the failed nodes are replaced by available spare nodes;

S204: Send a prompt message indicating that the failed node is to be replaced with a standby node by means of physical replacement;

S205: Send a broadcast instruction for device discovery;

S206: Receive the status information reported by the standby node after the standby node receives the broadcast instruction for device discovery;

S207: Determine that the standby node is an available standby node according to the status information reported by the standby node;

S208: Send to the standby node a modification instruction to modify its IP address to the IP address of the failed node;

S209: Re-establish a network connection with the standby node and redistribute the services allocated to the failed node to the standby node.

In an optional implementation manner, the preset numbering rule may be to number the nodes according to their logical relationship in the distributed system, and then to sort the faulty nodes by node number.

Since multiple nodes fail at the same time, a rule must be formulated in order to avoid replacing the wrong equipment. The rule can be formulated according to the characteristics of the system. For example, in a splicing wall system, the nodes corresponding to the screens can be numbered from left to right and from top to bottom, with the node in the upper-left corner numbered 1 and the numbers increasing in sequence. The rule can be that each time, the node with the smallest or the largest number among the current failed nodes is replaced.
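As a worked example of such a rule, the sketch below numbers the nodes of a splicing wall in row-major order and derives the order in which failed nodes would be replaced. The function names and the zero-based `row`/`col` convention are assumptions for illustration.

```python
from typing import Iterable, List


def node_number(row: int, col: int, columns: int) -> int:
    """Number nodes left to right, top to bottom; the upper-left node is 1.

    `row` and `col` are zero-based screen positions on a wall with `columns` columns.
    """
    return row * columns + col + 1


def replacement_order(failed_numbers: Iterable[int],
                      smallest_first: bool = True) -> List[int]:
    """Order in which failed nodes are replaced, one per cycle of S204-S209."""
    return sorted(failed_numbers, reverse=not smallest_first)
```

On a wall with 4 columns, the screen in the second row and third column gets number 7; if nodes 7, 2, and 11 fail together, the smallest-first rule replaces them in the order 2, 7, 11.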

Embodiment 3

The invention also provides a business recovery system of the distributed system. As shown in FIG. 3, a service recovery system of a distributed system in this embodiment specifically includes:

The monitoring module 301 is used to monitor the network connection with all nodes. When it is detected that the network connection of a node is in a disconnected state, determine that the disconnected node is a faulty node, and record the IP address of the faulty node;

The prompt message issuing module 302 is configured to send out prompt information for replacing the faulty node with the standby node through a physical replacement method;

The broadcast module 303 is used to send broadcast instructions for device discovery;

The receiving module 304 is configured to receive the status information reported by the standby node after the standby node receives the broadcast instruction for device discovery;

The standby node determining module 305 is configured to determine that the standby node is an available standby node according to the status information reported by the standby node;

The IP address modification module 306 is configured to send to the standby node a modification instruction to modify its IP address to the IP address of the failed node;

The connection establishment module 307 is configured to re-establish a network connection with the backup node and redistribute the services allocated to the failed node to the backup node.

The system of the present invention uses the monitoring module 301 to detect faulty nodes of the distributed system online. When a faulty node is detected, the IP address of the faulty node is recorded, and the prompt message issuing module 302 sends a prompt message indicating that the failed node is to be replaced with a standby node on the physical connection. The broadcast module 303 then sends a broadcast instruction for device discovery, so that the standby node that has been physically connected in place of the failed node can receive it. The receiving module 304 receives the status information reported by the standby node, and the standby node determining module 305 determines, based on that status information, that the standby node is an available standby node. The IP address modification module 306 then sends a modification instruction to change the IP address of the standby node to the IP address of the failed node. Finally, the connection establishment module 307 establishes a network connection with the standby node, and the standby node takes over the original services of the failed node. With the system of the present invention, after a node in the distributed system fails and is replaced by a new node, the configuration modification is completed automatically online without any manual configuration, and the distributed system returns to its original working state. The standby node is plug and play, which simplifies the operation steps, reduces the operation difficulty, and greatly shortens the fault repair time.

In a specific implementation process, the business recovery system of this embodiment can be applied to a control module of a distributed system, and the business recovery system is implemented by setting each module of this embodiment on the control module. In a specific implementation process, the control module can be arranged on a server or on a certain node of a distributed system, and its purpose is to manage the logical relationship between the nodes and the state between the nodes. By setting each module in the control module, when a node in the primary state fails, the distributed system can quickly insert the standby node in the standby state and quickly implement online configuration, thereby quickly restoring the service of the distributed system.
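One way to picture the modules 301-307 being set on the control module is as seven pluggable callables wired together by a small coordinator. The `RecoverySystem` class, its field names, and the callable signatures below are hypothetical and only illustrate the division of responsibilities, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RecoverySystem:
    """Hypothetical wiring of modules 301-307 inside a control module."""
    monitor: Callable[[], Optional[str]]                # 301: failed IP, or None
    prompt: Callable[[str], None]                       # 302: prompt physical replacement
    broadcast: Callable[[], None]                       # 303: device discovery broadcast
    receive: Callable[[], Dict[str, object]]            # 304: status report
    is_available: Callable[[Dict[str, object]], bool]   # 305: availability decision
    modify_ip: Callable[[str], None]                    # 306: send modification instruction
    reconnect: Callable[[str], None]                    # 307: reconnect and reallocate

    def run_once(self) -> bool:
        """Run one recovery pass; return True if a standby node took over."""
        failed_ip = self.monitor()
        if failed_ip is None:
            return False
        self.prompt(failed_ip)
        self.broadcast()
        status = self.receive()
        if not self.is_available(status):
            return False
        self.modify_ip(failed_ip)
        self.reconnect(failed_ip)
        return True
```

Keeping each module behind a plain callable makes it possible to place the coordinator on a server or on a node of the distributed system, as described above, and to substitute test doubles for the network-facing parts.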

In an optional implementation manner, the service recovery system of this embodiment further includes an alarm information sending module, which is used to send alarm information to the control end of the distributed system when the monitoring module 301 detects that the network connection with a node is in a disconnected state. By issuing alarm information, the control end of the distributed system can be reminded so that it can make corresponding decisions.

In an optional implementation manner, in this embodiment, the prompt message issuing module 302 issues a prompt message indicating that a node has failed, so that the faulty node can be physically replaced with the standby node. Specifically, replacing the faulty node with the standby node by means of physical replacement refers to plugging the external physical wiring of the faulty node into the standby node; the physical wiring may include one or more of the power cord, network cable, video cable, and USB cable of the faulty node. The prompt information may include the IP address of the failed node, so that the failed node can be quickly located in actual operation and the standby node can quickly replace it on the physical connection. However, this replacement is only on the physical connection. At this time, the IP address and other related configuration information of the standby node have not been modified to match the failed node, so the standby node still cannot take over the original services of the failed node. The other modules still need to complete their corresponding functions to make the configuration information of the standby node consistent with that of the failed node.

In an optional implementation manner, the status information may include the IP address and MAC address of the standby node, and status information about whether an end-to-end network connection has been established. The IP address of the standby node is a preset initial value. This preset IP address enables the standby node to receive broadcast instructions after it is physically connected and to report its status information. However, even though the standby node has a preset IP address, it has not established an end-to-end connection with the business recovery system of the distributed system or with any other business recovery system. Therefore, the standby node determining module 305 in the business recovery system can determine whether a node reporting status information is an available standby node according to whether that node has established an end-to-end connection.

Further, the standby node determining module 305 is specifically configured to:

According to the status information reported by the standby node, determine whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection with the service recovery system; if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection with the service recovery system, it is determined that the standby node is an available standby node.

The standby node determining module 305 judges the status information of the standby node received, and if the IP address of the standby node is a preset initial value and the standby node has not established an end-to-end network connection, the standby node is determined to be an available standby node.

In an optional implementation manner, when the monitoring module 301 detects that there are multiple disconnected nodes, it determines that the multiple disconnected nodes are all faulty nodes, records the IP addresses of the multiple faulty nodes, and controls the prompt message issuing module 302 to send out the prompt information corresponding to each failed node one by one, so that each failed node can be replaced with a standby node in turn and the service recovery system can complete the repair quickly.

In an optional implementation manner, the monitoring module 301 controls the prompt information issuing module 302 to send out the prompt information corresponding to each faulty node one by one, specifically:

The monitoring module 301 numbers and sorts the multiple faulty nodes according to the preset numbering rule, and sends out the prompt information corresponding to each faulty node one by one in the order of the faulty node numbers, so that all the faulty nodes are replaced by available standby nodes.

The preset numbering rule may be to number the nodes according to their logical relationship in the distributed system, and then to sort the failed nodes by node number.

Since multiple nodes fail at the same time, a rule must be formulated in order to avoid replacing the wrong equipment. The rule can be formulated according to the characteristics of the system. For example, the node corresponding to each screen can be numbered in order, with the node in the upper-left corner numbered 1 and the numbers increasing in sequence. The rule can be that each time, the node with the smallest or the largest number among the current failed nodes is replaced.

Obviously, the above-mentioned embodiments of the present invention are merely examples to clearly illustrate the technical solutions of the present invention, and are not intended to limit the specific implementation manners of the present invention. Any modification, equivalent replacement and improvement made within the spirit and principle of the claims of the present invention shall be included in the protection scope of the claims of the present invention.

Claims

A business recovery method for a distributed system, characterized in that the method includes:

When detecting that the network connection of a node is in a disconnected state, determine that the disconnected node is a faulty node, and record the IP address of the faulty node;

Perform replacement processing on the failed node, and the replacement processing includes:

Sending out prompt information indicating that the failed node is to be replaced with the standby node by means of physical replacement;

Sending a broadcast instruction for device discovery;

Receiving the status information reported by the standby node after the standby node receives the broadcast instruction for device discovery;

Determining that the standby node is an available standby node according to the status information reported by the standby node;

Sending, to the standby node, a modification instruction to modify its IP address to the IP address of the failed node;

Re-establishing the network connection with the standby node and re-allocating the services allocated to the failed node to the standby node.
The service restoration method of a distributed system according to claim 1, wherein the method further comprises:

When it is detected that the network connection of a node is in a disconnected state, alarm information is also sent to the control end of the distributed system.
The method for restoring a distributed system service according to claim 1, wherein the physical replacement method is specifically:

Plugging the external physical wiring of the faulty node into the standby node.
The service recovery method of a distributed system according to claim 1, wherein the status information includes the IP address and MAC address of the standby node, and status information about whether an end-to-end network connection has been established.
The service recovery method of the distributed system according to claim 4, characterized in that determining that the standby node is an available standby node according to the status information reported by the standby node is specifically:

According to the status information reported by the standby node, determining whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection; if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection, the standby node is determined to be an available standby node.
The service recovery method of a distributed system according to claim 1, wherein when it is detected that the network connection of a node is in a disconnected state and there are multiple nodes in the disconnected state, it is determined that the multiple nodes in the disconnected state are all failed nodes, the IP addresses of the multiple failed nodes are recorded, and replacement processing is performed on each of the failed nodes.
A business recovery system of a distributed system, which is characterized in that it includes:

The monitoring module is used to monitor the network connection with all nodes and, when it is detected that the network connection of a node is in a disconnected state, to determine that the disconnected node is a faulty node and record the IP address of the faulty node;

A prompt message issuing module, which is used to send out prompt information for replacing the faulty node with the standby node through a physical replacement method;

Broadcast module, used to send broadcast instructions for device discovery;

The receiving module is used to receive the status information reported by the standby node after the standby node receives the broadcast instruction for device discovery;

A spare node determining module, configured to determine that the spare node is an available spare node according to the status information reported by the spare node;

An IP address modification module, configured to send, to the standby node, a modification instruction to modify its IP address to the IP address of the failed node;

The connection establishment module is used for re-establishing the network connection with the standby node and redistributing the services allocated to the failed node to the standby node.
The service recovery system of the distributed system according to claim 7, wherein the status information includes the IP address and MAC address of the standby node, and status information about whether an end-to-end network connection has been established with the service recovery system.
The service recovery system of a distributed system according to claim 8, wherein the standby node determination module is specifically configured to:

According to the status information reported by the standby node, determine whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection with the service recovery system; if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection with the service recovery system, it is determined that the standby node is an available standby node.
The business recovery system of a distributed system according to claim 8, wherein the monitoring module is specifically configured to:

When it is detected that the network connection of a node is disconnected and there are multiple nodes in the disconnected state, determine that the multiple disconnected nodes are all faulty nodes, record the IP addresses of the multiple faulty nodes, and control the prompt information issuing module to send out the prompt information corresponding to each faulty node one by one.