CN111130899A - Service recovery method and system for distributed system - Google Patents

Info

Publication number
CN111130899A
Authority
CN
China
Prior art keywords
node
standby node
standby
address
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911396738.3A
Other languages
Chinese (zh)
Inventor
董友球
杜铁军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vtron Group Co Ltd
Original Assignee
Vtron Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vtron Group Co Ltd
Priority to CN201911396738.3A
Publication of CN111130899A
Priority to PCT/CN2020/141371 (WO2021136370A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/50 Address allocation
    • H04L61/5007 Internet protocol [IP] addresses

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention relates to a service recovery method and a service recovery system for a distributed system. The method comprises the following steps: when the network connection with a node is detected to be in a disconnected state, determining the node in the disconnected state to be a failed node and recording the IP address of the failed node; and performing replacement processing on the failed node, the replacement processing including: sending prompt information indicating that a standby node should replace the failed node by physical replacement; sending a device-discovery broadcast instruction; receiving state information reported by the standby node after it receives the device-discovery broadcast instruction; determining, according to the state information reported by the standby node, that the standby node is an available standby node; sending to the standby node a modification instruction for changing its IP address to the IP address of the failed node; and re-establishing a network connection with the standby node and reallocating to it the services that were allocated to the failed node. When a node of the distributed system fails, the invention makes the standby node plug-and-play and quickly restores the services of the distributed system.

Description

Service recovery method and system for distributed system
Technical Field
The present invention relates to the technical field of distributed systems, and in particular, to a method and a system for recovering services of a distributed system.
Background
Distributed spliced display systems and distributed network seat systems are common equipment in control rooms. A distributed splicing system generally comprises a plurality of distributed splicing nodes; each node is responsible for displaying a certain part of the spliced wall, and the nodes together form the spliced display system. A distributed network seat system generally has a sending box and a receiving box: the sending box is connected to a service computer, the receiving box is connected to a keyboard, a mouse and a display, and audio, video, keyboard and mouse information is transmitted over a network, thereby separating the operator from the machine and allowing one machine to drive multiple screens.
Because a distributed system is composed of nodes that are linked to one another through IP addresses, once a node fails the existing approach is generally to first find the IP address of the failed node, then connect a standby node to a computer, set the IP address of the standby node to the IP address of the failed node, modify the corresponding configuration according to service needs, and only then replace the failed node with the standby node. This series of operations takes at least several minutes. If the control room serves an emergency command system, even a small delay during an emergency can have serious consequences, so a delay of several minutes may cause key information to be missed and lead to a wrong decision.
Disclosure of Invention
The present invention aims to overcome at least one of the above-mentioned deficiencies of the prior art and to provide a service recovery method and system for a distributed system, so as to quickly restore the services of the system.
The technical scheme adopted by the invention is a service recovery method for a distributed system, which comprises the following steps:
when the network connection with a node is detected to be in a disconnected state, determining the node in the disconnected state to be a failed node, and recording the IP address of the failed node;
performing replacement processing on the failed node, the replacement processing comprising:
sending prompt information indicating that a standby node should replace the failed node by physical replacement;
sending a device-discovery broadcast instruction;
receiving state information reported by the standby node after it receives the device-discovery broadcast instruction;
determining, according to the state information reported by the standby node, that the standby node is an available standby node;
sending to the standby node a modification instruction for changing its IP address to the IP address of the failed node;
re-establishing a network connection with the standby node and reallocating to the standby node the services that were allocated to the failed node.
In this method, the IP address of the failed node is recorded when the failure is detected, and the failed node is then physically replaced by a standby node. A device-discovery broadcast instruction is sent so that the standby node, which is in the standby state, receives it and reports its own state information; from the reported state information the standby node can be determined to be an available standby node. A modification instruction is then sent to change the IP address of the standby node to the IP address of the failed node, after which a network connection is established with the standby node and the standby node takes over the services originally carried by the failed node. The method thus addresses the problem of reconfiguring a replacement node after a node of the distributed system fails: the configuration changes are completed automatically online without any manual configuration, the distributed system returns to its original working state, the standby node is plug-and-play, the operating steps are simplified, the operating difficulty is reduced, and the fault repair time is greatly shortened.
The invention also provides a service recovery system of the distributed system, which comprises:
a monitoring module, configured to monitor the network connections with all nodes, to determine a node to be a failed node when the network connection with that node is detected to be in a disconnected state, and to record the IP address of the failed node;
a prompt information sending module, configured to send prompt information indicating that a standby node should replace the failed node by physical replacement;
a broadcast module, configured to send a device-discovery broadcast instruction;
a receiving module, configured to receive state information reported by the standby node after it receives the device-discovery broadcast instruction;
a standby node determining module, configured to determine, according to the state information reported by the standby node, that the standby node is an available standby node;
an IP address modification module, configured to send to the standby node a modification instruction for changing its IP address to the IP address of the failed node;
and a connection establishing module, configured to re-establish a network connection with the standby node and to reallocate to the standby node the services that were allocated to the failed node.
The system of the invention uses the monitoring module to detect failed nodes of the distributed system online and records the IP address of a failed node when one is detected. The prompt information sending module then sends prompt information indicating that the failed node should be replaced by a standby node on the physical connections. The broadcast module sends a device-discovery broadcast instruction so that the standby node that has physically replaced the failed node receives it. The receiving module receives the state information reported by the standby node, and the standby node determining module determines from that state information that the standby node is an available standby node. The IP address modification module changes the IP address of the standby node to the IP address of the failed node by sending a modification instruction. Finally, the connection establishing module establishes a network connection with the standby node so that the standby node takes over the services originally carried by the failed node. The system thus addresses the problem of reconfiguring a replacement node after a node of the distributed system fails: the configuration changes are completed automatically online without any manual configuration, the distributed system returns to its original working state, the standby node is plug-and-play, the operating steps are simplified, the operating difficulty is reduced, and the fault repair time is greatly shortened.
Compared with the prior art, the invention has the beneficial effects that: the method and the system can be applied to a distributed system, when the nodes of the distributed system fail, the standby nodes can be used in a plug-and-play mode under the condition that the cost of redundant equipment is not increased, the standby nodes do not need to be manually configured, and the effect of quickly recovering the service of the distributed system can be realized.
Drawings
Fig. 1 is a flowchart of a service recovery method for a distributed system according to embodiment 1 of the present invention.
Fig. 2 is a flowchart of a service recovery method for a distributed system according to embodiment 2 of the present invention.
Fig. 3 is a framework diagram of a service recovery system of a distributed system according to embodiment 3 of the present invention.
Detailed Description
The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For a better understanding of the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Example 1
As shown in fig. 1, a service recovery method for a distributed system in this embodiment includes the following specific steps:
S101, when the network connection with a node is detected to be in a disconnected state, determining the node in the disconnected state to be a failed node, and recording the IP address of the failed node;
S102, sending prompt information indicating that a standby node should replace the failed node by physical replacement;
S103, sending a device-discovery broadcast instruction;
S104, receiving state information reported by the standby node after it receives the device-discovery broadcast instruction;
S105, determining, according to the state information reported by the standby node, that the standby node is an available standby node;
S106, sending to the standby node a modification instruction for changing its IP address to the IP address of the failed node;
S107, re-establishing a network connection with the standby node and reallocating to the standby node the services that were allocated to the failed node.
In a specific implementation, the method of this embodiment may be applied to, and carried out by, a control module of the distributed system. The control module may be deployed on a server or on one of the nodes of the distributed system, and its purpose is to manage the logical relationships between the nodes and the states of the nodes. The control module establishes network connections in advance with all nodes of the distributed system that are in the main state. Once these connections are established, the control module executes steps S101 to S107, so that when a node in the main state fails, a standby node in the standby state can be quickly plugged in and configured online, and the services of the distributed system can be quickly restored.
In an optional implementation manner, in step S101 of this embodiment, when it is detected that the network connection with the node is in a disconnected state, alarm information is further sent to the control end of the distributed system. The control end of the distributed system can be prompted by sending the alarm information, so that the control end can make corresponding decisions.
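The detection and alarm behaviour of step S101 can be sketched as follows. This is a minimal illustration only: the patent does not say how a disconnection is detected, so the `Monitor` class, its `connected` table of per-node flags and the alarm text are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Monitor:
    """Hypothetical helper that tracks the connection state of main-state nodes."""
    connected: Dict[str, bool]                       # node IP -> connection is up
    failed_ips: List[str] = field(default_factory=list)
    alarms: List[str] = field(default_factory=list)

    def poll(self) -> List[str]:
        """Record newly disconnected nodes as failed and raise one alarm each."""
        newly_failed = []
        for ip, is_up in self.connected.items():
            if not is_up and ip not in self.failed_ips:
                self.failed_ips.append(ip)           # S101: record the failed node's IP
                self.alarms.append(f"node {ip} disconnected")  # optional alarm
                newly_failed.append(ip)
        return newly_failed


monitor = Monitor(connected={"192.168.1.11": True, "192.168.1.12": False})
print(monitor.poll())   # the disconnected node is reported once
print(monitor.poll())   # a second poll reports nothing new
```

Keeping the list of recorded IP addresses separate from the live connection table means a node is alarmed only once, even if it stays disconnected across several polls.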
In an optional implementation manner, the prompt information sent in step S102 indicates that a node has failed, so that the failed node can be physically replaced by the standby node. Specifically, replacing the failed node with the standby node by physical replacement means plugging the physical cables externally connected to the failed node into the standby node; these cables may include one or more of the power cable, network cable, video cable and USB cable of the failed node. The prompt information may include the IP address of the failed node, so that the failed node can be located quickly in practice and the standby node can quickly take its place on the physical connections. This step, however, replaces only the physical connections. Because configuration information such as the IP address of the standby node has not yet been changed to match the failed node, the standby node cannot yet take over the services of the failed node; the subsequent steps must still be executed to make the configuration information of the standby node consistent with that of the failed node.
In an alternative embodiment, the state information may include the IP address of the standby node, its MAC address, and an indication of whether an end-to-end network connection has been established. The IP address of the standby node is a preset initial value; once physically connected, the standby node can receive the broadcast instruction and report its state information using this preset IP address. Although the standby node has a preset IP address, it has not established an end-to-end connection with the control module of the distributed system or with any other control module. The control module can therefore determine whether a node reporting state information is an available standby node according to whether that node has already established an end-to-end connection.
Further, step S105 specifically includes:
judging, according to the state information reported by the standby node, whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection; if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection, determining that the standby node is an available standby node.
That is, the control module evaluates the state information received from the standby node and treats the node as an available standby node only when both conditions hold.
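The two-condition check of step S105 can be written as a small predicate. The sketch below is illustrative: the `StatusReport` fields mirror the state information listed above, while the field names, the `PRESET_IP` value and the sample MAC addresses are assumptions, not values given in the patent.

```python
from dataclasses import dataclass

PRESET_IP = "192.168.0.250"   # assumed preset initial address, for illustration only


@dataclass(frozen=True)
class StatusReport:
    ip: str                          # IP address the node is currently using
    mac: str                         # MAC address of the node
    has_end_to_end_connection: bool  # already connected to some control module?


def is_available_standby(report: StatusReport, preset_ip: str = PRESET_IP) -> bool:
    """S105: available only if still on the preset IP and not yet connected."""
    return report.ip == preset_ip and not report.has_end_to_end_connection


fresh = StatusReport(PRESET_IP, "00:11:22:33:44:55", False)
in_use = StatusReport("192.168.1.12", "00:11:22:33:44:66", True)
print(is_available_standby(fresh), is_available_standby(in_use))  # True False
```

Requiring both conditions keeps a node that is already serving the system from being mistaken for a spare when it answers the same discovery broadcast.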
Based on the method of this embodiment, the distributed system sets the control module to execute all the steps of steps S101 to S107, and completes the interaction with the standby node, so that the service can be quickly recovered when the node of the distributed system fails. The method of the present embodiment is further described below in conjunction with the interaction between the control module and the standby node.
First, the control module establishes network connections in advance with all nodes in the main state in the distributed system, and each standby node in the standby state is preset with an initial IP address;
then, the control module starts monitoring all nodes in the main state; when it detects that the network connection between a node in the main state and the control module is disconnected, it determines that node to be a failed node, records the IP address of the failed node, and actively sends alarm information to the control end of the distributed system;
then, the control module sends prompt information indicating that the failed node needs to be replaced by a standby node by physical replacement;
when the prompt information is noticed, the power cable, network cable, video cable and USB cable of the failed node are plugged into the standby node, for example manually, so that the standby node replaces the failed node on the physical connections;
after sending the prompt information, the control module sends a device-discovery broadcast instruction;
once the standby node has replaced the failed node on the physical connections, it can receive the broadcast instruction using its preset initial IP address, and on receiving the broadcast instruction it actively reports its state information to the control module;
the control module receives the state information reported by the standby node and judges from it whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection with any control module; if both conditions hold, the standby node is determined to be an available standby node;
then, the control module sends a modification instruction to the available standby node, instructing it to change its IP address to the IP address of the failed node;
after receiving the modification instruction, the available standby node immediately sets its own IP address according to the parameters in the modification instruction;
once the available standby node has changed its IP address, the control module re-establishes the network connection with it and reallocates to it the services that were allocated to the failed node.
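The interaction described above can be simulated in memory to make the order of messages concrete. This is a sketch under stated assumptions: the class and method names (`StandbyNode`, `ControlModule`, `on_discovery`, `on_modify_ip`, `recover`), the `PRESET_IP` value and the dictionary-based status report are hypothetical, and real network broadcast is replaced by a plain loop over node objects.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

PRESET_IP = "192.168.0.250"  # assumed preset initial address of every spare


@dataclass
class StandbyNode:
    mac: str
    ip: str = PRESET_IP
    connected: bool = False

    def on_discovery(self) -> dict:
        """Reply to the device-discovery broadcast with this node's state."""
        return {"ip": self.ip, "mac": self.mac, "connected": self.connected}

    def on_modify_ip(self, new_ip: str) -> None:
        """Apply the IP address carried by the modification instruction."""
        self.ip = new_ip


class ControlModule:
    def __init__(self, services: Dict[str, str]) -> None:
        self.services = services                  # node IP -> assigned service
        self.links: Dict[str, StandbyNode] = {}   # re-established connections
        self.prompts: List[str] = []

    def recover(self, failed_ip: str, network: List[StandbyNode]) -> Optional[StandbyNode]:
        # S102: prompt that the failed node should be physically replaced.
        self.prompts.append(f"replace node {failed_ip} with a standby node")
        # S103/S104: "broadcast" discovery and collect each node's report.
        for node in network:
            report = node.on_discovery()
            # S105: accept only a node on the preset IP with no connection yet.
            if report["ip"] == PRESET_IP and not report["connected"]:
                node.on_modify_ip(failed_ip)      # S106: take over the failed IP
                node.connected = True             # S107: connection re-established
                self.links[failed_ip] = node
                return node
        return None                               # no available standby node replied


control = ControlModule(services={"192.168.1.12": "wall screen 2"})
spare = StandbyNode(mac="00:11:22:33:44:55")
replaced = control.recover("192.168.1.12", network=[spare])
print(replaced.ip, control.services[replaced.ip])  # 192.168.1.12 wall screen 2
```

Because the spare adopts the failed node's IP address, the service table keyed by IP needs no change in this sketch: the service simply resolves to the new device.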
The method of the invention realizes the effect of rapidly recovering the service of the distributed system by the plug and play of the standby node, simplifies the process of the standby node configuration and greatly shortens the time for repairing the fault.
Example 2
Unlike embodiment 1, the service recovery method of this embodiment addresses how to repair multiple failed nodes when more than one failed node is detected in step S101 of embodiment 1.
As shown in fig. 2, the specific steps of the service recovery method for a distributed system in this embodiment include:
S201, when the network connections with nodes are detected to be in a disconnected state and more than one node is in the disconnected state, determining that all of the nodes in the disconnected state are failed nodes, and recording the IP addresses of the failed nodes;
S202, numbering and sorting the failed nodes according to a preset numbering rule;
S203, executing the following steps S204 to S209 in a loop, in the order of the sorted node numbers, until all failed nodes have been replaced by available standby nodes;
S204, sending prompt information indicating that a standby node should replace the failed node by physical replacement;
S205, sending a device-discovery broadcast instruction;
S206, receiving state information reported by the standby node after it receives the device-discovery broadcast instruction;
S207, determining, according to the state information reported by the standby node, that the standby node is an available standby node;
S208, sending to the standby node a modification instruction for changing its IP address to the IP address of the failed node;
S209, re-establishing a network connection with the standby node and reallocating to the standby node the services that were allocated to the failed node.
In an optional implementation manner, the preset numbering rule may be that the nodes are numbered according to their logical relationship in the distributed system, and the failed nodes are then sorted by node number.
Because several nodes have failed at the same time, a rule must be established to avoid replacing the wrong device, and the rule can be formulated according to the characteristics of the system. For example, because a spliced wall system has a row-and-column ordering, the nodes corresponding to the screens can be numbered from left to right and from top to bottom, with the node at the upper left corner being node No. 1 and the numbers increasing in sequence. The rule may then be that, each time, the lowest-numbered or the highest-numbered of the currently failed nodes is replaced.
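The left-to-right, top-to-bottom numbering and the resulting replacement order can be illustrated as follows. The function names and the use of zero-based `(row, col)` positions are assumptions made for the example; the patent only fixes that the upper left node is No. 1.

```python
from typing import List, Tuple


def node_number(row: int, col: int, columns: int) -> int:
    """Row-major number of a screen's node; the upper left node is No. 1."""
    return row * columns + col + 1


def replacement_order(failed: List[Tuple[int, int]], columns: int,
                      lowest_first: bool = True) -> List[Tuple[int, int]]:
    """Sort failed (row, col) positions by node number, lowest or highest first."""
    return sorted(failed,
                  key=lambda rc: node_number(rc[0], rc[1], columns),
                  reverse=not lowest_first)


# A wall with 3 columns; failures at nodes No. 4, No. 3 and No. 1.
failed = [(1, 0), (0, 2), (0, 0)]
print(replacement_order(failed, columns=3))  # [(0, 0), (0, 2), (1, 0)]
```

Fixing the order in advance means that each prompt and each discovered spare corresponds to exactly one known failed node, which is what prevents a spare from being given the wrong IP address.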
Example 3
The invention also provides a service recovery system of the distributed system. As shown in fig. 3, a service recovery system of a distributed system in this embodiment specifically includes:
the monitoring module 301 is configured to monitor network connections with all nodes, determine that a node in a disconnected state is a failed node when detecting that the network connection with the node is in the disconnected state, and record an IP address of the failed node;
a prompt message sending module 302, configured to send a prompt message that the standby node replaces the failed node in a physical replacement manner;
a broadcast module 303, configured to send a broadcast instruction for device discovery;
a receiving module 304, configured to receive state information reported by the standby node after it receives the device-discovery broadcast instruction;
a standby node determining module 305, configured to determine, according to the state information reported by the standby node, that the standby node is an available standby node;
an IP address modification module 306, configured to send to the standby node a modification instruction for changing its IP address to the IP address of the failed node;
a connection establishing module 307 for re-establishing a network connection with a backup node and re-allocating traffic allocated to the failed node to the backup node.
The system of the invention uses the monitoring module 301 to detect failed nodes of the distributed system online and records the IP address of a failed node when one is detected. The prompt information sending module 302 then sends prompt information indicating that the failed node should be replaced by a standby node on the physical connections. The broadcast module 303 sends a device-discovery broadcast instruction so that the standby node that has physically replaced the failed node receives it. The receiving module 304 receives the state information reported by the standby node, and the standby node determining module 305 determines from that state information that the standby node is an available standby node. The IP address modification module 306 changes the IP address of the standby node to the IP address of the failed node by sending a modification instruction. Finally, the connection establishing module 307 establishes a network connection with the standby node so that the standby node takes over the services originally carried by the failed node. The system thus addresses the problem of reconfiguring a replacement node after a node of the distributed system fails: the configuration changes are completed automatically online without any manual configuration, the distributed system returns to its original working state, the standby node is plug-and-play, the operating steps are simplified, the operating difficulty is reduced, and the fault repair time is greatly shortened.
In a specific implementation, the service recovery system of this embodiment may be applied to a control module of the distributed system, and is realized by providing each module of this embodiment on the control module. The control module may be deployed on a server or on one of the nodes of the distributed system, and its purpose is to manage the logical relationships between the nodes and the states of the nodes. With these modules provided in the control module, when a node in the main state fails, a standby node in the standby state can be quickly plugged in and configured online, so that the services of the distributed system can be quickly restored.
In an optional implementation manner, the service recovery system of this embodiment further includes an alarm information sending module, where the alarm information sending module is configured to send alarm information to the control end of the distributed system when the monitoring module 301 detects that the network connection with the node is in a disconnected state. The control end of the distributed system can be reminded by sending out the alarm information, so that the control end can make corresponding decisions.
In an optional implementation manner, the present embodiment prompts the failed node to fail by sending the prompt information by the prompt information sending module 302, so that the failed node can be physically replaced by the standby node. Specifically, the replacement of the failed node by the standby node through a physical replacement mode refers to inserting physical wiring externally connected to the failed node into the standby node, and the physical wiring may include one or more of a power line, a network cable, a video line and a USB line of the failed node. The prompt message may include an IP address of the failed node, and the failed node may be quickly tracked in actual operation according to the IP address of the failed node, so that the failed node may be quickly replaced by the standby node on the physical connection. However, the replacement is only a physical connection, and at this time, because the relevant configuration information such as the IP address of the standby node is not modified to be consistent with the failed node, in actual use, the standby node still cannot replace the failed node and still cannot replace the failed node to execute the original service of the failed node, and at this time, other modules still need to be used to complete corresponding functions, so that the configuration information of the standby node is consistent with the failed node.
In an optional embodiment, the state information may include the IP address of the standby node, its MAC address, and an indication of whether an end-to-end network connection has been established. The IP address of the standby node is a preset initial value; once physically connected, the standby node can receive the device discovery broadcast instruction at this IP address and report its state information through it. However, a standby node that still has the preset IP address has not established an end-to-end connection with the service recovery system of this distributed system or with any other service recovery system. The standby node determining module 305 in the service recovery system can therefore determine whether a node reporting state information is an available standby node according to whether that node has established an end-to-end connection.
Further, the standby node determining module 305 is specifically configured to:
judge, according to the state information reported by the standby node, whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection with the service recovery system; and, if the IP address is the preset initial value and no end-to-end network connection has been established with the service recovery system, determine that the standby node is an available standby node.
That is, the standby node determining module 305 evaluates the received state information of the standby node, and the standby node is determined to be available when its IP address is still the preset initial value and it has not established an end-to-end network connection.
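The availability test performed by the standby node determining module 305 amounts to a two-condition check, which can be sketched as below. The `StatusReport` type, the `is_available_standby` function and the concrete value of `PRESET_INITIAL_IP` are assumptions made for illustration; the embodiment only states that the address is "a preset initial value".

```python
from dataclasses import dataclass

# Assumed factory-default address, chosen only for this example.
PRESET_INITIAL_IP = "192.168.0.100"


@dataclass(frozen=True)
class StatusReport:
    """State information reported after the device discovery broadcast."""
    ip: str
    mac: str
    has_end_to_end_connection: bool


def is_available_standby(report: StatusReport,
                         preset_ip: str = PRESET_INITIAL_IP) -> bool:
    """Available only if the IP is still the preset value and the node
    has no end-to-end connection with a service recovery system."""
    return report.ip == preset_ip and not report.has_end_to_end_connection


# Three hypothetical reports: a fresh spare, a spare already claimed by a
# recovery system, and a node whose IP address has already been reconfigured.
fresh = StatusReport("192.168.0.100", "00:11:22:33:44:55", False)
in_use = StatusReport("192.168.0.100", "00:11:22:33:44:66", True)
configured = StatusReport("192.168.1.12", "00:11:22:33:44:77", False)
```

Of the three reports, only `fresh` satisfies both conditions and would be accepted as an available standby node.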
In an optional implementation, when the monitoring module 301 detects that a plurality of nodes are in the disconnected state, it determines that all of these nodes are failed nodes, records the IP addresses of the failed nodes, and controls the prompt information sending module 302 to send, one at a time, the prompt information corresponding to each failed node, so that the failed nodes can be replaced by standby nodes one by one and the system can be quickly repaired.
In an optional implementation, the monitoring module 301 controls the prompt information sending module 302 to send the prompt information for each failed node one by one, specifically as follows:
the monitoring module 301 numbers and sorts the plurality of failed nodes according to a preset numbering rule, and the prompt information corresponding to each failed node is sent one at a time in the resulting order, until all failed nodes have been replaced by available standby nodes.
The preset numbering rule may be to number the nodes according to their logical relationship in the distributed system, and then to sort the failed nodes by node number.
Because several nodes may fail at the same time, a rule must be established to avoid replacing the wrong device. The rule can be chosen according to the characteristics of the system. For example, because a spliced-wall (video wall) system has a row-and-column ordering, the nodes corresponding to the screens can be numbered from left to right and from top to bottom, with the node at the upper left corner as node No. 1 and the numbers increasing in sequence. The rule may then be to replace, each time, the lowest-numbered or the highest-numbered of the currently failed nodes.
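The numbering rule for a spliced wall can be sketched as follows, assuming the wall is a regular grid addressed by zero-based (row, column) positions. The function names `number_wall_nodes` and `next_to_replace` are hypothetical and serve only to illustrate the left-to-right, top-to-bottom numbering and the "lowest or highest number first" choice.

```python
from typing import Dict, Iterable, Optional, Tuple

Position = Tuple[int, int]  # (row, column), both zero-based


def number_wall_nodes(rows: int, cols: int) -> Dict[Position, int]:
    """Number screens left to right, top to bottom; the top-left is No. 1."""
    return {(r, c): r * cols + c + 1 for r in range(rows) for c in range(cols)}


def next_to_replace(failed_numbers: Iterable[int],
                    lowest_first: bool = True) -> Optional[int]:
    """Pick the failed node to prompt for next, or None if none remain."""
    remaining = list(failed_numbers)
    if not remaining:
        return None
    return min(remaining) if lowest_first else max(remaining)


# A wall of 2 rows by 3 columns with two failed screens.
numbering = number_wall_nodes(rows=2, cols=3)
failed = {numbering[(1, 0)], numbering[(0, 2)]}   # nodes No. 4 and No. 3

order = []
while failed:
    node = next_to_replace(failed)   # prompt for one node at a time
    order.append(node)
    failed.discard(node)             # treated as replaced in this sketch
```

Prompting for exactly one numbered node at a time is what keeps the operator from plugging a standby node into the wrong position when several screens are dark at once.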
It should be understood that the above-mentioned embodiments of the present invention are only examples for clearly illustrating the technical solutions of the present invention, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the claims of the present invention should be included in the scope of protection of the claims of the present invention.

Claims (10)

1. A method for recovering services in a distributed system, the method comprising:
when a network connection with a node is detected to be in a disconnected state, determining the node in the disconnected state to be a failed node, and recording the IP address of the failed node;
performing replacement processing on the failed node, the replacement processing comprising:
sending prompt information indicating that the failed node is to be replaced by a standby node by means of physical replacement;
sending a device discovery broadcast instruction;
receiving state information reported by the standby node after it receives the device discovery broadcast instruction;
determining, according to the state information reported by the standby node, that the standby node is an available standby node;
sending to the standby node a modification instruction to change its IP address to the IP address of the failed node;
reestablishing a network connection with the standby node, and reallocating to the standby node the services allocated to the failed node.
2. The method for recovering services in a distributed system according to claim 1, wherein the method further comprises:
sending alarm information to the control end of the distributed system when the network connection with the node is detected to be in a disconnected state.
3. The method for recovering services in a distributed system according to claim 1, wherein the physical replacement specifically comprises: plugging the physical cabling externally connected to the failed node into the standby node.
4. The method for recovering services in a distributed system according to claim 1, wherein the state information includes the IP address of the standby node, the MAC address of the standby node, and an indication of whether an end-to-end network connection has been established.
5. The method for recovering services in a distributed system according to claim 4, wherein determining that the standby node is an available standby node according to the state information reported by the standby node specifically comprises:
judging, according to the state information reported by the standby node, whether the IP address of the standby node is a preset initial value and whether the standby node has not established an end-to-end network connection, and, if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection, determining that the standby node is an available standby node.
6. The method for recovering services in a distributed system according to claim 1, wherein, when network connections with a plurality of nodes are detected to be in a disconnected state, all of the nodes in the disconnected state are determined to be failed nodes, the IP addresses of the plurality of failed nodes are recorded, and the replacement processing is performed on each of the failed nodes.
7. A service recovery system for a distributed system, comprising:
a monitoring module, configured to monitor the network connections with all nodes, to determine, when the network connection with a node is detected to be in a disconnected state, that the node in the disconnected state is a failed node, and to record the IP address of the failed node;
a prompt information sending module, configured to send prompt information indicating that the failed node is to be replaced by a standby node by means of physical replacement;
a broadcast module, configured to send a device discovery broadcast instruction;
a receiving module, configured to receive the state information reported by the standby node after it receives the device discovery broadcast instruction;
a standby node determining module, configured to determine, according to the state information reported by the standby node, that the standby node is an available standby node;
an IP address modification module, configured to send to the standby node a modification instruction to change its IP address to the IP address of the failed node;
and a connection establishing module, configured to reestablish a network connection with the standby node and to reallocate to the standby node the services allocated to the failed node.
8. The service recovery system according to claim 7, wherein the state information includes the IP address of the standby node, the MAC address of the standby node, and an indication of whether an end-to-end network connection has been established with the service recovery system.
9. The service recovery system according to claim 8, wherein the standby node determining module is specifically configured to:
judge, according to the state information reported by the standby node, whether the IP address of the standby node is a preset initial value and whether the standby node has not established an end-to-end network connection with the service recovery system, and, if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection with the service recovery system, determine that the standby node is an available standby node.
10. The service recovery system according to claim 8, wherein the monitoring module is specifically configured to:
when network connections with a plurality of nodes are detected to be in a disconnected state, determine that all of the nodes in the disconnected state are failed nodes, record the IP addresses of the plurality of failed nodes, and control the prompt information sending module to send, one by one, the prompt information corresponding to each failed node.
CN201911396738.3A 2019-12-30 2019-12-30 Service recovery method and system for distributed system Pending CN111130899A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911396738.3A CN111130899A (en) 2019-12-30 2019-12-30 Service recovery method and system for distributed system
PCT/CN2020/141371 WO2021136370A1 (en) 2019-12-30 2020-12-30 Service restoration method and system for distributed system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911396738.3A CN111130899A (en) 2019-12-30 2019-12-30 Service recovery method and system for distributed system

Publications (1)

Publication Number Publication Date
CN111130899A true CN111130899A (en) 2020-05-08

Family

ID=70505248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911396738.3A Pending CN111130899A (en) 2019-12-30 2019-12-30 Service recovery method and system for distributed system

Country Status (2)

Country Link
CN (1) CN111130899A (en)
WO (1) WO2021136370A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09244910A (en) * 1996-03-11 1997-09-19 Nippon Steel Corp Backup method for decentralized control system
CN105406980A (en) * 2015-10-19 2016-03-16 浪潮(北京)电子信息产业有限公司 Multi-node backup method and multi-node backup device
CN107145306A (en) * 2017-04-27 2017-09-08 杭州哲信信息技术有限公司 Distributed data storage method and system
CN109151028A (en) * 2018-08-23 2019-01-04 郑州云海信息技术有限公司 A kind of distributed memory system disaster recovery method and device
WO2019049433A1 (en) * 2017-09-06 2019-03-14 日本電気株式会社 Cluster system, cluster system control method, server device, control method, and non-transitory computer-readable medium having program stored therein

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110572275B (en) * 2019-08-01 2022-09-09 新华三技术有限公司成都分公司 Network card switching method and device, server and computer readable storage medium
CN111130899A (en) * 2019-12-30 2020-05-08 威创集团股份有限公司 Service recovery method and system for distributed system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021136370A1 (en) * 2019-12-30 2021-07-08 威创集团股份有限公司 Service restoration method and system for distributed system
CN115002001A (en) * 2022-02-25 2022-09-02 苏州浪潮智能科技有限公司 Method, device, equipment and medium for detecting cluster network sub-health
CN115002001B (en) * 2022-02-25 2023-08-04 苏州浪潮智能科技有限公司 Method, device, equipment and medium for detecting sub-health of cluster network
CN116743752A (en) * 2023-08-11 2023-09-12 山东恒宇电子有限公司 System for realizing data processing load balance by distributed network communication

Also Published As

Publication number Publication date
WO2021136370A1 (en) 2021-07-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200508)