WO2021136370A1 - Service restoration method and system for distributed system - Google Patents

Service restoration method and system for distributed system

Info

Publication number
WO2021136370A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
address
standby
standby node
faulty
Prior art date
Application number
PCT/CN2020/141371
Other languages
French (fr)
Chinese (zh)
Inventor
董友球
杜铁军
Original Assignee
威创集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 威创集团股份有限公司
Publication of WO2021136370A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/50 Address allocation
    • H04L61/5007 Internet protocol [IP] addresses

Definitions

  • the present invention relates to the technical field of distributed systems, and more specifically, to a method and system for business recovery of distributed systems.
  • Distributed splicing display systems and distributed network seat systems are commonly used equipment in control rooms.
  • A distributed splicing system usually includes multiple distributed splicing nodes. Each node is responsible for displaying one part of the splicing wall, and together the nodes form the splicing display system.
  • A distributed network seat system usually has a sending box and a receiving box. The sending box is connected to the business computer, and the receiving box is connected to the keyboard, mouse and monitor. Audio, video, keyboard and mouse information is transmitted over the network, which achieves man-machine separation and one machine with multiple screens.
  • Because a distributed system is composed of individual nodes, the nodes communicate with one another through IP addresses.
  • Once a node fails, existing techniques generally first identify the IP address of the failed node, then connect the backup node to a computer, set the IP address of the backup node to the IP address of the failed node, modify the corresponding configuration according to business needs, and finally replace the failed node with the backup node.
  • This series of operations takes at least a few minutes. If the control room is used in an emergency command system, where a small deviation can lead to a large error, a delay of a few minutes may cause key information to be missed and lead to decision-making errors.
  • the present invention aims to overcome at least one defect (deficiency) of the above-mentioned prior art and provide a distributed system service recovery method and system, which are used to achieve the effect of quickly recovering system services.
  • the technical solution adopted by the present invention is a business recovery method of a distributed system, and the method includes:
  • the network connection with the standby node is re-established and the services allocated to the failed node are re-allocated to the standby node.
  • The method of the present invention records the IP address of the faulty node and prompts a physical replacement in which the standby node first takes the place of the faulty node. It then sends a broadcast instruction for device discovery, so that the standby node, which is in the standby state, receives the instruction and reports its own status information. According to the reported status information, the standby node can be determined to be an available standby node. A modification instruction is then sent to change the IP address of the standby node to the IP address of the failed node, and a network connection with the standby node is established so that the standby node takes over the original services of the failed node.
  • With the method of the present invention, after a node in the distributed system fails and is replaced by a new node, the configuration modification is completed automatically online without any manual configuration, so that the distributed system is restored to its original working state.
  • The standby node is plug-and-play, which simplifies the operation steps, reduces the operating difficulty, and greatly shortens the fault repair time.
  • the present invention also provides a business recovery system of a distributed system, and the system includes:
  • a monitoring module, used to monitor the network connections with all nodes and, when the network connection of a node is detected to be in a disconnected state, to determine that the disconnected node is a faulty node and record the IP address of the faulty node;
  • a prompt message issuing module, used to issue prompt information indicating that the faulty node is to be replaced with the standby node by means of physical replacement;
  • a broadcast module, used to send a broadcast instruction for device discovery;
  • a receiving module, used to receive the status information reported by the standby node after it receives the broadcast instruction for device discovery;
  • a standby node determining module, configured to determine that the standby node is an available standby node according to the status information reported by the standby node;
  • an IP address modification module, configured to send to the standby node a modification instruction to modify its IP address to the IP address of the failed node;
  • a connection establishment module, used to re-establish the network connection with the standby node and reallocate to the standby node the services allocated to the failed node.
  • The system of the present invention uses the monitoring module to detect faulty nodes of the distributed system online. When a faulty node is detected, its IP address is recorded, and the prompt message issuing module issues a prompt message as a reminder to replace the failed node with the standby node on the physical connection. The broadcast module then sends the broadcast instruction for device discovery, so that the standby node that has physically replaced the failed node can receive it. The receiving module receives the status information reported by the standby node, and the standby node determining module determines from that status information that the standby node is an available standby node. The IP address modification module then sends a modification instruction to change the IP address of the standby node to the IP address of the failed node, and finally the connection establishment module establishes a network connection with the standby node, which takes over the original services of the failed node.
  • With the system of the present invention, after a node in the distributed system fails and is replaced by a new node, the configuration modification is completed automatically online without any manual configuration, so that the distributed system is restored to its original working state.
  • The standby node is plug-and-play, which simplifies the operation steps, reduces the operating difficulty, and greatly shortens the fault repair time.
  • The beneficial effects of the present invention are as follows: the method and system of the present invention can be applied to a distributed system.
  • When a node of the distributed system fails, the standby node is plug-and-play without increasing the cost of redundant equipment.
  • The standby node does not need any manual configuration, so the services of the distributed system are restored quickly.
  • FIG. 1 is a flowchart of a method for restoring a service in a distributed system according to Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart of a method for restoring a service in a distributed system according to Embodiment 2 of the present invention.
  • FIG. 3 is a framework diagram of a service recovery system of a distributed system according to Embodiment 3 of the present invention.
  • the service recovery method of the distributed system in this embodiment includes the following specific steps:
  • S102 Send a prompt message that the standby node replaces the failed node by means of physical replacement
  • S105 Determine that the standby node is an available standby node according to the status information reported by the standby node;
  • S106 Send a modification instruction for modifying the IP address to the IP address of the failed node to the backup node;
  • The method of the present invention records the IP address of the faulty node and prompts a physical replacement in which the standby node first takes the place of the faulty node. It then sends a broadcast instruction for device discovery, so that the standby node, which is in the standby state, receives the instruction and reports its own status information. According to the reported status information, the standby node can be determined to be an available standby node. A modification instruction is then sent to change the IP address of the standby node to the IP address of the failed node, and a network connection with the standby node is established so that the standby node takes over the original services of the failed node.
  • With the method of the present invention, after a node in the distributed system fails and is replaced by a new node, the configuration modification is completed automatically online without any manual configuration, so that the distributed system is restored to its original working state.
  • The standby node is plug-and-play, which simplifies the operation steps, reduces the operating difficulty, and greatly shortens the fault repair time.
  • The method of this embodiment can be applied to a control module of a distributed system, and the method is implemented through the control module.
  • The control module can be arranged on a server or on one of the nodes of the distributed system. Its purpose is to manage the logical relationships among the nodes and the states of the nodes.
  • The control module pre-establishes a network connection with all nodes in the primary state in the distributed system. After the network connections are established, the control module executes all of steps S101-S107 above, so that when a node in the primary state fails, a standby node in the standby state can be quickly connected and configured online, thereby quickly restoring the services of the distributed system.
  • In step S101 of this embodiment, when it is detected that the network connection with a node is in a disconnected state, alarm information is also sent to the control end of the distributed system. The alarm information alerts the control end so that it can make corresponding decisions.
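  • The detection and alarm behaviour of step S101 can be illustrated with a minimal sketch. It assumes the control module already knows, for each node IP, whether the connection is up; the function name `detect_failed_nodes`, the `send_alarm` callback and the sample addresses are hypothetical and do not come from the patent text.

```python
from typing import Callable, Dict, List


def detect_failed_nodes(
    connections: Dict[str, bool],
    send_alarm: Callable[[str], None],
) -> List[str]:
    """Return the IPs of disconnected nodes and raise one alarm per node.

    `connections` maps a node IP to True (connected) or False (disconnected).
    How the connection state is obtained is outside this sketch.
    """
    failed: List[str] = []
    for ip, connected in connections.items():
        if not connected:
            failed.append(ip)  # record the IP address of the faulty node
            send_alarm(f"node {ip} is disconnected")  # notify the control end
    return failed


alarms: List[str] = []
failed_ips = detect_failed_nodes(
    {"10.0.0.11": True, "10.0.0.12": False, "10.0.0.13": True},
    alarms.append,
)
print(failed_ips)
```

  • The recorded IP addresses are what the later modification instruction hands to the standby node, so the sketch returns them rather than only logging them.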
  • The method of this embodiment issues the prompt message of step S102 to indicate that a node has failed, so that the faulty node can be physically replaced by the standby node.
  • Replacing the faulty node with the standby node by means of physical replacement refers to plugging the external physical wiring of the faulty node into the standby node. The physical wiring may include one or more of the power cord, network cable, video cable and USB cable of the faulty node.
  • The prompt information may include the IP address of the failed node. In actual operation the failed node can be located quickly by its IP address, so that the standby node can quickly replace the failed node on the physical connection.
  • The replacement in this step is only on the physical connection.
  • The IP address and other related configuration information of the standby node have not yet been modified to match the failed node.
  • The standby node therefore still cannot take over from the failed node.
  • The subsequent steps still need to be performed to make the configuration information of the standby node consistent with that of the failed node.
  • The status information may include the IP address and MAC address of the standby node, and information about whether an end-to-end network connection has been established.
  • The IP address of the standby node is a preset initial value.
  • The preset IP address of the standby node enables it to receive broadcast instructions after it is physically connected and to report status information. However, although the standby node has a preset IP address, it has not established an end-to-end connection with the control module of the distributed system or with any other control module. Therefore, the control module can determine whether the node reporting status information is an available standby node according to whether that node has established an end-to-end connection with it.
  • Step S105 includes:
  • according to the status information reported by the standby node, determining whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection; if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection, the standby node is determined to be an available standby node.
  • The control module evaluates the received status information of the standby node, and if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection, the standby node is determined to be an available standby node.
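  • The availability test of step S105 reduces to two conditions on the reported status. The following is a minimal sketch under assumptions: the `StatusReport` type, its field names and the value of `PRESET_INITIAL_IP` are hypothetical, since the text does not specify a message format or a concrete preset address.

```python
from dataclasses import dataclass

PRESET_INITIAL_IP = "192.168.0.250"  # hypothetical preset initial address


@dataclass(frozen=True)
class StatusReport:
    """Status information a standby node reports after a discovery broadcast."""
    ip_address: str
    mac_address: str
    has_end_to_end_connection: bool


def is_available_standby(report: StatusReport, preset_ip: str = PRESET_INITIAL_IP) -> bool:
    """A node is an available standby node only if both conditions hold:
    its IP is still the preset initial value, and it has not established
    an end-to-end network connection."""
    return report.ip_address == preset_ip and not report.has_end_to_end_connection


fresh = StatusReport(PRESET_INITIAL_IP, "aa:bb:cc:00:00:01", False)
in_use = StatusReport("10.0.0.12", "aa:bb:cc:00:00:02", True)
print(is_available_standby(fresh), is_available_standby(in_use))
```

  • Checking both conditions keeps a node that is already serving another control module from being reconfigured, even if it happens to answer the broadcast.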
  • the distributed system executes all the above steps S101-S107 by setting the control module to complete the interaction with the standby node, so that the service can be quickly restored when the node of the distributed system fails.
  • the method of this embodiment is further described in conjunction with the interaction mode between the control module and the standby node.
  • The control module pre-establishes a network connection with all nodes in the primary state in the distributed system, and the standby node in the standby state is preset with an initial IP address;
  • The control module starts to monitor all nodes in the primary state.
  • When the control module detects that the network connection of a node is disconnected, it determines that the disconnected node is a faulty node, records the IP address of the failed node, and actively sends alarm information to the control end of the distributed system;
  • The control module issues a prompt message indicating that the failed node needs to be replaced with the standby node by means of physical replacement;
  • After the control module issues the prompt message, it also sends the broadcast instruction for device discovery;
  • After the standby node replaces the failed node on the physical connection, the standby node can receive the broadcast instruction by using the preset initial IP address. When the standby node receives the broadcast instruction, it actively reports its status information to the control module.
  • The control module receives the status information reported by the standby node and then, according to the status information, determines whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection with any control module. If so, the standby node is determined to be an available standby node;
  • The control module sends a modification instruction to the available standby node, instructing it to modify its own IP address to the IP address of the failed node;
  • After the available standby node receives the modification instruction, it immediately sets its own IP address according to the relevant parameters in the modification instruction.
  • After the available standby node modifies its IP address, the control module re-establishes the network connection with the available standby node and reallocates to it the services allocated to the failed node.
  • The method of the present invention achieves rapid service recovery of the distributed system through the plug-and-play standby node, simplifies the process of configuring the standby node, and greatly shortens the fault repair time.
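  • The interaction between the control module and the standby node described above can be walked through with an in-memory sketch. It does not open any network sockets: the broadcast, the status report and the modification instruction are modelled as plain method calls. All class, method and field names, and the sample addresses, are hypothetical illustrations rather than part of the patent text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PRESET_INITIAL_IP = "192.168.0.250"  # hypothetical preset address of a standby node


@dataclass
class StandbyNode:
    mac: str
    ip: str = PRESET_INITIAL_IP
    connected: bool = False  # end-to-end connection with a control module

    def report_status(self) -> Dict[str, object]:
        """Answer a device-discovery broadcast with the node's own status."""
        return {"ip": self.ip, "mac": self.mac, "connected": self.connected}

    def apply_modification(self, new_ip: str) -> None:
        """Handle the modification instruction by adopting the given IP."""
        self.ip = new_ip


@dataclass
class ControlModule:
    services_by_ip: Dict[str, List[str]]
    node_by_ip: Dict[str, StandbyNode] = field(default_factory=dict)

    def recover(self, failed_ip: str, reachable: List[StandbyNode]) -> Optional[StandbyNode]:
        """Run the replacement processing for one failed node."""
        print(f"Prompt: physically replace the node at {failed_ip}")
        for node in reachable:  # each node that hears the broadcast reports back
            status = node.report_status()
            if status["ip"] == PRESET_INITIAL_IP and not status["connected"]:
                node.apply_modification(failed_ip)  # modification instruction
                node.connected = True  # network connection re-established
                # Services stay keyed by IP, so they now map to this node.
                self.node_by_ip[failed_ip] = node
                return node
        return None  # no available standby node answered


control = ControlModule(services_by_ip={"10.0.0.12": ["wall-tile-2"]})
spare = StandbyNode(mac="aa:bb:cc:00:00:01")
replacement = control.recover("10.0.0.12", [spare])
print(replacement.ip, control.services_by_ip[replacement.ip])
```

  • Because the standby node adopts the failed node's IP address, the sketch keeps the service table keyed by IP and only swaps which physical node sits behind that address; this is one possible way to model the reallocation step, not the only one.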
  • The service recovery method of a distributed system in this embodiment addresses how to repair multiple failed nodes when more than one faulty node is detected in step S101 of Embodiment 1.
  • the specific steps of a method for business recovery of a distributed system in this embodiment include:
  • S202 Perform numbering and sorting of multiple faulty nodes according to a preset numbering rule
  • S203 Perform the following steps S204-S209 cyclically according to the numbering of the failed nodes so that all the failed nodes are replaced by available spare nodes;
  • S206 Receive the status information reported by the standby node after receiving the broadcast instruction discovered by the device;
  • S207 Determine that the standby node is an available standby node according to the status information reported by the standby node.
  • S208 Send a modification instruction for modifying the IP address to the IP address of the failed node to the standby node;
  • The preset numbering rule may be to number the nodes according to the logical relationship of the nodes in the distributed system, and then to sort the faulty nodes by node number.
  • The rule can be formulated according to the characteristics of the system. For example, in a splicing wall system, the node corresponding to each screen can be numbered from left to right and from top to bottom, with the node in the upper left corner numbered 1 and the numbers increasing in sequence. The rule can be that each time, the node with the smallest or largest number among the current failed nodes is replaced.
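  • The splicing wall example of the numbering rule can be sketched as follows, assuming each screen is identified by a zero-based (row, column) position and the wall has a known number of columns. The function names and the sample positions are hypothetical.

```python
from typing import List, Tuple


def node_number(row: int, col: int, columns: int) -> int:
    """1-based node number, counting left to right, then top to bottom.

    The node in the upper left corner (row 0, col 0) gets number 1.
    """
    return row * columns + col + 1


def replacement_order(
    failed: List[Tuple[int, int]],
    columns: int,
    smallest_first: bool = True,
) -> List[int]:
    """Sort the failed nodes by node number, smallest or largest first."""
    numbers = [node_number(row, col, columns) for row, col in failed]
    return sorted(numbers, reverse=not smallest_first)


# Three failed screens on a wall that is 3 columns wide.
order = replacement_order([(1, 2), (0, 1), (2, 0)], columns=3)
print(order)  # [2, 6, 7]
```

  • Replacing nodes in a fixed order matters because every standby node answers the discovery broadcast with the same preset IP address; handling one failed node at a time lets the operator know which physical position the next prompt refers to.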
  • the invention also provides a business recovery system of the distributed system.
  • a service recovery system of a distributed system in this embodiment specifically includes:
  • the monitoring module 301 is used to monitor the network connection with all nodes. When it is detected that the network connection of a node is in a disconnected state, determine that the disconnected node is a faulty node, and record the IP address of the faulty node;
  • the prompt message issuing module 302 is configured to send out prompt information for replacing the faulty node with the standby node through a physical replacement method
  • the broadcast module 303 is used to send broadcast instructions for device discovery
  • the receiving module 304 is configured to receive the status information reported by the standby node after receiving the broadcast instruction discovered by the device;
  • the standby node determining module 305 is configured to determine that the standby node is an available standby node according to the status information reported by the standby node;
  • the IP address modification module 306 is configured to send a modification instruction for modifying the IP address to the IP address of the failed node to the standby node;
  • the connection establishment module 307 is configured to re-establish a network connection with the backup node and redistribute the services allocated to the failed node to the backup node.
  • The system of the present invention uses the monitoring module 301 to detect faulty nodes of the distributed system online. When a faulty node is detected, its IP address is recorded, and the prompt message issuing module 302 issues a prompt message as a reminder to replace the failed node with the standby node on the physical connection. The broadcast module 303 then sends the broadcast instruction for device discovery, so that the standby node that has physically replaced the failed node can receive it. The receiving module 304 receives the status information reported by the standby node, and the standby node determining module 305 determines from that status information that the standby node is an available standby node. The IP address modification module 306 then sends a modification instruction to change the IP address of the standby node to the IP address of the failed node, and finally the connection establishment module 307 establishes a network connection with the standby node, which takes over the original services of the failed node.
  • With the system of the present invention, after a node in the distributed system fails and is replaced by a new node, the configuration modification is completed automatically online without any manual configuration, so that the distributed system is restored to its original working state.
  • The standby node is plug-and-play, which simplifies the operation steps, reduces the operating difficulty, and greatly shortens the fault repair time.
  • the business recovery system of this embodiment can be applied to a control module of a distributed system, and the business recovery system is implemented by setting each module of this embodiment on the control module.
  • the control module can be arranged on a server or on a certain node of a distributed system, and its purpose is to manage the logical relationship between the nodes and the state between the nodes.
  • The service recovery system of this embodiment further includes an alarm information sending module, which is used to send alarm information to the control end of the distributed system when the monitoring module 301 detects that the network connection with a node is in a disconnected state.
  • The alarm information alerts the control end of the distributed system so that it can make corresponding decisions.
  • The prompt message issuing module 302 issues a prompt message to indicate that a node has failed, so that the faulty node can be physically replaced by the standby node.
  • Replacing the faulty node with the standby node by means of physical replacement refers to plugging the external physical wiring of the faulty node into the standby node. The physical wiring may include one or more of the power cord, network cable, video cable and USB cable of the faulty node.
  • The prompt information may include the IP address of the failed node. In actual operation the failed node can be located quickly by its IP address, so that the standby node can quickly replace the failed node on the physical connection.
  • The status information may include the IP address and MAC address of the standby node, and information about whether an end-to-end network connection has been established.
  • The IP address of the standby node is a preset initial value.
  • The preset IP address of the standby node enables it to receive broadcast instructions after it is physically connected and to report status information.
  • However, although the standby node has a preset IP address, it has not established an end-to-end connection with the business recovery system of the distributed system or with any other business recovery system.
  • Therefore, the standby node determining module 305 in the business recovery system can determine whether the node reporting the status information is an available standby node according to whether that node has established an end-to-end connection with it.
  • The standby node determining module 305 is specifically configured to:
  • determine, according to the status information reported by the standby node, whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection with the service recovery system; if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection with the service recovery system, the standby node is determined to be an available standby node.
  • The standby node determining module 305 evaluates the received status information of the standby node, and if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection, the standby node is determined to be an available standby node.
  • When the monitoring module 301 detects that there are multiple disconnected nodes, it determines that the disconnected nodes are all faulty nodes, records the IP addresses of the faulty nodes, and controls the prompt message issuing module 302 to issue the prompt information corresponding to each failed node one by one, so that each failed node can be replaced with a standby node in turn and the service recovery system can repair the faults quickly.
  • The monitoring module 301 controls the prompt message issuing module 302 to issue the prompt information corresponding to each faulty node one by one, specifically as follows:
  • The monitoring module 301 numbers and sorts the faulty nodes according to the preset numbering rule, and the prompt information corresponding to each faulty node is issued one by one in the order of the node numbers, so that all the faulty nodes are replaced by available standby nodes.
  • The preset numbering rule may be to number the nodes according to the logical relationship of the nodes in the distributed system, and then to sort the faulty nodes by node number.
  • The rule can be formulated according to the characteristics of the system.
  • For example, in a splicing wall system, the node corresponding to each screen can be numbered from left to right and from top to bottom, with the node in the upper left corner numbered 1 and the numbers increasing in sequence.
  • The rule can be that each time, the node with the smallest or largest number among the current failed nodes is replaced.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Hardware Redundancy (AREA)

Abstract

A service restoration method and system for a distributed system. The method comprises: when it is detected that network connection of a node is in a disconnected state, determining that the node in the disconnected state is a faulty node, and recording an IP address of the faulty node; performing replacement processing on the faulty node, the replacement processing comprising: sending prompt information indicating that the faulty node is to be replaced with a backup node by means of physical replacement; sending a broadcast instruction for device discovery; receiving status information reported by the backup node after the backup node receives the broadcast instruction for device discovery; determining, according to the status information reported by the backup node, that the backup node is an available backup node; sending to the backup node a modification instruction to modify an IP address to the IP address of the faulty node; and re-establishing network connection to the backup node, and re-allocating a service allocated to the faulty node to the backup node. In the present invention, when a node of a distributed system is faulty, plug-and-play of the backup node is implemented, and a service of the distributed system can be quickly restored.

Description

一种分布式系统的业务恢复方法及系统Method and system for business recovery of distributed system
本申请要求于2019年12月30日提交中国专利局、申请号为201911396738.3、发明名称为“一种分布式系统的业务恢复方法及系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office, the application number is 201911396738.3, and the invention title is "A method and system for business recovery of a distributed system" on December 30, 2019, the entire content of which is incorporated by reference In this application.
技术领域Technical field
本发明涉及分布式系统的技术领域,更具体地,涉及一种分布式系统的业务恢复方法及系统。The present invention relates to the technical field of distributed systems, and more specifically, to a method and system for business recovery of distributed systems.
Background
Distributed video-wall (splicing) display systems and distributed network operator-seat systems are common equipment in control rooms. A distributed video-wall system usually includes multiple distributed splicing nodes; each node drives one portion of the wall's screens, and the nodes together form the complete display system. A distributed network seat system usually has a sending box and a receiving box: the sending box connects to the business computer, the receiving box connects to the keyboard, mouse and monitor, and audio, video, keyboard and mouse data are carried over the network, which separates the operator from the machine and lets one machine drive multiple screens.
A distributed system is made up of individual nodes that communicate with one another by IP address. When a node fails, the existing practice is generally to look up the IP address of the faulty node, connect a backup node to a computer, set the backup node's IP address to that of the faulty node, modify the corresponding configuration according to service needs, and only then swap the backup node in for the faulty node. This sequence takes at least several minutes. When the control room serves an emergency command system, a small lapse can have large consequences, so a delay of several minutes may cause key information to be missed and lead to a wrong decision.
Summary of the Invention
The present invention aims to overcome at least one defect (deficiency) of the above prior art by providing a service restoration method and system for a distributed system that restores system services quickly.
The technical solution adopted by the present invention is a service restoration method for a distributed system, the method comprising:
when it is detected that the network connection of a node is in a disconnected state, determining that the node in the disconnected state is a faulty node, and recording the IP address of the faulty node;
performing replacement processing on the faulty node, the replacement processing comprising:
issuing prompt information indicating that the faulty node is to be replaced with a backup node by physical replacement;
sending a broadcast instruction for device discovery;
receiving status information reported by the backup node after the backup node receives the broadcast instruction for device discovery;
determining, according to the status information reported by the backup node, that the backup node is an available backup node;
sending to the backup node a modification instruction to change its IP address to the IP address of the faulty node; and
re-establishing the network connection with the backup node, and reassigning the services allocated to the faulty node to the backup node.
When a faulty node is detected, the method of the present invention records the IP address of the faulty node, and the faulty node is first physically replaced with a backup node. A broadcast instruction for device discovery is then sent so that the backup node, which is in the standby state, receives it and reports its own status information. The reported status information establishes that the backup node is an available backup node, and a modification instruction then changes the backup node's IP address to that of the faulty node, so that a network connection with the backup node is established and the backup node takes over the services of the faulty node. With this method, after a node of the distributed system fails and is replaced by a new node, the configuration changes are completed automatically online, without any manual configuration, and the distributed system returns to its original working state. The backup node is thus plug-and-play, which simplifies the procedure, lowers its difficulty, and greatly shortens the repair time.
The present invention further provides a service restoration system for a distributed system, the system comprising:
a monitoring module, configured to monitor the network connections with all nodes and, when it is detected that the network connection of a node is in a disconnected state, to determine that the node in the disconnected state is a faulty node and record the IP address of the faulty node;
a prompt information issuing module, configured to issue prompt information indicating that the faulty node is to be replaced with a backup node by physical replacement;
a broadcast module, configured to send a broadcast instruction for device discovery;
a receiving module, configured to receive the status information reported by the backup node after the backup node receives the broadcast instruction for device discovery;
a backup node determining module, configured to determine, according to the status information reported by the backup node, that the backup node is an available backup node;
an IP address modification module, configured to send to the backup node a modification instruction to change its IP address to the IP address of the faulty node; and
a connection establishment module, configured to re-establish the network connection with the backup node and reassign the services allocated to the faulty node to the backup node.
The system of the present invention uses the monitoring module to detect faulty nodes of the distributed system online. When a faulty node is detected, its IP address is recorded, and the prompt information issuing module issues prompt information calling for the faulty node to be physically replaced with a backup node. The broadcast module then sends a broadcast instruction for device discovery, which the backup node that has physically taken the place of the faulty node can receive. The receiving module receives the status information reported by the backup node, and the backup node determining module uses that status information to determine that the backup node is an available backup node. The IP address modification module then sends a modification instruction that changes the backup node's IP address to that of the faulty node. Finally, the connection establishment module establishes a network connection with the backup node, and the backup node takes over the services of the faulty node.
With this system, after a node of the distributed system fails and is replaced by a new node, the configuration changes are completed automatically online, without any manual configuration, and the distributed system returns to its original working state. The backup node is thus plug-and-play, which simplifies the procedure, lowers its difficulty, and greatly shortens the repair time.
Compared with the prior art, the present invention has the following beneficial effects: the method and system can be applied to a distributed system so that, when a node of the distributed system fails, the backup node is plug-and-play without any added cost for redundant equipment. The backup node requires no manual configuration, and the services of the distributed system are restored quickly.
Brief Description of the Drawings
Fig. 1 is a flowchart of a service restoration method for a distributed system according to Embodiment 1 of the present invention.
Fig. 2 is a flowchart of a service restoration method for a distributed system according to Embodiment 2 of the present invention.
Fig. 3 is a block diagram of a service restoration system for a distributed system according to Embodiment 3 of the present invention.
Detailed Description
The drawings are for illustration only and shall not be construed as limiting the present invention. To better illustrate the following embodiments, some components in the drawings may be omitted, enlarged or reduced, and do not represent the dimensions of the actual product; those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.
Embodiment 1
As shown in Fig. 1, the service restoration method for a distributed system of this embodiment includes the following steps:
S101: when it is detected that the network connection of a node is in a disconnected state, determine that the node in the disconnected state is a faulty node, and record the IP address of the faulty node;
S102: issue prompt information indicating that the faulty node is to be replaced with a backup node by physical replacement;
S103: send a broadcast instruction for device discovery;
S104: receive the status information reported by the backup node after the backup node receives the broadcast instruction for device discovery;
S105: determine, according to the status information reported by the backup node, that the backup node is an available backup node;
S106: send to the backup node a modification instruction to change its IP address to the IP address of the faulty node;
S107: re-establish the network connection with the backup node, and reassign the services allocated to the faulty node to the backup node.
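Steps S101-S107 can be sketched as a single control-module routine. This is a minimal in-memory illustration, not the implementation described in the specification: the `Node` class, the `restore_service` function, the `notify` callback and the preset address `192.168.0.250` are all hypothetical, and the broadcast is simulated by iterating over the devices that would hear it.

```python
from dataclasses import dataclass, field

DEFAULT_IP = "192.168.0.250"  # hypothetical preset initial address of a backup node


@dataclass
class Node:
    ip: str
    connected: bool = True  # True once an end-to-end link to the control module exists
    services: list = field(default_factory=list)

    def report_status(self):
        # Status a node reports after hearing the device-discovery broadcast.
        return {"ip": self.ip, "connected": self.connected}


def restore_service(active_nodes, reachable, notify=print):
    """Run S101-S107 once for the first disconnected node.

    Returns the backup node that took over, or None if nothing was replaced.
    """
    # S101: detect a disconnected node and record its IP address.
    faulty = next((n for n in active_nodes if not n.connected), None)
    if faulty is None:
        return None
    faulty_ip = faulty.ip
    # S102: prompt the operator to physically swap in a backup node.
    notify(f"Replace node {faulty_ip} with a backup node")
    # S103/S104: simulated discovery broadcast; collect the status reports.
    reports = [(node, node.report_status()) for node in reachable]
    # S105: available means preset initial IP and no end-to-end connection yet.
    backup = next((node for node, status in reports
                   if status["ip"] == DEFAULT_IP and not status["connected"]), None)
    if backup is None:
        return None
    # S106: the backup node takes over the faulty node's IP address.
    backup.ip = faulty_ip
    # S107: reconnect and move the faulty node's services to the backup node.
    backup.connected = True
    backup.services, faulty.services = faulty.services, []
    active_nodes[active_nodes.index(faulty)] = backup
    return backup
```

In this sketch the faulty node's entry in `active_nodes` is overwritten by the backup node, so the rest of the system keeps addressing the same IP and sees no configuration change.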
When a faulty node is detected, the method of the present invention records the IP address of the faulty node, and the faulty node is first physically replaced with a backup node. A broadcast instruction for device discovery is then sent so that the backup node, which is in the standby state, receives it and reports its own status information. The reported status information establishes that the backup node is an available backup node, and a modification instruction then changes the backup node's IP address to that of the faulty node, so that a network connection with the backup node is established and the backup node takes over the services of the faulty node. With this method, after a node of the distributed system fails and is replaced by a new node, the configuration changes are completed automatically online, without any manual configuration, and the distributed system returns to its original working state. The backup node is thus plug-and-play, which simplifies the procedure, lowers its difficulty, and greatly shortens the repair time.
In a specific implementation, the method of this embodiment may be applied to a control module of the distributed system and is carried out by that control module. The control module may be deployed on a server or on one of the nodes of the distributed system; its purpose is to manage the logical relationships among the nodes and the states of the nodes. The control module establishes network connections in advance with all nodes of the distributed system that are in the primary state. Once these connections are established, the control module performs steps S101-S107 so that, when a node in the primary state fails, a backup node in the standby state can be quickly inserted and configured online, which quickly restores the services of the distributed system.
In an optional implementation, in step S101 of this embodiment, alarm information is also sent to the control end of the distributed system when it is detected that the network connection of a node is in a disconnected state. The alarm alerts the control end of the distributed system so that it can make the corresponding decision.
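The specification does not say how the disconnected state is detected. The following is a minimal sketch that assumes a heartbeat timeout as the criterion; the functions `find_faulty` and `monitor_once`, the `last_seen` table, the `send_alarm` callback and the 3-second timeout are hypothetical choices made for illustration.

```python
def find_faulty(last_seen, now, timeout=3.0):
    """Return the IPs whose last heartbeat is older than `timeout` seconds.

    `last_seen` maps a node IP to the time of its last heartbeat.
    The timeout criterion is an assumption, not taken from the specification.
    """
    return sorted(ip for ip, seen in last_seen.items() if now - seen > timeout)


def monitor_once(last_seen, now, send_alarm, timeout=3.0):
    """One monitoring pass: record the faulty IPs and send one alarm per node."""
    faulty_ips = find_faulty(last_seen, now, timeout)
    for ip in faulty_ips:
        # Alarm to the control end of the distributed system (optional in S101).
        send_alarm(f"node {ip} is disconnected")
    return faulty_ips
```

The returned list is the record of faulty IP addresses that the later steps reuse when the backup node is reconfigured.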
In an optional implementation, the method of this embodiment issues the prompt information in step S102 to indicate that a node has failed, so that the faulty node can be physically replaced with a backup node. Specifically, replacing the faulty node with a backup node by physical replacement means plugging the physical cables that were attached to the faulty node into the backup node; these cables may include one or more of the faulty node's power cable, network cable, video cable and USB cable. The prompt information may include the IP address of the faulty node, which allows the faulty node to be located quickly in practice so that the backup node can quickly take its place at the level of physical connections. This step, however, replaces the node only physically. The backup node's IP address and other configuration have not yet been changed to match those of the faulty node, so the backup node cannot yet stand in for the faulty node or carry out its services; the subsequent steps are still needed to bring the backup node's configuration in line with that of the faulty node.
In an optional implementation, the status information may include the backup node's IP address, its MAC address, and an indication of whether an end-to-end network connection has already been established. The backup node's IP address is a preset initial value; presetting this address allows the backup node to receive the broadcast instruction once it is physically connected and to report its status information from that preset address. Although the backup node has a preset IP address, it has not established an end-to-end connection with the control module of this distributed system or with any other control module. The control module can therefore decide whether a node that reports status information is an available backup node according to whether that node has established an end-to-end connection with it.
Further, step S105 specifically includes:
determining, according to the status information reported by the backup node, whether the backup node's IP address is the preset initial value and whether the backup node has not established an end-to-end network connection; if the IP address is the preset initial value and no end-to-end network connection has been established, the backup node is determined to be an available backup node.
In other words, the control module evaluates the status information it receives from the backup node, and treats the backup node as available only when both conditions hold: its IP address is the preset initial value, and it has not established an end-to-end network connection.
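The two-condition test of step S105 can be written as a small predicate. This is a sketch under stated assumptions: the status report is modeled as a dictionary with hypothetical keys `ip`, `mac` and `connected`, and the preset initial address `192.168.0.250` is a placeholder, since the specification gives no concrete value.

```python
PRESET_INITIAL_IP = "192.168.0.250"  # hypothetical preset initial address


def is_available_backup(status, preset_ip=PRESET_INITIAL_IP):
    """Return True if a reported status describes an available backup node."""
    has_preset_ip = status.get("ip") == preset_ip
    # A report without a "connected" field is treated as connected, i.e. not available.
    not_connected = not status.get("connected", True)
    return has_preset_ip and not_connected


def select_available(reports, preset_ip=PRESET_INITIAL_IP):
    """Filter a list of status reports down to the available backup nodes."""
    return [report for report in reports if is_available_backup(report, preset_ip)]
```

Treating a missing `connected` field as "connected" is a conservative design choice in this sketch: a device that reports incomplete status is never reconfigured.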
Based on the method of this embodiment, the distributed system provides a control module that performs steps S101-S107 and handles the interaction with the backup node, so that services can be restored quickly when a node of the distributed system fails. The method is further described below in terms of the interaction between the control module and the backup node.
First, the control module establishes network connections in advance with all nodes of the distributed system that are in the primary state, and each backup node in the standby state is preconfigured with an initial IP address;
next, the control module begins monitoring all nodes in the primary state. When it detects that the network connection between a primary-state node and the control module is in a disconnected state, it determines that this node is a faulty node, records the IP address of the faulty node, and proactively sends alarm information to the control end of the distributed system;
the control module then issues prompt information indicating that the faulty node needs to be replaced with a backup node by physical replacement;
once the prompt information has been noticed, the power cable, network cable, video cable, USB cable and so on of the faulty node are plugged into the backup node, for example manually, so that the backup node takes the place of the faulty node at the level of physical connections;
after issuing the prompt information, the control module also sends a broadcast instruction for device discovery;
once the backup node has physically taken the place of the faulty node, it can receive the broadcast instruction using its preset initial IP address, and on receiving the broadcast instruction it proactively reports its own status information to the control module;
the control module receives the status information reported by the backup node and uses it to determine whether the backup node's IP address is the preset initial value and whether the backup node has not established an end-to-end network connection with any control module; if both hold, the backup node is determined to be an available backup node;
next, the control module sends a modification instruction to the available backup node, instructing it to change its own IP address to the IP address of the faulty node;
on receiving the modification instruction, the available backup node immediately sets its own IP address according to the parameters carried in the instruction;
after the available backup node has changed its IP address, the control module re-establishes the network connection with it and reassigns the services allocated to the faulty node to the backup node.
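The backup node's side of the exchange above can be sketched as a message handler. The message format is an assumption: the specification does not define one, so the dictionary fields `type`, `ip`, `mac` and `connected`, the message types `discover`, `status`, `set_ip` and `ack`, and the idea of addressing the modification instruction by MAC address are all hypothetical. A real device would also serialize these messages for the network, which is omitted here.

```python
class BackupNode:
    """In-memory model of a backup node reacting to control-module messages."""

    def __init__(self, ip, mac):
        self.ip = ip            # preset initial IP address
        self.mac = mac
        self.connected = False  # no end-to-end connection to any control module yet

    def handle(self, message):
        """Process one message and return the reply, or None if it is ignored."""
        if message.get("type") == "discover":
            # Device-discovery broadcast: report this node's own status.
            return {"type": "status", "ip": self.ip, "mac": self.mac,
                    "connected": self.connected}
        if message.get("type") == "set_ip":
            # Modification instruction: apply it only if addressed to this device
            # (matching on MAC address is an assumption of this sketch).
            if message.get("mac") != self.mac:
                return None
            self.ip = message["ip"]
            return {"type": "ack", "ip": self.ip}
        return None
```

Matching on the MAC address keeps a second backup node that still carries the same preset IP address from applying an instruction meant for another device.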
By making the backup node plug-and-play, the method of the present invention restores the services of the distributed system quickly, simplifies the configuration of the backup node, and greatly shortens the repair time.
Embodiment 2
Unlike Embodiment 1, the service restoration method of this embodiment addresses how to repair multiple faulty nodes when step S101 of Embodiment 1 detects more than one faulty node.
As shown in Fig. 2, the service restoration method for a distributed system of this embodiment includes the following steps:
S201: when it is detected that the network connections of nodes are in a disconnected state and more than one node is in the disconnected state, determine that all of these disconnected nodes are faulty nodes, and record the IP address of each faulty node;
S202: number and sort the faulty nodes according to a preset numbering rule;
S203: following the sorted order of the faulty nodes, repeat steps S204-S209 below until every faulty node has been replaced by an available backup node;
S204: issue prompt information indicating that the faulty node is to be replaced with a backup node by physical replacement;
S205: send a broadcast instruction for device discovery;
S206: receive the status information reported by the backup node after the backup node receives the broadcast instruction for device discovery;
S207: determine, according to the status information reported by the backup node, that the backup node is an available backup node;
S208: send to the backup node a modification instruction to change its IP address to the IP address of the faulty node;
S209: re-establish the network connection with the backup node, and reassign the services allocated to the faulty node to the backup node.
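The loop of steps S202-S209 can be sketched as follows. This is an illustration under assumptions, not the specification's implementation: `restore_all`, the `node_number` mapping and the `replace_one` callback (which stands for one pass of S204-S209) are hypothetical names, and stopping at the first failed replacement is a choice made here that the specification does not state.

```python
def restore_all(faulty_ips, node_number, replace_one):
    """Replace faulty nodes one at a time in numbering order.

    faulty_ips:  IP addresses recorded in S201.
    node_number: mapping from IP address to the node's preset number.
    replace_one: callable(ip) -> bool performing S204-S209 for one node.
    Returns the IPs that were restored, in the order they were handled.
    """
    # S202: sort the faulty nodes by their preset numbers.
    ordered = sorted(faulty_ips, key=lambda ip: node_number[ip])
    restored = []
    # S203: handle exactly one faulty node per iteration.
    for ip in ordered:
        if not replace_one(ip):
            break  # assumed behavior: stop so the operator can intervene
        restored.append(ip)
    return restored
```

Handling one node per iteration matters because every backup node starts with the same preset IP address; the order tells the operator which faulty device the next backup node must physically replace.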
In an optional implementation, the preset numbering rule may be to number the nodes according to the logical relationship among the primary-state nodes of the distributed system, and then to sort the faulty nodes by node number.
Because multiple nodes have failed at the same time, a rule is needed to avoid replacing the wrong device. The rule can be based on the characteristics of the system. In a video-wall system, for example, the screens are arranged in rows and columns, so the node behind each screen can be numbered from left to right and from top to bottom: the node in the upper-left corner is node 1, and the numbers increase from there. The rule can then be to always replace the lowest-numbered (or the highest-numbered) node among the currently faulty nodes.
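The video-wall numbering rule described above can be sketched as follows. The function names `number_wall_nodes` and `next_to_replace`, and the use of zero-based `(row, column)` tuples to identify screens, are assumptions made for illustration.

```python
def number_wall_nodes(rows, cols):
    """Number the nodes of a rows x cols wall left to right, top to bottom.

    Returns a mapping from zero-based (row, column) to node number,
    with number 1 at the upper-left corner.
    """
    return {(r, c): r * cols + c + 1 for r in range(rows) for c in range(cols)}


def next_to_replace(faulty_positions, numbering, smallest_first=True):
    """Pick the faulty node to replace next: lowest (or highest) number first."""
    pick = min if smallest_first else max
    return pick(faulty_positions, key=lambda position: numbering[position])
```

For a wall of 2 rows and 3 columns, the upper-right screen is node 3 and the lower-left screen is node 4, so with the lowest-number rule the upper-right node is replaced first if both fail.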
Embodiment 3
The present invention further provides a service restoration system for a distributed system. As shown in Fig. 3, the service restoration system of this embodiment specifically includes:
a monitoring module 301, configured to monitor the network connections with all nodes and, when it is detected that the network connection of a node is in a disconnected state, to determine that the node in the disconnected state is a faulty node and record the IP address of the faulty node;
a prompt information issuing module 302, configured to issue prompt information indicating that the faulty node is to be replaced with a backup node by physical replacement;
a broadcast module 303, configured to send a broadcast instruction for device discovery;
a receiving module 304, configured to receive the status information reported by the backup node after the backup node receives the broadcast instruction for device discovery;
a backup node determining module 305, configured to determine, according to the status information reported by the backup node, that the backup node is an available backup node;
an IP address modification module 306, configured to send to the backup node a modification instruction to change its IP address to the IP address of the faulty node;
a connection establishment module 307, configured to re-establish the network connection with the backup node and reassign the services allocated to the faulty node to the backup node.
The system of the present invention uses the monitoring module 301 to detect faulty nodes of the distributed system online. When a faulty node is detected, its IP address is recorded, and the prompt information issuing module 302 issues prompt information calling for the faulty node to be physically replaced with a backup node. The broadcast module 303 then sends a broadcast instruction for device discovery, which the backup node that has physically taken the place of the faulty node can receive. The receiving module 304 receives the status information reported by the backup node, and the backup node determining module 305 uses that status information to determine that the backup node is an available backup node. The IP address modification module 306 then sends a modification instruction that changes the backup node's IP address to that of the faulty node. Finally, the connection establishment module 307 establishes a network connection with the backup node, and the backup node takes over the services of the faulty node.
With this system, after a node of the distributed system fails and is replaced by a new node, the configuration changes are completed automatically online, without any manual configuration, and the distributed system returns to its original working state. The backup node is thus plug-and-play, which simplifies the procedure, lowers its difficulty, and greatly shortens the repair time.
In a specific implementation, the service restoration system of this embodiment may be applied to a control module of the distributed system, with the modules of this embodiment provided on the control module. The control module may be deployed on a server or on one of the nodes of the distributed system; its purpose is to manage the logical relationships among the nodes and the states of the nodes. With these modules provided on the control module, when a node in the primary state fails, a backup node in the standby state can be quickly inserted and configured online, which quickly restores the services of the distributed system.
In an optional implementation, the service restoration system of this embodiment further includes an alarm information sending module, configured to send alarm information to the control end of the distributed system when the monitoring module 301 detects that the network connection of a node is in a disconnected state. The alarm alerts the control end of the distributed system so that it can make the corresponding decision.
在一种可选的实施方式中,本实施例通过提示信息发出模块302发出提示信息的方式提示有故障节点发生故障,从而可以将备用节点在物理上替换掉所述故障节点。具体的,备用节点通过物理替换方式替换所述故障节点指的是将外接所述故障节点的物理接线插入到所述备用节点,所述的物理接线可以包括故障节点的电源线、网线、视频线、USB线中的一种或多种。所述提示信息中可以包括故障节点的IP地址,根据故障节点的IP地址在实际操作中可以快速追踪到所述故障节点,从而在物理连接上使备用节点可以快速替换故障节点。但此替换仅仅是物理连接上,此时备用节点的IP地址等相关配置信息由于没有修改使其和故障节点一致,在实际使用上,备用节点仍无法替换故障节点而且仍无法代替故障节点执行故障节点原有的业务,此时仍需利用其他模块完成相应的功能使得备用节点的配置信息和故障节点一致。In an optional implementation manner, in this embodiment, a prompt message is issued by the prompt message issuing module 302 to prompt that a faulty node has failed, so that the backup node can be physically replaced with the faulty node. Specifically, the replacement of the faulty node by the standby node by means of physical replacement refers to inserting the physical wiring external to the faulty node into the standby node, and the physical wiring may include the power cord, network cable, and video cable of the faulty node. , One or more of the USB cables. The prompt information may include the IP address of the failed node, and the failed node can be quickly traced in actual operations according to the IP address of the failed node, so that the backup node can quickly replace the failed node on the physical connection. However, this replacement is only on the physical connection. At this time, the IP address and other related configuration information of the standby node are not modified to make it consistent with the failed node. In actual use, the standby node still cannot replace the failed node and cannot replace the failed node to perform the failure. For the original business of the node, other modules still need to be used to complete the corresponding functions at this time to make the configuration information of the standby node consistent with the failed node.
In an optional implementation, the status information may include the IP address and MAC address of the standby node, together with status information indicating whether an end-to-end network connection has been established. The IP address of the standby node is a preset initial value. Presetting this address allows the standby node to receive broadcast instructions once it is physically connected and to report its status information through that address. Although the standby node has a preset IP address, it has not established an end-to-end connection with the service recovery system of the distributed system or with any other service recovery system. The standby node determining module 305 of the service recovery system can therefore decide whether a node that reports status information is an available standby node according to whether that node has established an end-to-end connection with the service recovery system.
Further, the standby node determining module 305 is specifically configured to:
determine, according to the status information reported by the standby node, whether the IP address of the standby node is the preset initial value and whether the standby node has not established an end-to-end network connection with the service recovery system; if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection with the service recovery system, determine that the standby node is an available standby node.
That is, the standby node determining module 305 evaluates the received status information of the standby node, and if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection, it determines that the standby node is an available standby node.
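The availability check performed by the standby node determining module 305 can be sketched in Python as follows. This is only an illustration under stated assumptions: the `StatusInfo` fields, the `PRESET_INITIAL_IP` value, and the sample reports are hypothetical and do not come from the source.

```python
from dataclasses import dataclass

# Assumed preset initial IP address of an unconfigured standby node.
PRESET_INITIAL_IP = "192.168.1.250"


@dataclass(frozen=True)
class StatusInfo:
    """Hypothetical status report sent in reply to a device discovery broadcast."""
    ip: str
    mac: str
    connected: bool  # True if an end-to-end connection is already established


def is_available_standby(status: StatusInfo,
                         preset_ip: str = PRESET_INITIAL_IP) -> bool:
    """A node is an available standby only if BOTH conditions hold:
    its IP is still the preset initial value, and it has no
    end-to-end connection with the service recovery system."""
    return status.ip == preset_ip and not status.connected


# Sample reports (made-up values): only the first satisfies both conditions.
reports = [
    StatusInfo(ip="192.168.1.250", mac="00:11:22:33:44:01", connected=False),
    StatusInfo(ip="192.168.1.250", mac="00:11:22:33:44:02", connected=True),
    StatusInfo(ip="192.168.1.17", mac="00:11:22:33:44:03", connected=False),
]
available = [s.mac for s in reports if is_available_standby(s)]
```

Requiring both conditions filters out working nodes that also answer the broadcast, as well as standby nodes that are already in use.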
In an optional implementation, when the monitoring module 301 detects that multiple nodes are disconnected, it determines that all of these disconnected nodes are faulty nodes, records the IP address of each faulty node, and controls the prompt information issuing module 302 to issue, one by one, the prompt information corresponding to each faulty node. Each faulty node can thus be replaced by a standby node in turn, so that the system can be repaired quickly.
In an optional implementation, the monitoring module 301 controls the prompt information issuing module 302 to issue, one by one, the prompt information corresponding to each faulty node, specifically as follows:
The monitoring module 301 numbers and sorts the multiple faulty nodes according to a preset numbering rule, and then issues the prompt information corresponding to each faulty node one by one in that order, so that every faulty node is replaced by an available standby node.
The preset numbering rule may number the nodes according to the logical relationship among the nodes in the distributed system, and then sort the faulty nodes by node number.
Because multiple nodes have failed at the same time, a rule must be defined to avoid replacing the wrong device. The rule can be based on the characteristics of the system. For example, because a video wall system has a row and column ordering, the node corresponding to each screen can be numbered from left to right and from top to bottom, with the node in the upper-left corner numbered 1 and subsequent numbers increasing in sequence. The rule may be to always replace the node with the smallest (or the largest) number among the current faulty nodes.
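The video wall numbering rule and the resulting replacement order can be sketched in Python as follows. The function names, the zero-based (row, column) representation, and the 3 x 4 wall used in the example are assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple


def node_number(row: int, col: int, columns: int) -> int:
    """Number a screen's node left to right, top to bottom.
    Rows and columns are zero-based; the upper-left node is number 1."""
    return row * columns + col + 1


def replacement_order(faulty_positions: Iterable[Tuple[int, int]],
                      columns: int,
                      smallest_first: bool = True) -> List[int]:
    """Return faulty node numbers in the order they should be replaced:
    always the smallest (or, optionally, the largest) number next."""
    numbers = sorted(node_number(r, c, columns) for r, c in faulty_positions)
    return numbers if smallest_first else numbers[::-1]


# Assumed example: a wall with 4 columns and three faulty screens.
order = replacement_order([(2, 1), (0, 3), (1, 0)], columns=4)
```

Here the faulty screens map to node numbers 10, 4, and 5, so prompts would be issued for nodes 4, 5, and 10 in that order. A fixed, predictable order lets the operator match each prompt to exactly one physical device.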
Obviously, the above embodiments of the present invention are merely examples given to illustrate the technical solution clearly, and are not intended to limit the specific implementations of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the claims of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (10)

  1. A service recovery method for a distributed system, characterized in that the method comprises:
    when it is detected that the network connection of a node is disconnected, determining that the disconnected node is a faulty node, and recording the IP address of the faulty node;
    performing replacement processing on the faulty node, the replacement processing comprising:
    issuing prompt information for replacing the faulty node with a standby node by physical replacement;
    sending a device discovery broadcast instruction;
    receiving the status information reported by the standby node after it receives the device discovery broadcast instruction;
    determining, according to the status information reported by the standby node, that the standby node is an available standby node;
    sending to the standby node a modification instruction to change its IP address to the IP address of the faulty node;
    re-establishing the network connection with the standby node and reassigning to the standby node the services assigned to the faulty node.
  2. The service recovery method for a distributed system according to claim 1, characterized in that the method further comprises:
    when it is detected that the network connection of a node is disconnected, also sending alarm information to the control end of the distributed system.
  3. The service recovery method for a distributed system according to claim 1, characterized in that the physical replacement is specifically:
    plugging the physical cables attached to the faulty node into the standby node.
  4. The service recovery method for a distributed system according to claim 1, characterized in that the status information includes the IP address and MAC address of the standby node and status information indicating whether an end-to-end network connection has been established.
  5. The service recovery method for a distributed system according to claim 4, characterized in that determining, according to the status information reported by the standby node, that the standby node is an available standby node is specifically:
    determining, according to the status information reported by the standby node, whether the IP address of the standby node is a preset initial value and whether the standby node has not established an end-to-end network connection; if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection, determining that the standby node is an available standby node.
  6. The service recovery method for a distributed system according to claim 1, characterized in that when it is detected that the network connections of multiple nodes are disconnected, all of the disconnected nodes are determined to be faulty nodes, the IP addresses of the multiple faulty nodes are recorded, and the replacement processing is performed on each of the faulty nodes.
  7. A service recovery system for a distributed system, characterized in that it comprises:
    a monitoring module, configured to monitor the network connections with all nodes and, when it is detected that the network connection of a node is disconnected, to determine that the disconnected node is a faulty node and record the IP address of the faulty node;
    a prompt information issuing module, configured to issue prompt information for replacing the faulty node with a standby node by physical replacement;
    a broadcast module, configured to send a device discovery broadcast instruction;
    a receiving module, configured to receive the status information reported by the standby node after it receives the device discovery broadcast instruction;
    a standby node determining module, configured to determine, according to the status information reported by the standby node, that the standby node is an available standby node;
    an IP address modification module, configured to send to the standby node a modification instruction to change its IP address to the IP address of the faulty node;
    a connection establishment module, configured to re-establish the network connection with the standby node and reassign to the standby node the services assigned to the faulty node.
  8. The service recovery system for a distributed system according to claim 7, characterized in that the status information includes the IP address and MAC address of the standby node and status information indicating whether an end-to-end network connection has been established with the service recovery system.
  9. The service recovery system for a distributed system according to claim 8, characterized in that the standby node determining module is specifically configured to:
    determine, according to the status information reported by the standby node, whether the IP address of the standby node is a preset initial value and whether the standby node has not established an end-to-end network connection with the service recovery system; if the IP address of the standby node is the preset initial value and the standby node has not established an end-to-end network connection with the service recovery system, determine that the standby node is an available standby node.
  10. The service recovery system for a distributed system according to claim 8, characterized in that the monitoring module is specifically configured to:
    when it is detected that the network connections of multiple nodes are disconnected, determine that all of the disconnected nodes are faulty nodes, record the IP addresses of the multiple faulty nodes, and control the prompt information issuing module to issue, one by one, the prompt information corresponding to each faulty node.
PCT/CN2020/141371 2019-12-30 2020-12-30 Service restoration method and system for distributed system WO2021136370A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911396738.3 2019-12-30
CN201911396738.3A CN111130899A (en) 2019-12-30 2019-12-30 Service recovery method and system for distributed system

Publications (1)

Publication Number Publication Date
WO2021136370A1 true WO2021136370A1 (en) 2021-07-08

Family

ID=70505248

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141371 WO2021136370A1 (en) 2019-12-30 2020-12-30 Service restoration method and system for distributed system

Country Status (2)

Country Link
CN (1) CN111130899A (en)
WO (1) WO2021136370A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111130899A (en) * 2019-12-30 2020-05-08 威创集团股份有限公司 Service recovery method and system for distributed system
CN115002001B (en) * 2022-02-25 2023-08-04 苏州浪潮智能科技有限公司 Method, device, equipment and medium for detecting sub-health of cluster network
CN116743752A (en) * 2023-08-11 2023-09-12 山东恒宇电子有限公司 System for realizing data processing load balance by distributed network communication

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105406980A (en) * 2015-10-19 2016-03-16 浪潮(北京)电子信息产业有限公司 Multi-node backup method and multi-node backup device
CN109151028A (en) * 2018-08-23 2019-01-04 郑州云海信息技术有限公司 A kind of distributed memory system disaster recovery method and device
WO2019049433A1 (en) * 2017-09-06 2019-03-14 日本電気株式会社 Cluster system, cluster system control method, server device, control method, and non-transitory computer-readable medium having program stored therein
CN110572275A (en) * 2019-08-01 2019-12-13 新华三技术有限公司成都分公司 Network card switching method and device, server and computer readable storage medium
CN111130899A (en) * 2019-12-30 2020-05-08 威创集团股份有限公司 Service recovery method and system for distributed system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09244910A (en) * 1996-03-11 1997-09-19 Nippon Steel Corp Backup method for decentralized control system
CN107145306B (en) * 2017-04-27 2020-08-21 杭州哲信信息技术有限公司 Distributed data storage method and system

Also Published As

Publication number Publication date
CN111130899A (en) 2020-05-08

Similar Documents

Publication Publication Date Title
WO2021136370A1 (en) Service restoration method and system for distributed system
CN103607297B (en) Fault processing method of computer cluster system
US11307943B2 (en) Disaster recovery deployment method, apparatus, and system
JPH07221782A (en) Network centralized supervisory equipment
CN111158962B (en) Remote disaster recovery method, device and system, electronic equipment and storage medium
CN107147540A (en) Fault handling method and troubleshooting cluster in highly available system
CN105681077A (en) Fault processing method, device and system
CN104038376A (en) Method and device for managing real servers and LVS clustering system
CN104469181B (en) Audio and video matrix switch method based on PIS
CN105933407A (en) Method and system for achieving high availability of Redis cluster
CN111464601A (en) Node service scheduling system and method
CN103701655A (en) Fault self-diagnosis and self-recovery method and system for interchanger
CN105915426A (en) Failure recovery method and device of ring network
CN106294795A (en) A kind of data base's changing method and system
JP2020088470A (en) Information processing apparatus, network system and teaming program
CN104243304B (en) The data processing method of non-full-mesh topological structure, equipment and system
CN102487332B (en) Fault processing method, apparatus thereof and system thereof
CN106657390A (en) Cluster file system directory isolation method, cluster file system directory isolation device and cluster file system directory isolation system
JPH05260049A (en) Fault managing method for network system
CN106294030A (en) Storage redundancy approach based on server virtualization system and device
CN106027313A (en) Disaster tolerance system and method of network link based on VPN (Virtual Private Network)
JP5225166B2 (en) Power system monitoring system and power system monitoring method
CN108829570A (en) Server node information display control method, device, system and storage medium
CN113946474A (en) Efficient disaster tolerance protection method and disaster tolerance processing system for storage system
CN101106461B (en) Control method for status management computer of communication device line clamp

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20909901

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20909901

Country of ref document: EP

Kind code of ref document: A1