CN116232864A - Multi-machine hot backup method and system for network system based on event controller - Google Patents

Publication number
CN116232864A
Authority
CN
China
Prior art keywords
machine
standby
working
host
timing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310491209.1A
Other languages
Chinese (zh)
Other versions
CN116232864B (en)
Inventor
朱珂
陈培岩
张明伟
常超
张波
肖峰
闻亮
毛英杰
徐涛
高庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingxin Microelectronics Technology Tianjin Co Ltd
Original Assignee
Jingxin Microelectronics Technology Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingxin Microelectronics Technology Tianjin Co Ltd
Priority to CN202310491209.1A
Publication of CN116232864A
Application granted
Publication of CN116232864B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Maintenance And Management Of Digital Transmission (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention belongs to the technical field of network data processing and digital information transmission, and in particular relates to a multi-machine hot backup method and system for a network system based on event control symbols. The method comprises a normal working mode and an abnormal working mode.

Description

Multi-machine hot backup method and system for network system based on event controller
Technical Field
The invention belongs to the technical field of network data processing and digital information transmission, and particularly relates to a multi-machine hot backup method and system of a network system based on an event controller.
Background
RapidIO is a high-performance, low-pin-count, packet-switched interconnect architecture. For high-performance embedded communication systems, the RapidIO protocol offers high bandwidth, low latency, high flexibility, and high reliability, which makes it a preferred choice among embedded interconnect technologies. A RapidIO network typically comprises endpoint devices (PE, Processing Element), which generate, send, and process packets, and switching devices (switches), which receive and forward them. One endpoint device usually serves as the host node, which performs network maintenance tasks such as initial enumeration, route deployment, and fault management of the RapidIO network.
From a reliability standpoint, when the host itself or its connection to the RapidIO network fails, a hot backup mechanism is needed to keep the network operating normally. The mainstream approach today is dual-machine hot backup, comprising one host and one standby machine: when the host fails, the standby machine takes over promptly so that service and management of the RapidIO network do not go out of control. Common implementations rely on a third-party arbitration mechanism, or carry the host-standby heartbeat over external hardware or over RapidIO messages. Each has clear drawbacks. A scheme based on third-party arbitration depends entirely on the reliability of the arbiter and therefore does not improve the robustness of the system itself. A heartbeat over external hardware adds hardware cost and requires a dedicated hardware path between host and standby, which severely constrains the network topology. A heartbeat over RapidIO messages occupies channel resources in the network, is prone to routing configuration conflicts, and makes it hard to guarantee the forwarding priority of data packets.
Given the characteristics of RapidIO networks and practical application scenarios, a dual-machine hot backup system still cannot provide sufficient reliability, particularly when the host takes part in service interaction and frequently joins and leaves the network dynamically, or when, under certain extreme conditions, the host and the standby machine fail in succession. The problem can be addressed by increasing the number of standby machines, that is, pairing one host with several standby machines. With the existing techniques, however, this tends to raise cost and system complexity: either a large amount of external hardware is needed, making the RapidIO network topology even more rigid, or more channel resources are occupied and routing configuration becomes more complex.
In short, the prior art lacks a low-cost, reliable hot backup mechanism with multiple standby machines, which leaves the network system with poor fault tolerance and high complexity and affects the interaction of data services.
Disclosure of Invention
The invention provides a multi-machine hot backup method and system for a network system based on event control symbols, to solve the problem described in the background: the prior art lacks a low-cost, reliable hot backup mechanism with multiple standby machines, so the network system has poor fault tolerance and high complexity, and the interaction of data services is affected.
The technical problem is solved by the following technical scheme. The multi-machine hot backup method for a network system based on event control symbols comprises:
a multi-machine network with one host and multiple standby machines, based on the RapidIO packet-switched interconnect architecture;
normal operation mode: once network enumeration is complete, the current host selects and wakes up one standby machine, which becomes the first working standby machine, and establishes heartbeat communication with it through multicast event control symbols;
abnormal operation mode: if the host in communication fails, the first working standby machine takes over from the current host and becomes the working host; the working host selects and wakes up another standby machine, which becomes the second working standby machine, and heartbeat communication is re-established through multicast event control symbols.
Further, the normal operation mode further includes:
in the initial stage of the system, the current host initiates a network enumeration operation to detect and discover all standby machines in the multi-machine network, and establishes RapidIO paths between itself and all of them.
Further, the normal operation mode further includes:
the most suitable standby machine is selected and woken up, based on a combined assessment of the multi-machine network topology and the physical capabilities of each standby machine, and is designated the first working standby machine.
Further, the normal operation mode further includes:
if the standby machine is woken up successfully, the complete heartbeat communication path between the current host and the first working standby machine is determined; along this path, the switching devices and the first working standby machine are configured one by one through maintenance packets, enabling multicast event control symbol transmission on the relevant ports, so that a multicast event control symbol transfer path is established between the current host and the first working standby machine;
if the standby machine is not woken up, the current host selects and wakes up a standby machine again.
Further, the normal operation mode further includes:
based on the control transmission period, the current host transmits multicast event control symbols to the first working standby machine;
when the first working standby machine receives the first multicast event control symbol, it starts the first control transmission timing and begins periodically detecting multicast event control symbols;
when the first working standby machine receives the second multicast event control symbol, it starts the second control transmission timing and continues periodic detection;
when the first working standby machine receives the third multicast event control symbol, it starts the third control transmission timing and continues periodic detection;
and so on;
when the first working standby machine receives the Nth multicast event control symbol, it starts the Nth control transmission timing;
the first, second, third, and subsequent control transmission timings up to the Nth are averaged to form the average transmission timing, which is taken as the host heartbeat period, namely:
$$T_a = \frac{T_{a1} + T_{a2} + T_{a3} + \cdots + T_{aN}}{N}$$
where $T_{a1}$ is the first control transmission timing;
$T_{a2}$ is the second control transmission timing;
$T_{a3}$ is the third control transmission timing;
$T_{aN}$ is the Nth control transmission timing;
and N is the number of control transmission timings.
Further, the normal operation mode further includes:
average transmit timing threshold function:
Figure SMS_2
where Tab+ is the upper threshold of the average transmission timing;
Tab- is the lower threshold of the average transmission timing;
Ta is the average transmission timing;
and Tg is the transmission timing error, whose value is determined according to the network transmission rate.
Further, the normal operation mode further includes:
heartbeat loss judgment function:
Figure SMS_3
Further, the normal operation mode further includes:
based on the control transmission period, the current host transmits multicast event control symbols to the first working standby machine;
when the first working standby machine receives the first multicast event control symbol, it makes the first fault transmission time record;
when the first working standby machine receives the second multicast event control symbol, it makes the second fault transmission time record;
the interval between the first and second fault transmission time records forms the transmission interval timing, which is taken as the fault interval period.
Further, the normal operation mode further includes:
transmission interval timing function:
QT=T2-T1;
where QT is the transmission interval timing;
T2 is the second fault transmission time record;
T1 is the first fault transmission time record;
host fault threshold function:
Figure SMS_4
where TAB+ is the upper limit of the fault interval period threshold;
TAB- is the lower limit of the fault interval period threshold;
TA is the average interval timing;
TG is the fault interval error, whose value is determined according to the network transmission rate;
host fault determination function:
Figure SMS_5
The invention also provides a multi-machine hot backup system for a network system based on event control symbols. The system comprises a multi-machine hot backup platform that implements the multi-machine hot backup method, and the platform comprises a normal working module and an abnormal working module;
the normal working module is used for: once network enumeration is complete, having the current host select and wake up one standby machine, which becomes the first working standby machine, and establishing heartbeat communication through multicast event control symbols;
the abnormal working module is used for: if the host in communication fails, having the first working standby machine take over from the current host and become the working host, which selects and wakes up another standby machine as the second working standby machine, and re-establishing heartbeat communication through multicast event control symbols.
The beneficial technical effects are as follows:
The scheme uses a multi-machine network with one host and multiple standby machines, based on the RapidIO packet-switched interconnect architecture. In the normal operation mode, once network enumeration is complete, the current host selects and wakes up one standby machine as the first working standby machine and establishes heartbeat communication with it through multicast event control symbols. In the abnormal operation mode, if the host in communication fails, the first working standby machine takes over and becomes the working host, which selects and wakes up another standby machine as the second working standby machine and re-establishes heartbeat communication through multicast event control symbols. Thus, when the host fails, the current standby machine takes over the network as the new host and wakes up a dormant standby machine to establish a new host-standby heartbeat. This yields a reliable multi-standby hot backup mechanism at low cost and greatly improves the fault tolerance of the RapidIO network system, without increasing system complexity or affecting the interaction of data services.
Drawings
To illustrate the embodiments of the invention or the prior-art technical solutions more clearly, the drawings needed for their description are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a general flow chart of the multi-machine hot backup method of the present invention;
FIG. 2 is a main flow chart of the multi-machine hot backup method of the present invention;
FIG. 3 is a flow chart of the multi-machine hot backup method of the present invention;
FIG. 4 is a schematic structural diagram of a first embodiment of the present invention.
Description of the embodiments
The invention is further described below with reference to the accompanying drawings:
In the figures:
S101: normal operation mode;
S102: abnormal operation mode;
S1001: a multi-machine network with one host and multiple standby machines, based on the RapidIO packet-switched interconnect architecture;
S1002: once network enumeration is complete, the current host selects and wakes up a standby machine as the first working standby machine and establishes heartbeat communication through multicast event control symbols;
S1003: if the host in communication fails, the first working standby machine takes over from the current host and becomes the working host, the working host selects and wakes up another standby machine as the second working standby machine, and heartbeat communication is re-established through multicast event control symbols.
Examples
This embodiment:
The multi-machine hot backup method for a network system based on event control symbols comprises:
a multi-machine network S1001 with one host and multiple standby machines, based on the RapidIO packet-switched interconnect architecture;
normal operation mode S101: once network enumeration is complete, the current host selects and wakes up one standby machine, which becomes the first working standby machine, and establishes heartbeat communication through multicast event control symbols S1002;
abnormal operation mode S102: if the host in communication fails, the first working standby machine takes over from the current host and becomes the working host; the working host selects and wakes up another standby machine as the second working standby machine, and heartbeat communication is re-established through multicast event control symbols S1003.
With this one-host, multiple-standby network and the two operation modes above, when the host fails the current working standby machine takes over the network and becomes the new host, then wakes up a dormant standby machine and establishes a new host-standby heartbeat. This yields a reliable multi-standby hot backup mechanism at low cost and greatly improves the fault tolerance of the RapidIO network system, without increasing system complexity or affecting the interaction of data services.
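The role switching between the two modes can be sketched as a small state model. This is a minimal illustration only: the class names `Node` and `Cluster`, the role strings, and the "first dormant standby" wake-up rule are assumptions made for the example and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    role: str = "dormant"  # hypothetical roles: host, working_standby, dormant, failed


@dataclass
class Cluster:
    host: Node
    standbys: List[Node] = field(default_factory=list)
    working_standby: Optional[Node] = None

    def wake_next_standby(self) -> Optional[Node]:
        """Normal mode: wake the first dormant standby and make it the working standby."""
        for node in self.standbys:
            if node.role == "dormant":
                node.role = "working_standby"
                self.working_standby = node
                return node
        self.working_standby = None
        return None

    def on_host_failure(self) -> Node:
        """Abnormal mode: the working standby becomes host, then wakes a new standby."""
        if self.working_standby is None:
            raise RuntimeError("no working standby available to take over")
        self.host.role = "failed"
        new_host = self.working_standby
        new_host.role = "host"
        self.host = new_host
        self.wake_next_standby()
        return new_host
```

Each failure consumes one standby machine, so the model keeps tolerating host failures until no dormant standby is left.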
The normal operation mode S101 further includes:
in the initial stage of the system, the current host initiates a network enumeration operation to detect and discover all standby machines in the multi-machine network, and establishes RapidIO paths between itself and all of them.
In this step, one of the endpoint devices is selected as the master processing node (the host), responsible for initial enumeration, configuration deployment, fault management, and so on for the whole RapidIO network. In the initial stage of the system, the host initiates network enumeration to detect and discover the devices of the whole RapidIO network; at this point RapidIO paths are established between the host and all switching devices and endpoint devices.
The normal operation mode S101 further includes:
the most suitable standby machine is selected and woken up, based on a combined assessment of the multi-machine network topology and the physical capabilities of each standby machine, and is designated the first working standby machine.
The selection rule for the standby machine is not specifically limited; any endpoint device with a network management function qualifies. In a real network the choice can be made by weighing the topology, the physical capabilities of the endpoint devices, and similar factors. The topology criterion can be the number of switches between the standby machine and the host: the fewer, the higher the priority. The physical-capability criterion can be the strength of the endpoint device's network management function: the stronger, the higher the priority. The host wakes up the standby machine immediately after selection; if the wake-up fails, selection and wake-up are performed again.
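The selection rule can be sketched as a ranking followed by wake-up attempts. The text leaves the way the two criteria are combined open, so this sketch assumes a simple ordering (fewer switches first, then stronger management capability); the `Candidate` fields, the capability score, and the `try_wake` callback are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    switch_hops: int      # number of switches between this standby and the host
    mgmt_capability: int  # hypothetical score; larger means stronger network management


def rank(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Order candidates: fewer switch hops first, then higher capability (assumed rule)."""
    return sorted(candidates, key=lambda c: (c.switch_hops, -c.mgmt_capability))


def select_and_wake(
    candidates: Iterable[Candidate],
    try_wake: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Try candidates in priority order; on wake-up failure, move on to the next one."""
    for cand in rank(candidates):
        if try_wake(cand):
            return cand
    return None
```

Returning `None` when every wake-up fails leaves the caller to decide whether to retry later or raise an alarm.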
The normal operation mode S101 further includes:
if the standby machine is woken up successfully, the complete heartbeat communication path between the current host and the first working standby machine is determined; along this path, the switching devices and the first working standby machine are configured one by one through maintenance packets, enabling multicast event control symbol transmission on the relevant ports, so that a multicast event control symbol transfer path is established between the current host and the first working standby machine;
if the standby machine is not woken up, the current host selects and wakes up a standby machine again.
After the standby machine is woken up, the complete heartbeat communication path P between host and standby is determined; it consists mainly of the intermediate switching devices and the corresponding ports along the path. Starting from the switch directly connected to the host, each switching device on path P is then configured in turn through maintenance packets: on each switch, multicast event control symbol transmission is enabled on the port that connects to the next switch (or to the standby machine). This establishes the multicast event control symbol transfer path between host and standby. Owing to the characteristics of multicast event control symbols, the host and the standby can take part in specific data service transmission as needed while performing network management, which improves network throughput and the utilization of system resources.
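The hop-by-hop configuration of path P can be sketched as a loop over the switches on the path. The `enable_port` callback stands in for whatever maintenance-packet write a real system would issue; it is a hypothetical placeholder, not a RapidIO library call, and stopping at the first failed hop is an assumption of this sketch.

```python
from typing import Callable, List, Tuple

# (switch name, egress port toward the next switch or the standby machine)
Hop = Tuple[str, int]


def configure_heartbeat_path(
    path: List[Hop],
    enable_port: Callable[[str, int], bool],
) -> List[Hop]:
    """Enable multicast event control symbol transmission hop by hop.

    The path is ordered from the switch directly connected to the host
    toward the standby machine. Configuration stops at the first hop that
    fails, and the hops configured so far are returned.
    """
    configured: List[Hop] = []
    for switch, port in path:
        if not enable_port(switch, port):
            break
        configured.append((switch, port))
    return configured
```

If the returned list is shorter than `path`, the transfer path is incomplete and the caller would need to retry or choose another standby machine.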
The normal operation mode S101 further includes:
based on the control transmission period, the current host transmits multicast event control symbols to the first working standby machine;
when the first working standby machine receives the first multicast event control symbol, it starts the first control transmission timing and begins periodically detecting multicast event control symbols;
when the first working standby machine receives the second multicast event control symbol, it starts the second control transmission timing and continues periodic detection;
when the first working standby machine receives the third multicast event control symbol, it starts the third control transmission timing and continues periodic detection;
and so on;
when the first working standby machine receives the Nth multicast event control symbol, it starts the Nth control transmission timing;
the first, second, third, and subsequent control transmission timings up to the Nth are averaged to form the average transmission timing, which is taken as the host heartbeat period, namely:
$$T_a = \frac{T_{a1} + T_{a2} + T_{a3} + \cdots + T_{aN}}{N}$$
where $T_{a1}$ is the first control transmission timing;
$T_{a2}$ is the second control transmission timing;
$T_{a3}$ is the third control transmission timing;
$T_{aN}$ is the Nth control transmission timing;
and N is the number of control transmission timings.
This step initializes the heartbeat communication between host and standby. The host sends multicast event control symbols to the standby machine with period T. On receiving the first multicast event control symbol, the standby machine records time T0, starts its standby control program, and from then on continuously checks whether the multicast event control symbols sent by the host arrive periodically. It records time T1 on receiving the second symbol, and likewise T2 and T3. The arithmetic mean Ta of the three transmission times corresponding to T1, T2, and T3 is recorded as the period with which the standby machine detects the host heartbeat, which completes the initial heartbeat setup between host and standby.
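The calculation of Ta can be sketched from the arrival times recorded by the standby machine. This sketch assumes that each "control transmission timing" is the interval between two consecutive arrivals (T1 - T0, T2 - T1, T3 - T2); the function name `average_period` is hypothetical.

```python
from typing import Sequence


def average_period(arrivals: Sequence[float]) -> float:
    """Return the mean interval between consecutive heartbeat arrivals.

    `arrivals` holds the arrival times T0, T1, ..., TN of successive
    multicast event control symbols, in any consistent time unit.
    """
    if len(arrivals) < 2:
        raise ValueError("at least two arrival times are needed")
    intervals = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return sum(intervals) / len(intervals)


# Intervals of 10, 11 and 9 time units average to 10.
ta = average_period([0.0, 10.0, 21.0, 30.0])
```

Averaging several intervals rather than using a single one smooths out jitter on any individual symbol.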
Because network transmission may jitter, the time limit Tl used to judge heartbeat loss can be set somewhat wider than Ta; its value can be customized for the specific application scenario. After a host-standby switchover, only the multicast event control symbol transfer path between the new host and the new standby needs to be re-established; the configuration change to the RapidIO network as a whole is almost negligible, data service interaction is not affected, and the impact on the system's existing services is minimized.
The normal operation mode S101 further includes:
average transmit timing threshold function:
Figure SMS_7
where Tab+ is the upper threshold of the average transmission timing;
Tab- is the lower threshold of the average transmission timing;
Ta is the average transmission timing;
and Tg is the transmission timing error, whose value is determined according to the network transmission rate.
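The formula image for the threshold function is not reproduced in this text. One reading that is consistent with the symbol definitions, assuming a symmetric tolerance band around the average transmission timing (an assumption, not confirmed by the source), would be:

```latex
T_{ab}^{+} = T_a + T_g, \qquad T_{ab}^{-} = T_a - T_g
```

Under this reading, a measured interval would be treated as a normal heartbeat when it lies between the lower and upper thresholds.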
The normal operation mode S101 further includes:
heartbeat loss judgment function:
Figure SMS_8
The normal operation mode S101 further includes:
based on the control transmission period, the current host transmits multicast event control symbols to the first working standby machine;
when the first working standby machine receives the first multicast event control symbol, it makes the first fault transmission time record;
when the first working standby machine receives the second multicast event control symbol, it makes the second fault transmission time record;
the interval between the first and second fault transmission time records forms the transmission interval timing, which is taken as the fault interval period.
The standby machine continuously monitors the heartbeat sent by the host and records the interval Ti between two adjacent heartbeats. If Ti does not exceed Tl, the host is considered normal and the standby machine continues its detection loop. Otherwise, when Ti is greater than Tl, or the time elapsed since the last heartbeat exceeds Tl, the host is considered faulty and the detection loop stops. On host failure, the standby machine stops the detection loop and starts the network takeover program; its role switches from standby to host, and it takes over from the original host the maintenance and management of the whole RapidIO network. After the takeover, the new host repeats step 4 to select and wake up a new standby machine and start the subsequent operations.
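The detection rule can be sketched as a small monitor that the standby machine updates on every received symbol and polls in between. The class name `HeartbeatMonitor` and its methods are hypothetical; the limit passed in plays the role of Tl, and how Tl is derived from Ta is left to the caller.

```python
from typing import Optional


class HeartbeatMonitor:
    """Tracks heartbeat arrivals and flags a host fault when Tl is exceeded."""

    def __init__(self, limit: float) -> None:
        self.limit = limit  # Tl: maximum tolerated gap between heartbeats
        self.last_arrival: Optional[float] = None

    def on_symbol(self, now: float) -> bool:
        """Record an arrival; return True if the interval Ti was within Tl."""
        within_limit = (
            self.last_arrival is None or (now - self.last_arrival) <= self.limit
        )
        self.last_arrival = now
        return within_limit

    def host_failed(self, now: float) -> bool:
        """Poll: True once the time since the last heartbeat exceeds Tl."""
        return self.last_arrival is not None and (now - self.last_arrival) > self.limit
```

Polling `host_failed` covers the case where the host stops sending entirely, so no further symbol ever arrives to trigger `on_symbol`.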
The normal operation mode S101 further includes:
transmission interval timing function:
Figure SMS_9
the Tab is + An upper timing threshold for average transmission;
the Tab is - A lower threshold for average transmit timing;
the Ta is average sending timing;
and the Tg is a transmission timing error, and the value of the Tg is determined according to the network transmission rate.
The normal operation mode S101 further includes:
heartbeat loss judgment function:
Figure SMS_10
The normal operation mode S101 further includes:
based on the control transmission period, the current host transmits a multicast event controller to the first working standby machine;
if the first working standby machine receives the first multicast event controller, a first fault sending time record is started;
if the first working standby machine receives the second multicast event controller, a second fault sending time record is started;
the first fault sending time record and the second fault sending time record form a transmission interval timing, which is counted as a fault interval period.
The normal operation mode S101 further includes:
transmission interval timing function:
QT=T2-T1;
where QT is the transmission interval timing;
T2 is the second fault sending time record;
T1 is the first fault sending time record;
host fault threshold function:
Figure SMS_11
where TAB+ is the upper limit of the fault interval period threshold;
TAB- is the lower limit of the fault interval period threshold;
TA is the average interval timing;
TG is the fault interval error, whose value is determined according to the network transmission rate;
host fault determination function:
Figure SMS_12
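The interval-based check can be sketched as follows. The sketch computes QT = T2 - T1 for each pair of consecutive reception times and compares it against the bounds TAB- and TAB+, which are passed in as parameters. Treating a QT outside those bounds as a host fault is an assumption, since the host fault determination function is available only as a figure. The function name `first_fault_index` is hypothetical.

```python
from typing import Optional, Sequence


def first_fault_index(
    stamps: Sequence[float], tab_lower: float, tab_upper: float
) -> Optional[int]:
    """Scan reception timestamps and return the index of the first
    reception whose interval QT = T2 - T1 falls outside
    [tab_lower, tab_upper], or None if every interval is acceptable.

    Assumption: an out-of-bounds interval means the host is faulty.
    """
    for i in range(1, len(stamps)):
        qt = stamps[i] - stamps[i - 1]  # QT = T2 - T1
        if not (tab_lower <= qt <= tab_upper):
            return i
    return None
```

For example, with bounds of 8 and 12 time units, the sequence 0, 10, 20, 35 is flagged at the fourth reception, because its interval of 15 exceeds the upper bound.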
Meanwhile, the invention also provides a multi-machine hot backup system for a network system based on the event controller, comprising a multi-machine hot backup platform for implementing the multi-machine hot backup method, wherein the multi-machine hot backup platform comprises a normal working module and an abnormal working module;
the normal working module is used for: once network enumeration is finished, the current host selects and wakes up a standby machine to form a first working standby machine, and establishes heartbeat communication through the multicast event controller (S1002);
the abnormal working module is used for: if the host in communication fails, the first working standby machine takes over from the current host to form a working host, another standby machine selected and woken up by the working host forms a second working standby machine, and heartbeat communication is reestablished through the multicast event controller (S1003).
Meanwhile, the invention also provides a multi-machine hot backup system for a network system based on the event controller, comprising a multi-machine hot backup platform for implementing the multi-machine hot backup method, wherein the multi-machine hot backup platform comprises a normal working module and an abnormal working module. The normal working module is used for: once network enumeration is finished, the current host selects and wakes up a standby machine to form a first working standby machine, and establishes heartbeat communication through the multicast event controller. The abnormal working module is used for: if the host in communication fails, the first working standby machine takes over from the current host to form a working host, another standby machine selected and woken up by the working host forms a second working standby machine, and heartbeat communication is reestablished through the multicast event controller. This demonstrates the practicability of the system.
Embodiment one:
To highlight the implementation of the present solution, the RapidIO topology shown above includes only a limited number of endpoint devices; the endpoint devices with network management functions are the host and endpoints a, d and e. The solution is implemented as follows:
1) The host initiates network enumeration to obtain a network topology comprising all endpoint devices and switching devices, including the network path between the host and any device in the network;
2) Based on the network topology obtained in the previous step, endpoint a is selected as the standby machine and woken up. The path between the host and endpoint a is planned as P = {(SW1, 0, 5), (SW2, 11, 1)}, which lists every switch on the full path from host to standby machine together with its ingress and egress ports;
3) The switches in P are configured one by one: multicast event controller forwarding is enabled on SW1 port 5 and SW2 port 1, which completes the heartbeat communication transmission path between the host and the standby machine;
4) The initial setup of heartbeat communication between the host and the standby machine is executed:
a) The host sends the multicast event control symbol at a fixed interval Tcycle;
b) The standby machine receives the first multicast event control symbol at time T0 and starts the standby control program;
c) The standby machine receives the second multicast event control symbol at time T1 and, in the same way, obtains T2 and T3; the arithmetic mean Ta of the time taken for the host to deliver a multicast event control symbol to the standby machine over the three receptions T1, T2 and T3 is recorded as the period with which the standby machine detects the host heartbeat;
5) Taking into account network transmission, the service scenario and other conditions, the heartbeat loss judgment time limit is determined as Tl = 2Ta;
6) At this point the host has completed all configuration for heartbeat communication and can begin deploying network data services; application-layer data interaction across the whole RapidIO network then starts in full. Because the real-time load and link state of the network may change dynamically, the paths between the host and the network may also fail;
7) The standby machine continuously detects the heartbeat information sent by the host and records the time interval Ti between two adjacent heartbeats. If Ti does not exceed Tl, the host is regarded as normal and the standby machine continues to wait in a loop; otherwise, when Ti is greater than Tl, or the time since the last heartbeat exceeds Tl, the host is regarded as faulty and the loop detection stops;
8) The standby machine starts the network takeover program and completes the role switch from standby machine to host, replacing the original host in maintaining and managing the whole RapidIO network;
9) Based on the current network topology, endpoint d is selected as the standby machine and an attempt is made to wake it up;
10) Endpoint d fails to wake up, so endpoint e is reselected as the standby machine and an attempt is made to wake it up;
11) Endpoint e is woken up successfully, and the path between the current host (endpoint a) and endpoint e is planned as P = {(SW2, 1, 11), (SW1, 5, 13), (SW3, 4, 15), (SW4, 11, 4)}. The switches in P are configured one by one: multicast event controller forwarding is enabled on SW2 port 11, SW1 port 13, SW3 port 15 and SW4 port 4, which reestablishes the heartbeat communication transmission path between the host and the standby machine;
12) Step 4 above and the subsequent operations are repeated. This realizes the multi-machine hot backup method for the RapidIO network system based on the multicast controller described in this patent.
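The heartbeat setup and detection in steps 4 to 7 can be sketched roughly as below. The sketch takes Ta to be the mean of the intervals between consecutive reception times T0 to T3, which is one possible reading of step 4c and is an assumption here. The function names `calibrate` and `host_alive` are hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def calibrate(receive_times: Sequence[float]) -> Tuple[float, float]:
    """Derive (Ta, Tl) from the first reception times, e.g. T0..T3.

    Assumption: Ta is the arithmetic mean of the intervals between
    consecutive receptions. Tl = 2 * Ta follows step 5.
    """
    if len(receive_times) < 2:
        raise ValueError("need at least two reception times")
    intervals = [b - a for a, b in zip(receive_times, receive_times[1:])]
    ta = mean(intervals)
    return ta, 2 * ta


def host_alive(last_heartbeat: float, now: float, tl: float) -> bool:
    """Step 7: the host is considered normal while the time since the
    last heartbeat does not exceed Tl."""
    return (now - last_heartbeat) <= tl
```

A standby machine would call `calibrate` once after the first few control symbols arrive, then evaluate `host_alive` in its detection loop and start the takeover program the first time it returns `False`.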
Working principle:
The scheme adopts a multi-machine network of one host and multiple standby machines, based on the interconnection architecture of RapidIO network packet switching. Normal operation mode: once network enumeration is finished, the current host selects and wakes up a standby machine to form a first working standby machine, and establishes heartbeat communication through the multicast event controller. Abnormal operation mode: if the host in communication fails, the first working standby machine takes over from the current host to form a working host, another standby machine selected and woken up by the working host forms a second working standby machine, and heartbeat communication is reestablished through the multicast event controller. In short, when the host fails, the current standby machine takes over the network to become the new host and wakes up a standby machine in the dormant state to create new host-standby heartbeat communication.
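The takeover sequence can be sketched as follows, under stated assumptions. Path entries use the (switch, ingress port, egress port) layout from Embodiment one, and forwarding is enabled on each switch's egress port, as in the port lists given there. The `wake` callback and all function names are hypothetical stand-ins for device operations that the source does not specify.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Hop = Tuple[str, int, int]  # (switch, ingress port, egress port)


def select_standby(
    candidates: Iterable[str], wake: Callable[[str], bool]
) -> Optional[str]:
    """Try candidates in order and return the first one that wakes up."""
    for endpoint in candidates:
        if wake(endpoint):
            return endpoint
    return None


def ports_to_enable(path: Iterable[Hop]) -> List[Tuple[str, int]]:
    """List the (switch, egress port) pairs on which multicast event
    forwarding has to be enabled along a planned path."""
    return [(switch, egress) for switch, _ingress, egress in path]


def take_over(
    standby: str, candidates: Iterable[str], wake: Callable[[str], bool]
) -> Tuple[str, Optional[str]]:
    """Promote the standby machine to host, then pick a new standby
    machine from the remaining candidates."""
    new_host = standby
    remaining = [c for c in candidates if c != new_host]
    return new_host, select_standby(remaining, wake)
```

With candidates a, d and e, and a `wake` callback that succeeds only for e, `take_over("a", ...)` yields a as the new host and e as the new standby machine, mirroring the retry after endpoint d fails to wake.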
It should be noted herein that any process or method descriptions that are otherwise described may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and that scope of preferred embodiments of the present invention includes additional implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention. The processor performs the various methods and processes described above. For example, method embodiments in the present solution may be implemented as a software program tangibly embodied on a machine-readable medium, such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via memory and/or a communication interface. One or more of the steps of the methods described above may be performed when a software program is loaded into memory and executed by a processor. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above in any other suitable manner (e.g., by means of firmware).
The logic and/or steps described elsewhere herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the present invention.

Claims (10)

1. The multi-machine hot backup method of the network system based on the event controller is characterized by comprising the following steps:
multi-machine network of one host machine and multiple standby machines based on the interconnection system structure of the rapidIO network data packet exchange:
normal operation mode: if the network enumeration is finished, the current host selects and wakes up a standby machine to form a first working standby machine, and establishes heartbeat communication through the multicast event controller;
abnormal operation mode: if the host in communication fails, the first working standby machine takes over the current host to form a working host, the other standby machine selected and awakened by the working host forms a second working standby machine, and heartbeat communication is reestablished through the multicast event controller.
2. The multi-machine hot standby method according to claim 1, wherein the normal operation mode further comprises:
in the initial stage of the system, the current host detects and discovers all the standby machines of the whole multi-machine network by initiating network enumeration operation, and establishes rapidIO channels of the current host and all the standby machines.
3. The multi-machine hot standby method according to claim 1, wherein the normal operation mode further comprises:
and selecting and waking up the most applicable standby machine according to the comprehensively determined topological structure of the multi-machine network and the equipment physical property of each standby machine of the multi-machine network, and determining the most applicable standby machine as a first working standby machine.
4. A multi-machine hot standby method according to claim 3, wherein said normal operating mode further comprises:
if the standby machine is awakened, determining a heartbeat communication complete path between the current host machine and the first working standby machine, configuring the switching equipment and the first working standby machine one by one through maintenance packets according to the heartbeat communication complete path, starting a port multicast event controller of the first working standby machine to send enabling, and establishing a multicast event controller transmission path between the current host machine and the first working standby machine;
if the standby machine is not awakened, the current host machine reselects and wakes the standby machine.
5. The multi-machine hot standby method according to claim 4, wherein the normal operation mode further comprises:
based on the control transmission period, the current host transmits a multicast event controller to the first working machine;
if the first work machine receives the first multicast event controller, starting a first control sending timing and periodically detecting the multicast event controller;
if the first working machine receives the second multicast event controller, starting a second control sending timing and periodically detecting the multicast event controller;
if the first working machine receives the third multicast event controller, starting third control sending timing and periodically detecting the multicast event controller;
and so on;
if the first working machine receives the Nth multicast event controller, starting the Nth control sending timing;
averaging the first control transmit timing, the second control transmit timing, and the third control transmit timing to form an average transmit timing, and counting as a host heartbeat cycle, namely:
$$Ta = \frac{Ta_1 + Ta_2 + Ta_3 + \cdots + Ta_N}{N}$$
where $Ta_1$ is the first control transmit timing;
$Ta_2$ is the second control transmit timing;
$Ta_3$ is the third control transmit timing;
$Ta_N$ is the Nth control transmit timing;
and N is the number of control transmit timings.
6. The multi-machine hot standby method according to claim 5, wherein the normal operation mode further comprises:
average transmit timing threshold function:
Figure QLYQS_2
where Tab+ is the upper threshold of the average transmit timing;
Tab- is the lower threshold of the average transmit timing;
Ta is the average transmit timing;
Tg is the transmit timing error, whose value is determined according to the network transmission rate.
7. The multi-machine hot standby method according to claim 6, wherein the normal operation mode further comprises:
heartbeat loss judgment function:
Figure QLYQS_3
8. The multi-machine hot standby method according to claim 4, wherein the normal operation mode further comprises:
based on the control transmission period, the current host transmits a multicast event controller to the first working standby machine;
if the first working standby machine receives the first multicast event controller, a first fault sending time record is started;
if the first working standby machine receives the second multicast event controller, a second fault sending time record is started;
the first fault sending time record and the second fault sending time record form a transmission interval timing, which is counted as a fault interval period.
9. The multi-machine hot standby method according to claim 8, wherein the normal operation mode further comprises:
transmission interval timing function:
QT=T2-T1;
where QT is the transmission interval timing;
T2 is the second fault sending time record;
T1 is the first fault sending time record;
host fault threshold function:
Figure QLYQS_4
where TAB+ is the upper limit of the fault interval period threshold;
TAB- is the lower limit of the fault interval period threshold;
TA is the average interval timing;
TG is the fault interval error, whose value is determined according to the network transmission rate;
host fault determination function:
Figure QLYQS_5
10. A multi-machine hot backup system for a network system based on the event controller, characterized by comprising a multi-machine hot backup platform for implementing the multi-machine hot backup method according to any one of claims 1 to 9, wherein the multi-machine hot backup platform comprises a normal working module and an abnormal working module;
the normal operation module is used for: if the network enumeration is finished, the current host selects and wakes up a standby machine to form a first working standby machine, and establishes heartbeat communication through the multicast event controller;
the abnormal working module is used for: if the host in communication fails, the first working standby machine takes over the current host to form a working host, the other standby machine selected and awakened by the working host forms a second working standby machine, and heartbeat communication is reestablished through the multicast event controller.
CN202310491209.1A 2023-05-05 2023-05-05 Multi-machine hot backup method and system for network system based on event controller Active CN116232864B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310491209.1A CN116232864B (en) 2023-05-05 2023-05-05 Multi-machine hot backup method and system for network system based on event controller

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310491209.1A CN116232864B (en) 2023-05-05 2023-05-05 Multi-machine hot backup method and system for network system based on event controller

Publications (2)

Publication Number Publication Date
CN116232864A true CN116232864A (en) 2023-06-06
CN116232864B CN116232864B (en) 2023-07-14

Family

ID=86573478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310491209.1A Active CN116232864B (en) 2023-05-05 2023-05-05 Multi-machine hot backup method and system for network system based on event controller

Country Status (1)

Country Link
CN (1) CN116232864B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103001867A (en) * 2012-12-27 2013-03-27 中航(苏州)雷达与电子技术有限公司 Host-standby machine duplicated hot-backup system and method
CN103746910A (en) * 2013-11-28 2014-04-23 苏州长风航空电子有限公司 RapidIO network recursive enumeration method
WO2018166308A1 (en) * 2017-03-13 2018-09-20 中兴通讯股份有限公司 Distributed nat dual-system hot backup traffic switching system and method
CN108183762A (en) * 2017-12-28 2018-06-19 天津芯海创科技有限公司 The method for synchronizing time of RapidIO network systems and RapidIO network systems
CN112511394A (en) * 2020-11-05 2021-03-16 中国航空工业集团公司西安航空计算技术研究所 Management and maintenance method of RapidIO bus system
CN114356665A (en) * 2021-12-23 2022-04-15 中国航空工业集团公司西安航空计算技术研究所 Comprehensive photoelectric signal processing computing resource management method
CN114244466A (en) * 2021-12-29 2022-03-25 中国航空工业集团公司西安航空计算技术研究所 Distributed time synchronization method and system of RapidIO network system
CN116032731A (en) * 2023-03-28 2023-04-28 井芯微电子技术(天津)有限公司 Method and device for realizing hot backup of RapidIO network system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘永胜: "《基于DevieNet总线的嵌入式控制器双机热备系统》", 《《中国优秀硕士学位论文全文数据库 信息科技辑》》 *
高逸龙: "《RapidIO网络集群管理技术》", 《通信技术》, vol. 53, no. 5 *

Also Published As

Publication number Publication date
CN116232864B (en) 2023-07-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant