CN115150322B - Multichannel RapidIO distribution system and fault self-isolation method thereof - Google Patents
- Publication number
- CN115150322B
- Authority
- CN
- China
- Prior art keywords
- data
- port
- fault
- output port
- rapidio
- Prior art date
- 2022-09-06
- Legal status
- Active
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/22—Alternate routing
- H04L45/28—Routing or path finding of packets in data switching networks using route fault recovery
- H04L49/00—Packet switching elements
- H04L49/25—Routing or path finding in a switch fabric
- H04L49/55—Prevention, detection or correction of errors
- H04L49/552—Prevention, detection or correction of errors by ensuring the integrity of packets received through redundant connections
- H04L49/555—Error detection
- H04L49/557—Error correction, e.g. fault recovery or fault tolerance
Abstract
The invention provides a multichannel RapidIO distribution system and a fault self-isolation method for it. The system automatically detects faults and isolates the affected ports, adds backup processing nodes to the distribution system in real time, reduces the coupling between the data source and the processing nodes, simplifies the distribution logic the data source must implement, and preserves the processing capacity of the whole system. The RapidIO switch is configured so that it can detect faults on its interconnection channels; when a fault occurs, the affected port automatically discards packets, so the ports serving the other processing nodes continue to receive data normally, which achieves automatic isolation. In addition, when the management node of the RapidIO switch finds that the state of a port is abnormal, it modifies the routing table: data destined for the faulty port are routed to the backup output port, and the faulty port is removed from the routing table. The system therefore keeps its processing capacity when a node fails.
Description
Technical Field
The invention relates to the technical field of data communication, in particular to a multichannel RapidIO distribution system and a fault self-isolation method thereof.
Background
In edge computing, data are processed and computed locally, immediately after acquisition, to meet real-time processing requirements. Edge computing involves strict real-time constraints, large data volumes, and heavy computation, so data transmission is a key enabling technology. The front end generates massive amounts of data whose processing demands enormous computing capacity; a single CPU or processing node cannot meet this demand, so the data must be distributed to multiple processing nodes working in parallel. RapidIO is an open interconnect designed for high-performance data transmission; its high real-time performance, low transmission latency, and high bandwidth make it well suited as a real-time data distribution bus. Data are distributed to the processing nodes through a RapidIO switch. Because the RapidIO physical layer uses reliable point-to-point transmission, a receiving port that does not receive data correctly returns no acknowledgment to the sender, and the sending port keeps the data in its buffer. In edge computing, the upstream data source continuously generates data that must be distributed and computed in real time. When one processing node or port fails, buffer congestion builds up on the RapidIO switch ports that carry traffic from the upstream data source. The data source must then detect the abnormal port state promptly from physical-layer control symbols and stop sending data to the affected processing node through complex processing logic; otherwise the entire data distribution system is paralyzed.
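The back-pressure behavior described above can be illustrated with a minimal Python sketch of one acknowledgment-gated transmit buffer. It is a toy model, not the RapidIO link protocol: the class `AckGatedBuffer`, the helper `run`, the 4-packet capacity, and the assumption that a healthy receiver acknowledges every packet immediately are all hypothetical.

```python
from collections import deque


class AckGatedBuffer:
    """Hypothetical transmit buffer: a packet is held until the receiver acknowledges it."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pending = deque()  # IDs of packets still awaiting acknowledgment

    def try_send(self, packet_id: int) -> bool:
        """Accept a packet if a slot is free; False means the sender is back-pressured."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append(packet_id)
        return True

    def on_ack(self) -> None:
        """Release the oldest outstanding packet when its acknowledgment arrives."""
        if self.pending:
            self.pending.popleft()


def run(receiver_alive: bool, packets: int, capacity: int = 4) -> int:
    """Offer `packets` packets to the buffer and return how many were accepted."""
    buf = AckGatedBuffer(capacity)
    accepted = 0
    for pid in range(packets):
        if buf.try_send(pid):
            accepted += 1
            if receiver_alive:
                buf.on_ack()  # assumed: a healthy receiver acknowledges at once
    return accepted


print(run(receiver_alive=True, packets=10))   # 10
print(run(receiver_alive=False, packets=10))  # 4
```

With a dead receiver, every packet after the buffer fills is refused; in the real system that refusal is what propagates upstream as congestion.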
In the prior art, the data source needs dedicated fault detection logic and corresponding data scheduling logic, which makes its design complex. When a processing node or port in the data distribution system fails and the data source detects the fault in time and shuts off traffic to the affected processing node, the other ports keep working, but data are lost; if the fault is not detected in time, back-pressure from the filled buffers paralyzes the entire data distribution system. The data source is tightly coupled to every processing node and must continuously monitor the state of each node and channel, which lowers system reliability. Finally, backup channels and backup processing nodes are not managed, so they cannot be added to the processing system in real time.
Disclosure of Invention
In view of this, the invention provides a multichannel RapidIO distribution system and a fault self-isolation method thereof, which automatically detect faults and isolate ports, add backup processing nodes to the distribution system in real time, reduce the coupling between the data source and the processing nodes, simplify the distribution logic design of the data source, and preserve the processing capacity of the whole system.
To achieve this object, the technical solution of the invention is as follows:
The invention discloses a multichannel RapidIO distribution system based on a fault self-isolation and backup mechanism. The system comprises a data source, a port zero, output ports, processing nodes, a backup output port, a backup processing node, and a management node. Port zero is the port through which data from the data source are routed to the output ports. Each output port corresponds to one processing node, and the data source distributes data through port zero to each output port and on to each processing node. Port zero and each output port have their own buffers, and data in a buffer are released only after the processing node has received them correctly and returned an acknowledgment packet. The management node configures the routing table of the RapidIO switch and monitors how long data reside in the buffers; when the residence time of data in a buffer exceeds a residence time threshold, it triggers the port that owns the buffer to discard packets automatically, which isolates the faulty port automatically. Meanwhile, the management node continuously monitors the status registers of the output ports. When it detects that an output port is continuously discarding packets, it judges that port to be faulty, immediately modifies the routing table so that data previously routed to that port go to the backup output port, and thereby isolates the fault completely.
The residence time threshold is set according to the data transmission bandwidth and the buffer size.
The system can be applied to data distribution in edge computing.
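The text does not give a formula for the threshold. One plausible reading, offered here only as an assumption, is that a healthy port drains a full buffer of size $S_{\text{buf}}$ at link rate $R$ within $S_{\text{buf}}/R$, so a residence time well beyond that points to a stalled port. Rearranging relates the figures of the embodiment described later (12.4 Gbit/s and a 1.2 µs threshold) to an equivalent amount of buffered data, assuming the whole stream passes through one port during its time slice:

```latex
T_{\text{th}} \ge \frac{S_{\text{buf}}}{R}
\quad\Longrightarrow\quad
S_{\text{buf}} \le R \, T_{\text{th}}
  = \left(12.4 \times 10^{9}\ \text{bit/s}\right) \times \left(1.2 \times 10^{-6}\ \text{s}\right)
  = 14\,880\ \text{bit} = 1\,860\ \text{bytes}
```

Under this assumption, a 1.2 µs threshold covers roughly 1.86 KB of data draining at full rate; the actual buffer size of the switch chip is not stated in the text.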
The invention also provides a fault self-isolation method for the multichannel RapidIO distribution system, implemented with the system of the invention. The method comprises the following steps: after power-on, the management node configures the routing table of the RapidIO switch, binding each destination ID to an output port; the management node enables buffer data monitoring, configures the data residence time threshold, and starts continuously scanning the status registers of all output ports while monitoring how long data reside in the buffers; when the residence time of data in a buffer exceeds the threshold, the port is triggered to discard packets automatically, which isolates the faulty port automatically; when an output port is detected to be continuously discarding packets, it is judged faulty, the routing table is modified immediately, and data previously routed to that port are routed to the backup output port, isolating the fault completely.
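The scanning and rerouting steps of the method can be sketched in Python as below. This is a hypothetical model, not CPS1848 or P2020 code: `PortStatus`, `ManagementNode`, the per-port discard counter standing in for the status register, and the rule that three consecutive scans with new discards mark a port as faulty are all assumptions, because the text only says the port "continuously" loses packets.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PortStatus:
    """Hypothetical snapshot of one output port's status register."""
    discard_count: int = 0  # cumulative packets dropped by the residence-time timer


@dataclass
class ManagementNode:
    routing_table: dict[int, int]  # destination ID -> output port, bound at power-on
    backup_port: int
    fault_scans: int = 3  # assumed: consecutive scans with new discards that mark a fault
    failed_ports: set[int] = field(default_factory=set, init=False)
    _last_discards: dict[int, int] = field(default_factory=dict, init=False, repr=False)
    _streak: dict[int, int] = field(default_factory=dict, init=False, repr=False)

    def scan(self, status: dict[int, PortStatus]) -> None:
        """One pass over every output port's status register."""
        for port, reg in status.items():
            previous = self._last_discards.get(port, 0)
            self._last_discards[port] = reg.discard_count
            # A port that dropped more packets since the last scan extends its streak.
            if reg.discard_count > previous:
                self._streak[port] = self._streak.get(port, 0) + 1
            else:
                self._streak[port] = 0
            if self._streak[port] >= self.fault_scans and port not in self.failed_ports:
                self._reroute(port)

    def _reroute(self, failed_port: int) -> None:
        """Send every destination ID bound to the failed port to the backup port."""
        for dest_id, port in self.routing_table.items():
            if port == failed_port:
                self.routing_table[dest_id] = self.backup_port
        self.failed_ports.add(failed_port)


# Usage: port 1 keeps discarding packets on three consecutive scans.
mgmt = ManagementNode(routing_table={1: 1, 2: 2, 3: 3, 4: 4}, backup_port=5)
drops = 0
for _ in range(3):
    drops += 10
    mgmt.scan({1: PortStatus(drops), 2: PortStatus(), 3: PortStatus(), 4: PortStatus()})
print(mgmt.routing_table)  # {1: 5, 2: 2, 3: 3, 4: 4}
```

Requiring several consecutive scans, rather than reacting to a single discard, is one way to avoid rerouting on a transient burst; the text leaves this criterion open.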
The residence time threshold is set according to the data transmission bandwidth and the buffer size.
Beneficial effects:
1. By configuring the RapidIO switch, the invention lets the switch detect faults on its interconnection channels. When a fault occurs, the affected port automatically discards packets, so the ports of the other processing nodes continue to receive data normally; this is automatic isolation. Meanwhile, when the management node of the RapidIO switch finds that the state of a port is abnormal, it modifies the routing table: data for the faulty port are routed to the backup output port, and the faulty port is removed from the routing table. The system therefore keeps its processing capacity when a node fails, the coupling between the data source and the processing nodes is reduced, system reliability is improved, and the distribution logic design of the data source is simplified.
2. In the data distribution system, the management node manages the entire data distribution architecture during RapidIO switching: it automatically detects faults, isolates ports, and adds backup processing nodes to the distribution system in real time without involving the data source node. The data source and the processing nodes are therefore loosely coupled, which improves system reliability, reduces their coupling, and simplifies the distribution logic design of the data source.
3. The method, built on the data distribution system, manages the RapidIO switch so that the switch detects faults on its interconnection channels. When a fault occurs, the affected port automatically discards packets so that the ports of the other processing nodes continue to receive data normally, achieving automatic isolation. Meanwhile, when the management node of the RapidIO switch finds that the state of a port is abnormal, it modifies the routing table, routing the data of the faulty port to the backup output port and removing the faulty port from the routing table. Because the management node manages the whole data distribution architecture, faults are found automatically, ports are isolated, and backup processing nodes join the distribution system in real time without involving the data source node; the data source and the processing nodes become loosely coupled, system reliability improves, and the distribution logic design of the data source is simplified.
Drawings
Fig. 1 is a schematic diagram of the RapidIO switch distribution system of the present invention.
Fig. 2 is a schematic diagram of an edge computing data distribution system according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a multichannel RapidIO distribution system based on a fault self-isolation and backup mechanism. When a node fails, the corresponding switch port isolates itself in real time without affecting data on the other ports, which improves system reliability. The management node detects the faulty port promptly, modifies the route mapping, removes the faulty node, and adds the backup processing node, preserving the processing capacity of the whole data distribution system.
As shown in Fig. 1, the distribution system of the present invention comprises a data source, a port zero, output ports, processing nodes, a backup output port, a backup processing node, and a management node. Each output port corresponds to one processing node, and the data source distributes data through port zero to each output port and on to each processing node. Port zero and each output port have their own buffers, and data in a buffer are released only after the processing node has received them correctly and returned an acknowledgment packet. When a processing node or port fails, the data buffered at its output port can never be released, so data routed from port zero can no longer be buffered at that output port; in turn, the data in port zero's buffer cannot be released either, and eventually the whole distribution system fails.
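The cascade described above can be reproduced with a small Python simulation. Every parameter is an assumption chosen for illustration (one packet per tick, round-robin destinations, 2-packet output buffers, a 4-packet port-zero buffer, immediate acknowledgment by healthy nodes); the point is only that strict first-in first-out forwarding through port zero lets one silent port stall all the others.

```python
from __future__ import annotations

from collections import deque


def simulate(ticks: int, failed_port: int | None, n_ports: int = 4,
             out_capacity: int = 2, in_capacity: int = 4) -> dict[int, int]:
    """Return the number of packets delivered to each output port after `ticks` steps."""
    port_zero = deque()  # destination port of each packet waiting in port zero's buffer
    occupancy = {p: 0 for p in range(1, n_ports + 1)}  # packets held in each output buffer
    delivered = {p: 0 for p in range(1, n_ports + 1)}
    next_dest = 1
    for _ in range(ticks):
        # Data source -> port zero; the source stalls while port zero's buffer is full.
        if len(port_zero) < in_capacity:
            port_zero.append(next_dest)
            next_dest = next_dest % n_ports + 1  # round-robin over ports 1..n_ports
        # Port zero -> output buffer, strictly first-in first-out.
        if port_zero and occupancy[port_zero[0]] < out_capacity:
            occupancy[port_zero.popleft()] += 1
        # An output buffer is released only when its processing node acknowledges.
        for p in occupancy:
            if p != failed_port:
                delivered[p] += occupancy[p]
                occupancy[p] = 0
    return delivered


print(simulate(8, failed_port=None))  # {1: 2, 2: 2, 3: 2, 4: 2}
print(simulate(40, failed_port=1))    # {1: 0, 2: 2, 3: 2, 4: 2}
```

Once port 1's buffer is full, the packet for port 1 at the head of port zero cannot move, so ports 2 to 4 stop receiving data as well, no matter how long the simulation runs.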
The management node configures the routing table of the RapidIO switch, that is, it binds each destination ID carried in RapidIO packets to an output port, so that data are routed from port zero to each output port. The management node monitors how long data reside in each buffer; when the residence time exceeds the residence time threshold, it triggers the port that owns the buffer to discard packets automatically, which isolates the faulty port automatically. The residence time threshold is set according to the data transmission bandwidth and the buffer size. Meanwhile, the management node continuously monitors the status registers of the output ports. When it detects that an output port is continuously discarding packets, it judges that port to be faulty, immediately modifies the routing table so that data previously routed to that port go to the backup output port, isolates the fault completely, and preserves the processing capacity of the whole system.
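Residence-time discard can be modeled as a buffer that timestamps each packet and drops any packet held longer than the threshold. The sketch below is a hypothetical Python model, not the CPS1848 register interface: `OutputPort`, `accept`, `acknowledge`, `expire`, and `discard_count` (standing in for the status-register counter the management node scans) are illustrative names, and the 2-packet capacity and 0.5 µs arrival interval are assumed.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class OutputPort:
    """Output-port buffer that discards packets held longer than threshold_us."""
    threshold_us: float
    capacity: int
    arrivals: deque = field(default_factory=deque)  # arrival time (microseconds) per packet
    discard_count: int = 0  # packets dropped on timeout

    def expire(self, now_us: float) -> None:
        """Drop packets, oldest first, whose residence time exceeds the threshold."""
        while self.arrivals and now_us - self.arrivals[0] > self.threshold_us:
            self.arrivals.popleft()
            self.discard_count += 1

    def accept(self, now_us: float) -> bool:
        """Buffer a packet arriving from port zero; False means back-pressure."""
        self.expire(now_us)
        if len(self.arrivals) >= self.capacity:
            return False
        self.arrivals.append(now_us)
        return True

    def acknowledge(self) -> None:
        """A healthy processing node's acknowledgment releases the oldest packet."""
        if self.arrivals:
            self.arrivals.popleft()


# Usage: a failed node never acknowledges; a packet arrives every 0.5 microseconds.
port = OutputPort(threshold_us=1.2, capacity=2)
accepted = 0
for k in range(10):
    if port.accept(now_us=k * 0.5):
        accepted += 1
print(accepted, port.discard_count)  # 7 5
```

Without the `expire` step, the same failed port would accept only its first 2 packets and then block forever; with it, the buffer keeps freeing space for port zero while `discard_count` keeps rising, which is the signal the management node watches for.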
The invention also provides a fault self-isolation method for the multichannel RapidIO distribution system, implemented on the distribution system described above. The method comprises the following steps: after power-on, the management node configures the routing table of the RapidIO switch, binding each destination ID to an output port; the management node enables buffer data monitoring, configures the data residence time threshold, and starts continuously scanning the status registers of all output ports while monitoring how long data reside in the buffers; when the residence time of data in a buffer exceeds the threshold, the port is triggered to discard packets automatically, which isolates the faulty port automatically; when an output port is detected to be continuously discarding packets, it is judged faulty, the routing table is modified immediately, and data previously routed to that port are routed to the backup output port, isolating the fault completely and preserving the processing capacity of the whole system.
The invention can be applied to data distribution in edge computing. A specific embodiment, an edge computing data distribution system, is shown in Fig. 2: the data source is a data acquisition board, data switching uses a CPS1848 RapidIO switch chip, the management node is a P2020 processor, and the processing nodes are 5 computing blades, 1 of which is reserved as a backup. The data rate is 12.4 Gbit/s, and the data must be distributed to 4 computing blades for concurrent processing. The fault self-isolation method comprises the following steps:
1. The management node configures the routing table, binding each destination ID to an output port.
2. The management node enables buffer monitoring, sets the maximum data residence time to 1.2 µs, and starts continuously scanning the status registers of all ports.
3. After receiving handshake signals from all computing blades, the data acquisition board starts sending data. In normal operation, the data are sent to the 4 computing blades through the RapidIO switch chip by time division: the first time slice goes to the first computing blade, the second to the second, the third to the third, and the fourth to the fourth, and the cycle repeats so that every blade receives data continuously. When the first computing blade fails, port one receives no acknowledgment packets, so the data in its buffer cannot be released and remain there. Once the residence time exceeds 1.2 µs, port one automatically discards packets, which keeps free space in its buffer for data routed from the internal port zero.
4. The management node continuously queries the status register of each port. When it finds that port one keeps discarding packets, it judges the port abnormal and modifies the routing table; in this embodiment, the data for port one are routed to port five, completing fault self-isolation.
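The time-sliced distribution and the routing-table change in this embodiment can be sketched as follows. The destination IDs 1 to 4 for the working blades, the functions `port_for_slice` and `fail_over`, and the dictionary standing in for the CPS1848 routing table are hypothetical; only the port numbers (port one failing over to port five) come from the text.

```python
from __future__ import annotations

ACTIVE_DEST_IDS = [1, 2, 3, 4]  # hypothetical destination IDs of the four working blades


def port_for_slice(slice_index: int, routing_table: dict[int, int]) -> int:
    """Time-division distribution: slice k goes to blade k mod 4, via the routing table."""
    dest_id = ACTIVE_DEST_IDS[slice_index % len(ACTIVE_DEST_IDS)]
    return routing_table[dest_id]


def fail_over(routing_table: dict[int, int], failed_port: int,
              backup_port: int) -> dict[int, int]:
    """Return a new table with every destination on failed_port moved to backup_port."""
    return {dest: (backup_port if port == failed_port else port)
            for dest, port in routing_table.items()}


table = {1: 1, 2: 2, 3: 3, 4: 4}  # destination ID -> output port, bound at power-on
before = [port_for_slice(k, table) for k in range(8)]
table = fail_over(table, failed_port=1, backup_port=5)
after = [port_for_slice(k, table) for k in range(8)]
print(before)  # [1, 2, 3, 4, 1, 2, 3, 4]
print(after)   # [5, 2, 3, 4, 5, 2, 3, 4]
```

The data source keeps addressing the same destination IDs throughout; only the switch's mapping changes, which is the loose coupling the description claims. Building a new table and swapping it in is a design choice of this sketch, not something the text specifies.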
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (5)
1. A multichannel RapidIO distribution system based on a fault self-isolation and backup mechanism, characterized by comprising a data source, a port zero, output ports, processing nodes, a backup output port, a backup processing node, and a management node; each output port corresponds to one processing node, and the data source distributes data through port zero to each output port and on to each processing node; port zero and each output port have corresponding buffers, and data in a buffer are released only after the processing node has correctly received the data and returned an acknowledgment packet; the management node configures the routing table of the RapidIO switch, monitors the residence time of data in the buffers, and, when the residence time of data in a buffer exceeds a residence time threshold, triggers the port corresponding to that buffer to discard packets automatically, thereby automatically isolating the faulty port; meanwhile, the management node continuously monitors the status registers of the output ports and, upon detecting that an output port is continuously discarding packets, judges that output port to be faulty, immediately modifies the routing table, and routes data originally routed to that output port to the backup output port, thereby completely isolating the fault.
2. The system of claim 1, wherein the residence time threshold is set based on a data transfer bandwidth and a buffer size.
3. The system of claim 1 or 2, wherein the system is applied to edge computing data distribution.
4. A method for fault self-isolation of a multichannel RapidIO distribution system, implemented using the system according to any one of claims 1 to 3, comprising the steps of: after power-on, the management node configures the routing table of the RapidIO switch, binding each destination ID to an output port; the management node enables buffer data monitoring, configures a data residence time threshold, and starts continuously scanning the status registers of all output ports to monitor the residence time of data in the buffers; when the residence time of data in a buffer exceeds the set threshold, the port is triggered to discard packets automatically, thereby automatically isolating the faulty port; upon detecting that an output port is continuously discarding packets, judging that output port to be faulty, immediately modifying the routing table, and routing data originally routed to that output port to the backup output port, thereby completely isolating the fault.
5. The method of claim 4, wherein the residence time threshold is set based on a data transmission bandwidth and a buffer size.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211081103.6A (CN115150322B) | 2022-09-06 | 2022-09-06 | Multichannel RapidIO distribution system and fault self-isolation method thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115150322A | 2022-10-04 |
CN115150322B | 2022-11-25 |
Family ID: 83416392
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211081103.6A (CN115150322B, Active) | 2022-09-06 | 2022-09-06 | Multichannel RapidIO distribution system and fault self-isolation method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115150322B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN202059375U (en) * | 2011-05-20 | 2011-11-30 | 广州励丰声光科技有限公司 | Device for automatic fault detection and hot standby of power amplifier |
CN105281304A (en) * | 2015-12-02 | 2016-01-27 | 国网上海市电力公司 | Quick feeder fault positioning and isolating method |
CN110704250A (en) * | 2019-09-23 | 2020-01-17 | 天津津航计算技术研究所 | Hot backup device of distributed system |
CN110708245A (en) * | 2019-09-29 | 2020-01-17 | 华南理工大学 | SDN data plane fault monitoring and recovery method under multi-controller architecture |
CN112511394A (en) * | 2020-11-05 | 2021-03-16 | 中国航空工业集团公司西安航空计算技术研究所 | Management and maintenance method of RapidIO bus system |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7639001B2 (en) * | 2006-01-17 | 2009-12-29 | The Boeing Company | Built-in test for high speed electrical networks |
US9479434B2 (en) * | 2013-07-19 | 2016-10-25 | Fabric Embedded Tools Corporation | Virtual destination identification for rapidio network elements |
US10771369B2 (en) * | 2017-03-20 | 2020-09-08 | International Business Machines Corporation | Analyzing performance and capacity of a complex storage environment for predicting expected incident of resource exhaustion on a data path of interest by analyzing maximum values of resource usage over time |
US20220248296A1 (en) * | 2021-04-23 | 2022-08-04 | Intel Corporation | Managing session continuity for edge services in multi-access environments |
Non-Patent Citations (3)
Title |
---|
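A dynamic flow allocation method for the design of a software-defined real-time mesh network; Florian Greff et al.; 2017 IEEE 13th International Workshop on Factory Communication Systems (WFCS); 2017-07-27; pp. 1-11 * |
Application of RapidIO in distributed airborne sensor systems; Zhang Hongliang et al.; Telecommunication Engineering (《电讯技术》); 2022-06-30; Vol. 62, No. 6; pp. 734-741 * |
Fault self-recovery algorithm for management nodes in cloud storage systems; Ma Weijun et al.; Computer Systems & Applications (《计算机系统应用》); 2017-02-15; No. 2; pp. 114-119 * |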
Also Published As
Publication number | Publication date |
---|---|
CN115150322A (en) | 2022-10-04 |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |