CN115150322B - Multichannel RapidIO distribution system and fault self-isolation method thereof


Publication number
CN115150322B
Authority
CN
China
Prior art keywords
data
port
fault
output port
rapidio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211081103.6A
Other languages
Chinese (zh)
Other versions
CN115150322A (en)
Inventor
闫海明
陈瑞祥
刘峰巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongying Technology Co ltd
Original Assignee
Zhongying Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongying Technology Co ltd
Priority to CN202211081103.6A
Publication of CN115150322A
Application granted
Publication of CN115150322B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/22 Alternate routing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/28 Routing or path finding of packets in data switching networks using route fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/25 Routing or path finding in a switch fabric
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/55 Prevention, detection or correction of errors
    • H04L49/552 Prevention, detection or correction of errors by ensuring the integrity of packets received through redundant connections
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/55 Prevention, detection or correction of errors
    • H04L49/555 Error detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/55 Prevention, detection or correction of errors
    • H04L49/557 Error correction, e.g. fault recovery or fault tolerance

Abstract

The invention provides a multi-channel RapidIO distribution system and a fault self-isolation method for it. The system detects faults and isolates the affected ports automatically, adds backup processing nodes to the distribution system in real time, reduces the coupling between the data source and the processing nodes, simplifies the distribution logic of the data source, and preserves the processing capacity of the whole system. The RapidIO switch is configured so that it can detect a fault on an interconnection channel; when a fault occurs, the affected port automatically discards packets so that the ports of the other processing nodes continue to receive data normally, which achieves automatic isolation. In addition, when the management node of the RapidIO switch finds that a port is in an abnormal state, it modifies the routing table: data destined for the faulty port is routed to the backup output port and the faulty port is removed from the routing table, so the processing capacity of the system is preserved when a node fails.

Description

Multichannel RapidIO distribution system and fault self-isolation method thereof
Technical Field
The invention relates to the technical field of data communication, in particular to a multichannel RapidIO distribution system and a fault self-isolation method thereof.
Background
At present, in the field of edge computing, data is processed locally immediately after acquisition in order to meet real-time requirements. Edge computing is characterized by strict real-time constraints, large data volumes, and heavy computation, so data transmission is a key enabling technology. The front end generates massive amounts of data whose processing requires enormous computing capacity; a single CPU or processing node cannot meet the demand, so the data must be distributed to multiple processing nodes that operate in parallel. The RapidIO bus is an open interconnect technology designed for high-performance data transmission. It offers good real-time behavior, low transmission latency, and high bandwidth, which makes it suitable as a real-time data distribution bus. Data is distributed to the processing nodes through a RapidIO switch. The RapidIO physical layer uses reliable point-to-point transmission: when the receiving port does not receive the data correctly, no acknowledgment packet is returned to the sender, and the data at the sending port remains in its buffer. In edge computing, an upstream data source continuously produces data that must be distributed and processed in real time. When a processing node or port fails, the buffer of the RapidIO switch port connected to the upstream data source becomes congested. The data source must then detect the abnormal port state promptly through physical-layer control symbols and, using complex processing logic, stop sending data to the corresponding processing node; otherwise the whole data distribution system is paralyzed.
In the prior art, the data source needs dedicated fault detection logic and corresponding data scheduling logic, so the design complexity is high. When a processing node or port in the data distribution system fails and the data source detects the fault in time and stops the traffic to the affected processing node, the other ports of the data distribution system continue to work normally, but data is lost. If the fault is not detected in time, back pressure from the buffers paralyzes the whole data distribution system. The data source is tightly coupled to every processing node and must continuously monitor the state of each processing node and channel, so system reliability is low. There is also no management of backup channels and backup processing nodes, so they cannot be added to the processing system in real time.
Disclosure of Invention
In view of this, the invention provides a multi-channel RapidIO distribution system and a fault self-isolation method for it. The system detects faults and isolates the affected ports automatically, adds backup processing nodes to the distribution system in real time, reduces the coupling between the data source and the processing nodes, simplifies the distribution logic of the data source, and preserves the processing capacity of the whole system.
In order to achieve the purpose, the technical scheme of the invention is as follows:
The invention discloses a multi-channel RapidIO distribution system based on a fault self-isolation and backup mechanism, which comprises a data source, a port zero, output ports, processing nodes, a backup output port, a backup processing node and a management node, wherein the port zero is used for receiving a fault signal. Each output port corresponds to one processing node, and the data source distributes data through port zero to each output port and from there to each processing node. Port zero and each output port have their own buffers, and the data in a buffer can be released only after the processing node has received it correctly and replied with an acknowledgment packet. The management node configures the routing table of the RapidIO switch and monitors the residence time of data in the buffers; when the residence time of data in a buffer exceeds the residence time threshold, the port that owns the buffer is triggered to discard packets automatically, which achieves automatic isolation of the faulty port. At the same time, the management node continuously monitors the status registers of the output ports. When it detects that an output port keeps discarding packets, it judges that the output port is faulty, immediately modifies the routing table, and routes the data originally destined for that output port to the backup output port, which achieves complete isolation of the fault.
The residence time threshold is set according to the data transmission bandwidth and the buffer size.
The system is applied to edge computing data distribution.
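The text states only that the threshold depends on the transmission bandwidth and the buffer size; it gives no formula. One plausible reading, offered here purely as an assumption, is that the threshold should stay below the time a stalled buffer takes to fill at line rate. The function names and the `margin` parameter below are hypothetical and do not come from the patent.

```python
def buffer_fill_time_us(buffer_bytes: int, bandwidth_gbit_s: float) -> float:
    """Microseconds needed to fill a stalled buffer at the given line rate."""
    if buffer_bytes <= 0 or bandwidth_gbit_s <= 0:
        raise ValueError("buffer size and bandwidth must be positive")
    bits = buffer_bytes * 8
    bits_per_us = bandwidth_gbit_s * 1e3  # 1 Gbit/s equals 1000 bits per microsecond
    return bits / bits_per_us


def pick_threshold_us(buffer_bytes: int, bandwidth_gbit_s: float,
                      margin: float = 0.5) -> float:
    """Pick a residence time threshold as a fraction of the fill time.

    The margin is an illustrative assumption: a value below 1 means stale
    packets are discarded before the buffer would overflow.
    """
    if not 0 < margin <= 1:
        raise ValueError("margin must be in (0, 1]")
    return margin * buffer_fill_time_us(buffer_bytes, bandwidth_gbit_s)


if __name__ == "__main__":
    # Hypothetical numbers: a 4096-byte buffer on a 10 Gbit/s link.
    print(buffer_fill_time_us(4096, 10.0))
    print(pick_threshold_us(4096, 10.0))
```

A larger buffer or a slower link tolerates a longer threshold; a smaller buffer or a faster link needs a shorter one, which matches the dependency the text describes.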
The invention also provides a fault self-isolation method for the multi-channel RapidIO distribution system, implemented with the system of the invention and comprising the following steps. After power-on, the management node configures the routing table of the RapidIO switch and binds each destination ID to a routing port. The management node enables the buffer data monitoring function, configures the data residence time threshold, and starts to scan the status registers of all output ports continuously in order to monitor the residence time of data in the buffers. When the residence time of data in a buffer exceeds the configured threshold, the port is triggered to discard packets automatically, which achieves automatic isolation of the faulty port. When an output port is detected to keep discarding packets, it is judged to be faulty, the routing table is modified immediately, and the data originally destined for that output port is routed to the backup output port, which achieves complete isolation of the fault.
The residence time threshold is set according to the data transmission bandwidth and the buffer size.
Beneficial effects:
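The method steps can be sketched as a small polling loop. This is a minimal simulation under stated assumptions, not driver code: `SimulatedSwitch`, `ManagementNode`, the per-port drop counters, and the `consecutive_limit` of three scans are all hypothetical stand-ins, since the text does not say how "keeps discarding packets" is quantified or which registers a real switch exposes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SimulatedSwitch:
    """Stand-in for a RapidIO switch (hypothetical, not a real driver API)."""
    routing_table: Dict[int, int] = field(default_factory=dict)  # destination ID -> port
    drop_counters: Dict[int, int] = field(default_factory=dict)  # port -> discarded packets
    timeout_us: Optional[float] = None


class ManagementNode:
    def __init__(self, switch: SimulatedSwitch, backup_port: int,
                 consecutive_limit: int = 3) -> None:
        self.switch = switch
        self.backup_port: Optional[int] = backup_port
        self.consecutive_limit = consecutive_limit  # assumed fault criterion
        self._last_seen: Dict[int, int] = {}
        self._streak: Dict[int, int] = {}

    def power_on(self, routes: Dict[int, int], timeout_us: float) -> None:
        """Bind destination IDs to ports and configure the residence threshold."""
        self.switch.routing_table = dict(routes)
        self.switch.timeout_us = timeout_us
        for port in set(routes.values()):
            self._last_seen[port] = 0
            self._streak[port] = 0

    def scan_once(self) -> List[int]:
        """Read every port's drop counter once; reroute ports that keep dropping."""
        rerouted = []
        for port in list(self._streak):
            count = self.switch.drop_counters.get(port, 0)
            # A growing counter means the port discarded packets since the last scan.
            self._streak[port] = self._streak[port] + 1 if count > self._last_seen[port] else 0
            self._last_seen[port] = count
            if self._streak[port] >= self.consecutive_limit and self.backup_port is not None:
                self._reroute(port)
                rerouted.append(port)
        return rerouted

    def _reroute(self, failed_port: int) -> None:
        """Point every destination ID on the failed port at the backup port."""
        for dest_id, port in list(self.switch.routing_table.items()):
            if port == failed_port:
                self.switch.routing_table[dest_id] = self.backup_port
        del self._streak[failed_port]  # the failed port is no longer monitored
        self.backup_port = None        # the single backup port is now in use
```

In this sketch the data source never appears: the switch discards stale packets on its own, and the management node alone decides when to reroute, which is the loose coupling the method aims for.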
1. In the invention, the RapidIO switch is configured so that it can detect a fault on an interconnection channel. When a fault occurs, the affected port automatically discards packets so that the ports of the other processing nodes continue to receive data normally, which achieves automatic isolation. In addition, when the management node of the RapidIO switch finds that a port is in an abnormal state, it modifies the routing table: data destined for the faulty port is routed to the backup output port and the faulty port is removed from the routing table. The processing capacity of the system is therefore preserved when a node fails, the coupling between the data source and the processing nodes is reduced, system reliability is improved, and the distribution logic of the data source is simplified.
2. In the data distribution system, the management node manages the whole data distribution architecture through the RapidIO switch. Faults are detected and ports are isolated automatically, and backup processing nodes are added to the distribution system in real time, all without involvement of the data source node. The data source and the processing nodes become loosely coupled, which improves system reliability and reduces the complexity of the distribution logic of the data source.
3. The method builds on the data distribution system described above. By managing the RapidIO switch, it enables the switch to detect a fault on an interconnection channel; when a fault occurs, the affected port automatically discards packets so that the ports of the other processing nodes continue to receive data normally, which achieves automatic isolation. In addition, when the management node of the RapidIO switch finds that a port is in an abnormal state, it modifies the routing table: data destined for the faulty port is routed to the backup output port and the faulty port is removed from the routing table. The management node thus manages the whole data distribution architecture, detects faults and isolates ports automatically, and adds backup processing nodes to the distribution system in real time without involvement of the data source node. The data source and the processing nodes become loosely coupled, which improves system reliability and reduces the complexity of the distribution logic of the data source.
Drawings
Fig. 1 is a schematic diagram of a RapidIO exchange distribution system of the present invention.
Fig. 2 is a schematic diagram of an edge computing data distribution system according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a multi-channel RapidIO distribution system based on a fault self-isolation and backup mechanism. When a node fails, the corresponding switch port isolates itself in real time without affecting the data of the other ports, which improves system reliability. The management node detects the faulty port promptly, modifies the route mapping, removes the faulty node, and adds the backup processing node, which preserves the processing capacity of the whole data distribution system.
As shown in fig. 1, the distribution system of the invention comprises a data source, a port zero, output ports, processing nodes, a backup output port, a backup processing node, and a management node. Each output port corresponds to one processing node, and the data source distributes data through port zero to each output port and from there to each processing node. Port zero and each output port have their own buffers, and the data in a buffer can be released only after the processing node has received it correctly and replied with an acknowledgment packet. When a processing node or port fails, the data held in the buffer of the corresponding output port cannot be released. Data routed from port zero can then no longer be buffered at that output port, the data in the buffer of port zero cannot be released either, and the distribution system eventually fails.
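The failure chain described above (a stalled output buffer blocks port zero, which in turn blocks the data source) can be illustrated with a toy tick-based simulation. Everything here is an assumption made for illustration: one packet per tick, a buffer capacity of two packets, and round-robin destinations. It is not a model of real RapidIO flow control.

```python
from collections import deque


def simulate(ticks, failed_port, timeout_ticks=None, n_ports=4, capacity=2):
    """Return (packets delivered per output port, packets refused at port zero)."""
    ingress = deque()                                     # port zero buffer: destination ports
    egress = {p: deque() for p in range(1, n_ports + 1)}  # per port: enqueue tick of each packet
    delivered = {p: 0 for p in egress}
    refused = 0
    for t in range(ticks):
        # 1. Healthy nodes acknowledge one packet per tick; the failed node never does.
        for port, queue in egress.items():
            if port != failed_port and queue:
                queue.popleft()
                delivered[port] += 1
        # 2. Optional residence timeout: discard packets that stayed too long.
        if timeout_ticks is not None:
            for queue in egress.values():
                while queue and t - queue[0] > timeout_ticks:
                    queue.popleft()
        # 3. Port zero forwards only its head packet, so a full target buffer stalls it.
        if ingress and len(egress[ingress[0]]) < capacity:
            egress[ingress.popleft()].append(t)
        # 4. The data source offers one packet per tick, round robin over the ports.
        if len(ingress) < capacity:
            ingress.append(1 + t % n_ports)
        else:
            refused += 1
    return delivered, refused


if __name__ == "__main__":
    print("no timeout:  ", simulate(100, failed_port=1))
    print("with timeout:", simulate(100, failed_port=1, timeout_ticks=3))
```

Without the timeout, deliveries to the healthy ports stop soon after port 1 fails, because the packet at the head of the port zero buffer can never move. With the timeout, stale packets on port 1 are discarded and the other ports keep receiving data.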
The management node configures the routing table of the RapidIO switch, that is, it binds the destination ID carried in each RapidIO data packet to an output port, which implements data routing from port zero to each output port. The management node monitors the residence time of data in the buffers; when the residence time of data in a buffer exceeds the residence time threshold, the port that owns the buffer is triggered to discard packets automatically, which achieves automatic isolation of the faulty port. The residence time threshold is set according to the data transmission bandwidth and the buffer size. At the same time, the management node continuously monitors the status registers of the output ports. When it detects that an output port keeps discarding packets, it judges that the output port is faulty, immediately modifies the routing table, and routes the data originally destined for that output port to the backup output port, which achieves complete isolation of the fault and preserves the processing capacity of the whole system.
The invention also provides a fault self-isolation method for the multi-channel RapidIO distribution system, implemented on the basis of the distribution system and comprising the following steps. After power-on, the management node configures the routing table of the RapidIO switch and binds each destination ID to a routing port. The management node enables the buffer data monitoring function, configures the data residence time threshold, and starts to scan the status registers of all output ports continuously in order to monitor the residence time of data in the buffers. When the residence time of data in a buffer exceeds the configured threshold, the port is triggered to discard packets automatically, which achieves automatic isolation of the faulty port. When an output port is detected to keep discarding packets, it is judged to be faulty, the routing table is modified immediately, and the data originally destined for that output port is routed to the backup output port, which achieves complete isolation of the fault and preserves the processing capacity of the whole system.
The invention can be applied to edge computing data distribution. A specific embodiment of an edge computing data distribution system is shown in fig. 2: the data source is a data acquisition board card, the switch is a CPS1848 RapidIO switch chip, the management node is a P2020 processor, and the processing nodes are 5 computing blades, of which 1 is reserved as a backup computing blade. The data rate is 12.4 Gbit/s, and the data must be distributed to 4 computing blades for simultaneous processing. The fault self-isolation method comprises the following steps:
the management node configures the routing table and binds each destination ID to a routing port;
the management node enables the buffer monitoring function, configures the maximum data residence time as 1.2 µs, and starts to scan the status registers of all ports continuously;
after receiving the handshake signals of all the computing blades, the data acquisition card starts to send data; in the normal state, data is sent to the 4 computing blades in a time-shared manner through the RapidIO switch chip, that is, the data of the first time slice is sent to the first computing blade, the second to the second computing blade, the third to the third computing blade, and the fourth to the fourth computing blade, and this cycle repeats continuously so that every blade receives data continuously. When the first computing blade fails, port one cannot receive an acknowledgment packet, and the data in the buffer of port one cannot be released and stays in the buffer; when the residence time exceeds 1.2 µs, port one automatically discards packets, which ensures that its buffer has space available to receive data routed from the internal port zero;
the management node continuously queries the status register of each port; when it finds that port one keeps discarding packets, it judges the port to be abnormal and starts modifying the routing table; in this embodiment, the data of port one is routed to port five, which completes the fault self-isolation.
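The numbers in this embodiment can be tied together with a short sketch. It computes how much data arrives at the stated 12.4 Gbit/s during one 1.2 µs residence window (12.4 Gbit/s × 1.2 µs = 14,880 bits, or 1,860 bytes) and shows the time-slice distribution before and after port one is rerouted to port five. The destination ID values and function names are hypothetical; the embodiment does not list them, and it does not state the buffer size of the switch ports.

```python
DATA_RATE_GBIT_S = 12.4   # data rate stated in the embodiment
MAX_RESIDENCE_US = 1.2    # configured maximum residence time
ACTIVE_BLADES = 4
BACKUP_PORT = 5


def bytes_per_residence_window(rate_gbit_s: float, window_us: float) -> float:
    """Bytes arriving at the full data rate during one residence window."""
    bits = rate_gbit_s * 1e3 * window_us  # 1 Gbit/s equals 1000 bits per microsecond
    return bits / 8


# Hypothetical destination IDs 1..4, each bound to the output port of the same number.
routing_table = {dest_id: dest_id for dest_id in range(1, ACTIVE_BLADES + 1)}


def port_for_slice(slice_index: int, table: dict) -> int:
    """Output port that receives the given time slice (round robin over 4 IDs)."""
    dest_id = 1 + slice_index % ACTIVE_BLADES
    return table[dest_id]


def fail_over(table: dict, failed_port: int, backup_port: int) -> dict:
    """Return a new table with every entry for failed_port pointed at backup_port."""
    return {d: (backup_port if p == failed_port else p) for d, p in table.items()}


if __name__ == "__main__":
    print(round(bytes_per_residence_window(DATA_RATE_GBIT_S, MAX_RESIDENCE_US)))
    print([port_for_slice(i, routing_table) for i in range(8)])
    rerouted = fail_over(routing_table, failed_port=1, backup_port=BACKUP_PORT)
    print([port_for_slice(i, rerouted) for i in range(8)])
```

Because only the routing table changes, the data source keeps sending the same destination IDs in the same order; the time slices that used to reach the first blade simply arrive at the backup blade on port five.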
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A multi-channel RapidIO distribution system based on a fault self-isolation and backup mechanism, characterized by comprising a data source, a port zero, output ports, processing nodes, a backup output port, a backup processing node and a management node; each output port corresponds to one processing node, and the data source distributes data through port zero to each output port and from there to each processing node; port zero and each output port have corresponding buffers, and the data in a buffer can be released only after the processing node has correctly received the data and replied with an acknowledgment packet; the management node configures the routing table of the RapidIO switch and monitors the residence time of data in the buffers, and when the residence time of data in a buffer is greater than a residence time threshold, the port corresponding to the buffer is triggered to discard packets automatically, so that automatic isolation of the faulty port is achieved; meanwhile, the management node continuously monitors the status registers of the output ports, judges that an output port is faulty when it detects that the output port keeps discarding packets, immediately modifies the routing table, and routes the data originally routed to that output port to the backup output port, so that complete isolation of the fault is achieved.
2. The system of claim 1, wherein the residence time threshold is set based on a data transfer bandwidth and a buffer size.
3. The system of claim 1 or 2, wherein the system is applied to edge computing data distribution.
4. A method for fault self-isolation of a multi-channel RapidIO distribution system, implemented using a system according to any one of claims 1 to 3, comprising the steps of: after power-on, the management node configures the routing table of the RapidIO switch and binds each destination ID to a routing port; the management node enables the buffer data monitoring function, configures a data residence time threshold, and starts to scan the status registers of all output ports continuously to monitor the residence time of data in the buffers; when the residence time of data in a buffer is greater than the configured threshold, the port is triggered to discard packets automatically, so that automatic isolation of the faulty port is achieved; when an output port is detected to keep discarding packets, the output port is judged to be faulty, the routing table is modified immediately, and the data originally routed to that output port is routed to the backup output port, so that complete isolation of the fault is achieved.
5. The method of claim 4, wherein the residence time threshold is set based on a data transmission bandwidth and a buffer size.
CN202211081103.6A 2022-09-06 2022-09-06 Multichannel RapidIO distribution system and fault self-isolation method thereof Active CN115150322B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211081103.6A CN115150322B (en) 2022-09-06 2022-09-06 Multichannel RapidIO distribution system and fault self-isolation method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211081103.6A CN115150322B (en) 2022-09-06 2022-09-06 Multichannel RapidIO distribution system and fault self-isolation method thereof

Publications (2)

Publication Number Publication Date
CN115150322A (en) 2022-10-04
CN115150322B (en) 2022-11-25

Family

ID=83416392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211081103.6A Active CN115150322B (en) 2022-09-06 2022-09-06 Multichannel RapidIO distribution system and fault self-isolation method thereof

Country Status (1)

Country Link
CN (1) CN115150322B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN202059375U (en) * 2011-05-20 2011-11-30 广州励丰声光科技有限公司 Device for automatic fault detection and hot standby of power amplifier
CN105281304A (en) * 2015-12-02 2016-01-27 国网上海市电力公司 Quick feeder fault positioning and isolating method
CN110704250A (en) * 2019-09-23 2020-01-17 天津津航计算技术研究所 Hot backup device of distributed system
CN110708245A (en) * 2019-09-29 2020-01-17 华南理工大学 SDN data plane fault monitoring and recovery method under multi-controller architecture
CN112511394A (en) * 2020-11-05 2021-03-16 中国航空工业集团公司西安航空计算技术研究所 Management and maintenance method of RapidIO bus system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7639001B2 (en) * 2006-01-17 2009-12-29 The Boeing Company Built-in test for high speed electrical networks
US9479434B2 (en) * 2013-07-19 2016-10-25 Fabric Embedded Tools Corporation Virtual destination identification for rapidio network elements
US10771369B2 (en) * 2017-03-20 2020-09-08 International Business Machines Corporation Analyzing performance and capacity of a complex storage environment for predicting expected incident of resource exhaustion on a data path of interest by analyzing maximum values of resource usage over time
US20220248296A1 (en) * 2021-04-23 2022-08-04 Intel Corporation Managing session continuity for edge services in multi-access environments

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A dynamic flow allocation method for the design of a software-defined real-time mesh network;Florian Greff 等;《2017 IEEE 13th International Workshop on Factory Communication Systems (WFCS)》;20170727;1-11 *
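Application of RapidIO in distributed airborne sensor systems (RapidIO在分布式机载传感器系统中的应用);Zhang Hongliang et al. (张洪亮 等);《电讯技术》;20220630;Vol. 62 (No. 6);734-741 *
Fault self-recovery algorithm for management nodes in cloud storage systems (云存储系统管理节点故障自恢复算法);Ma Weijun et al. (马玮骏等);《计算机系统应用》;20170215 (No. 02);114-119 *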

Also Published As

Publication number Publication date
CN115150322A (en) 2022-10-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant