CN105656773B - Highly reliable link fault-tolerant module and method for transient and intermittent faults in a network-on-chip - Google Patents

Highly reliable link fault-tolerant module and method for transient and intermittent faults in a network-on-chip

Info

Publication number
CN105656773B
Authority
CN
China
Prior art keywords
routing node
data
link
node
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610184999.9A
Other languages
Chinese (zh)
Other versions
CN105656773A (en)
Inventor
欧阳鸣
欧阳一鸣
孙成龙
蒋哲远
梁华国
黄正峰
姜兆能
安鑫
易茂祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN201610184999.9A
Publication of CN105656773A
Application granted
Publication of CN105656773B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/28 Routing or path finding of packets in data switching networks using route fault recovery
    • H04L45/60 Router architectures
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668 Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure

Abstract

The invention discloses a highly reliable link fault-tolerant module and method for transient and intermittent faults in a network-on-chip, characterized in that: a split ECC coding is used to detect in real time whether data in the network is in error, which allows transient faults and intermittent faults to be distinguished; using a retransmission buffer placed inside the router, when a transient fault on the link corrupts data that cannot be correctly corrected, the data backed up in the retransmission buffer is retransmitted; using a head flit backed up in the virtual channel, when an intermittent fault on the link corrupts data that cannot be correctly corrected, packet transmission is truncated, and a head flit or tail flit is added to the truncated data so that it can be rerouted or its resources released. At the cost of a small hardware overhead, the invention effectively improves network reliability and safeguards system performance when faults occur.

Description

Highly reliable link fault-tolerant module and method for transient and intermittent faults in a network-on-chip
Technical field
The invention belongs to the field of fault-tolerance techniques in integrated circuit design, and in particular relates to a highly reliable link fault-tolerant module and method for transient and intermittent faults in a network-on-chip.
Background art
With the development of semiconductor technology, the number of cores integrated on a single chip keeps growing. Compared with the traditional bus-based system-on-chip (System-on-Chip, SoC), the network-on-chip (Network-on-Chip, NoC) has been proposed as a new interconnect and communication architecture for multiprocessor systems-on-chip because of its high scalability, low latency and high bandwidth.
The main function of an NoC is to ensure that packets are delivered correctly and without loss from the source node to the destination node through the routers. Links, as the critical data paths connecting routers, play a key role. However, owing to soft errors, crosstalk between wires, temperature and aging, the reliability of link transmission faces great challenges. When a link fails, a router cannot perform its normal routing function even if the router itself is fault-free, and overall network performance drops sharply. Fault-tolerant design for links is therefore particularly important.
Faults occurring on a link can be divided into permanent faults, transient faults and intermittent faults. Once a permanent fault occurs on a link, it persists and does not disappear; it is easy to control and is generally tolerated by rerouting or hardware redundancy.
Transient faults occur randomly and without regularity; they are generally momentary and recoverable. About 80% of communication failures are transient faults. Fault tolerance for transient faults generally falls into two classes. The first class comprises fault-tolerant mechanisms based on random communication, such as flooding: through broadcast and diffusion, the destination node receives many redundant copies of a packet, which incurs a large power overhead. The second class comprises request-retransmission mechanisms based on error-detecting and error-correcting codes, mainly end-to-end (e2e) retransmission and hop-by-hop (switch-to-switch, s2s) retransmission. The e2e mechanism performs ECC encoding and decoding in the network interfaces of the sender and the receiver; it detects errors only at the destination node, so a retransmission doubles the latency. The s2s mechanism places a retransmission buffer in each router to hold transmitted data temporarily, but the ECC can cover only a single-bit error; multi-bit errors trigger retransmission, which also increases network latency.
Intermittent faults are caused by factors such as temperature and voltage; they occur intermittently, last for multiple clock cycles, and are hard to control. They can neither be resolved by retransmission nor be treated as permanent faults. When an intermittent fault occurs, the packet's transmission path is cut by the faulty link. Data that has already passed the faulty link lacks a tail flit to release the resources it occupies, and prolonged resource occupation causes network congestion and degrades network performance. Similarly, data that has not yet passed the faulty link lacks a head flit to guide its routing; its prolonged occupation of buffer resources causes network congestion and may even lead to deadlock. It is therefore necessary to tolerate both transient and intermittent faults.
Summary of the invention
To overcome the above shortcomings of the prior art, the invention provides a highly reliable link fault-tolerant module and method for transient and intermittent faults in a network-on-chip. It analyzes transient faults and intermittent faults separately and adds a corresponding fault-tolerant module for each, so that, at the cost of a small hardware overhead, network reliability is ensured and system performance is improved when transient or intermittent faults occur.
The technical solution adopted by the invention to solve the technical problem is as follows:
The highly reliable link fault-tolerant module of the invention for transient and intermittent faults in a network-on-chip is applied in a router composed of an input port module, a route computation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module. The input port module contains n virtual channels (VCs), a demultiplexer and a multiplexer. A packet transmitted over the link enters the n virtual channels through the demultiplexer of the input port module and is selected for transmission by the multiplexer;
The packet is divided into several flits for transmission. Following the order of the routing nodes along the packet's path, any routing node on the path is defined as the current routing node, its preceding routing node as the upstream node, and its next routing node as the downstream node. The current routing node is denoted the i-th routing node; the upstream node is then the (i-1)-th routing node and the downstream node the (i+1)-th routing node. The module is characterized in that:
A first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node. Each of the n virtual channels has a tri-state gate and a truncation recovery unit TRU. Each virtual channel and its corresponding truncation recovery unit TRU feed the multiplexer through a 2-to-1 multiplexer. A retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multiplexer. Together these constitute the highly reliable link fault-tolerant module;
When the i-th routing node receives over the link a packet encoded by the second error detection unit ECC2 of the (i-1)-th routing node, the first error detection unit ECC1 of the i-th routing node checks whether the data bits in the packet are in error. If there is no error, the packet enters the n virtual channels through the demultiplexer of the i-th input port module for transmission. If there is an error, ECC1 of the i-th routing node determines whether the error can be correctly corrected. If it can, the data is corrected automatically and then transmitted. Otherwise, ECC1 notifies the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node to retransmit the erroneous data, and the counters of the i-th and (i-1)-th routing nodes are each incremented by one; this indicates a transient fault on the link between the (i-1)-th and i-th routing nodes;
When a counter has been incremented consecutively until it reaches the fault threshold, an intermittent fault is indicated on the link between the (i-1)-th and i-th routing nodes. The 2-to-1 multiplexer of the i-th routing node then selects the truncation recovery unit TRU corresponding to the intermittently faulty link in the i-th routing node to release resources; the 2-to-1 multiplexer of the (i-1)-th routing node selects the truncation recovery unit TRU corresponding to the intermittently faulty link in the (i-1)-th routing node, which reselects a path and transmits to the crossbar switch.
The highly reliable link fault-tolerant module of the invention for transient and intermittent faults in a network-on-chip is further characterized in that:
The retransmission recovery unit RRU of the input port module in the (i-1)-th routing node comprises: a retransmission buffer with a capacity of two flits, a 2-to-1 multiplexer, a counter, an RRU controller and a VC tracking table. The VC tracking table stores the virtual channel IDs of the data held in the retransmission buffer;
When a transient fault exists on the link between the (i-1)-th and i-th routing nodes, the first error detection unit ECC1 of the i-th routing node sends a NACK signal to the RRU controller of the (i-1)-th routing node;
The RRU controller of the (i-1)-th routing node increments the counter by one and controls the 2-to-1 multiplexer to select the data in the two-flit retransmission buffer for retransmission;
When an intermittent fault exists on the link between the (i-1)-th and i-th routing nodes, the RRU controller of the (i-1)-th routing node sends an RX signal to the truncation recovery unit TRU of the (i-1)-th routing node to reselect a path; the first error detection unit ECC1 of the i-th routing node sends a TX signal to the truncation recovery unit TRU of the i-th routing node to release resources.
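The RRU behaviour described above can be illustrated with a minimal software sketch. It is not the hardware design of the invention: the class name, method names and returned signal strings are hypothetical, the threshold value 3 is taken from step 4 of the method, and the `on_delivered` hook (drop the oldest backup and clear the counter once data arrives correctly) is an assumption based on the embodiment.

```python
from collections import deque

FAULT_THRESHOLD = 3  # threshold value stated in step 4 of the method


class RetransmissionRecoveryUnit:
    """Hypothetical software model of the RRU in the upstream (i-1)-th node."""

    def __init__(self, threshold=FAULT_THRESHOLD):
        # Two-flit retransmission buffer; each entry pairs a flit with the
        # virtual channel ID that the VC tracking table would record for it.
        self.buffer = deque(maxlen=2)
        self.nack_count = 0
        self.threshold = threshold

    def send(self, vc_id, flit):
        """Back up a flit (and its VC ID) as it leaves for the link."""
        self.buffer.append((vc_id, flit))
        return flit

    def on_delivered(self):
        """Assumed hook: drop the oldest backup and clear the counter."""
        if self.buffer:
            self.buffer.popleft()
        self.nack_count = 0

    def on_nack(self):
        """Handle a NACK from the downstream ECC1.

        Returns ("RETRANSMIT", buffered entries) for a transient fault, or
        ("RX", []) once consecutive NACKs reach the threshold, meaning the
        RX signal should be sent to the local TRU instead.
        """
        self.nack_count += 1
        if self.nack_count >= self.threshold:
            return ("RX", [])
        return ("RETRANSMIT", list(self.buffer))
```

In this sketch a single NACK leads to retransmission of the buffered flits, while three NACKs in a row with no intervening `on_delivered` call yield the RX signal.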
The truncation recovery unit TRU in the i-th routing node comprises: a buffer with a capacity of one flit, a 2-to-1 multiplexer, a 2-way demultiplexer, a pseudo-head-flit modification path Head, a pseudo-tail-flit modification path Tail, and a TRU controller. The buffer stores the head flit of the packet;
When an intermittent fault exists on the link between the (i-1)-th and i-th routing nodes, the truncation recovery unit TRU of the i-th routing node receives the TX signal sent by the first error detection unit ECC1 of the i-th routing node and selects the pseudo-tail-flit modification path Tail to release resources;
The truncation recovery unit TRU of the (i-1)-th routing node receives the RX signal sent by the RRU controller of the (i-1)-th routing node and selects the pseudo-head-flit modification path Head to reroute;
After the packet has been transmitted, the TRU controller deletes the head flit stored in the buffer.
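As an illustration of the TRU, the following sketch models the backed-up head flit and the two modification paths. The `Flit` fields (`kind`, `packet_id`, `dest`, `payload`) and all method names are hypothetical; the patent text does not specify the flit format, so deriving the pseudo flits by copying the saved head flit and changing its type is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Flit:
    kind: str        # "head", "body" or "tail" (assumed encoding)
    packet_id: int   # hypothetical field
    dest: int        # hypothetical destination node ID
    payload: int = 0


class TruncationRecoveryUnit:
    """Hypothetical software model of one TRU attached to a virtual channel."""

    def __init__(self):
        self.saved_head: Optional[Flit] = None  # one-flit buffer

    def backup(self, flit):
        """Tri-state gate: only head flits are copied into the TRU buffer."""
        if flit.kind == "head":
            self.saved_head = flit

    def _require_head(self):
        if self.saved_head is None:
            raise RuntimeError("no head flit backed up")
        return self.saved_head

    def on_tx(self):
        """TX signal (downstream node): Tail path builds a pseudo tail flit
        so that resources held by the truncated packet can be released."""
        return replace(self._require_head(), kind="tail", payload=0)

    def on_rx(self):
        """RX signal (upstream node): Head path builds a pseudo head flit
        so that the remaining flits can be routed along a new path."""
        return replace(self._require_head(), kind="head", payload=0)

    def on_packet_done(self):
        """The TRU controller deletes the stored head flit after the packet."""
        self.saved_head = None
```

Keeping the destination in the saved head flit is what lets either pseudo flit be formed locally, without any further data from the faulty link.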
The highly reliable link fault-tolerance method of the invention for transient and intermittent faults in a network-on-chip is applied in a router composed of an input port module, a route computation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module. The input port module contains n virtual channels (VCs), a demultiplexer and a multiplexer. The method is characterized in that:
A first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node. Each of the n virtual channels has a tri-state gate and a truncation recovery unit TRU. Each virtual channel and its corresponding truncation recovery unit TRU feed the multiplexer through a 2-to-1 multiplexer. A retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multiplexer. Together these constitute the highly reliable link fault-tolerant module;
The highly reliable link fault-tolerance method proceeds as follows:
Step 1: when the i-th routing node receives over the link a packet encoded by the second error detection unit ECC2 of the (i-1)-th routing node, the first error detection unit ECC1 of the i-th routing node checks whether the data bits in the packet are in error. If there is no error, the packet enters the n virtual channels through the demultiplexer of the i-th input port module for transmission; if there is an error, step 2 is executed;
Step 2: the first error detection unit ECC1 of the i-th routing node determines whether the error can be correctly corrected. If it can, the data is corrected automatically and then transmitted; otherwise, step 3 is executed;
Step 3: the first error detection unit ECC1 of the i-th routing node notifies the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node to retransmit the erroneous data; at the same time, the counters of the i-th and (i-1)-th routing nodes are each incremented by one;
Step 4: the first error detection unit ECC1 of the i-th routing node and the retransmission recovery unit RRU of the (i-1)-th routing node each determine whether their own counter has been incremented consecutively and has reached the fault threshold of 3. If the counter has not reached the fault threshold through consecutive increments, a transient fault is indicated on the link between the (i-1)-th and i-th routing nodes, and step 1 is executed; otherwise, an intermittent fault is indicated on the link between the (i-1)-th and i-th routing nodes, and step 5 is executed;
Step 5: the 2-to-1 multiplexer of the i-th routing node selects the truncation recovery unit TRU corresponding to the intermittently faulty link in the i-th routing node to release resources; the 2-to-1 multiplexer of the (i-1)-th routing node selects the truncation recovery unit TRU corresponding to the intermittently faulty link in the (i-1)-th routing node, which reselects a path and transmits to the crossbar switch.
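Steps 1 to 5 can be summarized, from the point of view of the receiving ECC1, by the following sketch. The class and enum names are hypothetical, the ECC decoder is abstracted into two boolean inputs, and clearing the counter whenever a flit is accepted is an assumption that reflects the requirement that the increments be consecutive.

```python
from enum import Enum


class Action(Enum):
    FORWARD = "forward"                          # step 1: no error
    CORRECT_AND_FORWARD = "correct_and_forward"  # step 2: correctable error
    REQUEST_RETRANSMIT = "request_retransmit"    # steps 3-4: transient fault
    TRUNCATE = "truncate"                        # step 5: intermittent fault


class LinkMonitor:
    """Hypothetical model of the decision made by ECC1 for each flit."""

    def __init__(self, threshold=3):
        self.threshold = threshold      # fault threshold from step 4
        self.consecutive_failures = 0   # counter of uncorrectable errors

    def on_flit(self, has_error, correctable):
        if not has_error:
            self.consecutive_failures = 0
            return Action.FORWARD
        if correctable:
            self.consecutive_failures = 0
            return Action.CORRECT_AND_FORWARD
        # Uncorrectable error: count it and classify the fault.
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            return Action.TRUNCATE
        return Action.REQUEST_RETRANSMIT


monitor = LinkMonitor()
outcomes = [(True, False), (False, False),
            (True, False), (True, False), (True, False)]
print([monitor.on_flit(e, c).value for e, c in outcomes])
```

In this run the isolated error is handled by retransmission, and only the third uncorrectable error in a row is classified as an intermittent fault.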
The highly reliable link fault-tolerance method of the invention for transient and intermittent faults in a network-on-chip is further characterized in that the second error detection unit ECC2 encodes packets using interleaved coding, namely:
Any flit in the packet is divided evenly into m groups, each containing k data bits; the bits at the same position in each group are regrouped into a new group, forming k groups of data; the k groups are encoded separately to form a new flit.
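The regrouping step can be sketched as follows. This covers only the bit rearrangement: the per-group ECC encoder is not specified in this passage and is omitted, and the function names and the choice of a Python list of bits are illustrative assumptions.

```python
def interleave(bits, m):
    """Split `bits` into m groups of k bits, then regroup the bits that share
    a position, giving k groups of m bits each."""
    if m <= 0 or len(bits) % m != 0:
        raise ValueError("flit length must be a positive multiple of m")
    k = len(bits) // m
    groups = [bits[g * k:(g + 1) * k] for g in range(m)]
    return [[groups[g][p] for g in range(m)] for p in range(k)]


def deinterleave(regrouped):
    """Inverse of interleave: restore the original bit order."""
    k = len(regrouped)
    m = len(regrouped[0])
    return [regrouped[p][g] for g in range(m) for p in range(k)]


# Bit positions 0..7 stand in for data bits so the regrouping is visible.
flit = list(range(8))
coded_groups = interleave(flit, m=2)   # k = 4 groups of m = 2 bits
print(coded_groups)                    # [[0, 4], [1, 5], [2, 6], [3, 7]]
assert deinterleave(coded_groups) == flit
```

Original positions that are adjacent within a group (for example 1 and 2) end up in different code groups, so if each group is protected by its own code, a short burst of errors is spread over several groups rather than concentrated in one.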
Compared with the prior art, the beneficial effects of the invention are as follows:
1. The invention proposes a highly reliable link fault-tolerant module and method for transient and intermittent faults in a network-on-chip. Fault-tolerant modules are added at the input port to tolerate the transient faults and intermittent faults that may occur on the link. With a small hardware overhead, this overcomes the drawback of excessive hardware overhead for fault tolerance in the prior art, so that network reliability is effectively improved and network performance is safeguarded when faults occur.
2. The invention uses a split ECC coding strategy: ECC detection and encoding modules added at the router input port detect data errors occurring on the link. Because split ECC coding is used, errors in four different groups can be tolerated simultaneously, which maximizes the fault-tolerance capability of the ECC and markedly improves reliability during data transmission; compared with a conventional router mechanism, it effectively reduces network latency and improves network throughput.
3. The invention uses a fault-tolerance method for link transmission under transient faults: a retransmission buffer and the corresponding control logic and signals are added inside the RRU. When the ECC detects a data error that cannot be correctly corrected, the correct data backed up in the retransmission buffer is retransmitted to recover the erroneous data, so that transient faults can be tolerated.
4. The invention uses a truncation-and-retransmission fault-tolerance method for links with intermittent faults: the TRU contains the backed-up head flit of the packet, two data paths and the corresponding control logic signals. When the counter records that the number of consecutive data errors has reached the threshold of 3, an intermittent fault is deemed to exist on the link, and the packet's transmission path is cut by the faulty link. A pseudo tail flit is added to the data that has already passed the faulty link to release the resources occupied by the corresponding packet; a pseudo head flit is added to the data that has not passed the faulty link so that it can be rerouted along a newly selected path. This mitigates the impact of the faulty link on system performance and effectively safeguards system reliability with a low hardware overhead.
Brief description of the drawings
Fig. 1 is the overall architecture diagram of the router of the invention;
Fig. 2 is the internal structure and retransmission logic diagram of the RRU of the invention;
Fig. 3 is the internal structure and logic diagram of the TRU of the invention;
Fig. 4 is the diagram of fault type definitions and corresponding operations of the invention;
Fig. 5a is a schematic diagram of the fault-tolerance method of the invention when a transient fault occurs;
Fig. 5b is a schematic diagram of the fault-tolerance method of the invention when an intermittent fault occurs;
Fig. 6 is the data coding format diagram of the split ECC of the invention.
Detailed description of the embodiments
In this embodiment, the highly reliable link fault-tolerant module for transient and intermittent faults in a network-on-chip is applied in a router composed of an input port module, a route computation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module. As shown in Fig. 1, the router input port module contains n virtual channels (VCs), a demultiplexer and a multiplexer. A packet transmitted over the link enters the buffers of the n virtual channels through the demultiplexer of the input port module and is stored there; if the packet wins arbitration in the crossbar switch allocation module, it is selected by the multiplexer and transmitted to the crossbar switch, and from there to the downstream router;
The packet is divided into several flits for transmission. Following the order of the routing nodes along the packet's path, any routing node on the path is defined as the current routing node, its preceding routing node as the upstream node, and its next routing node as the downstream node. The current routing node is denoted the i-th routing node; the upstream node is then the (i-1)-th routing node and the downstream node the (i+1)-th routing node;
In this embodiment, a first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node to detect whether data transmitted over the link to the router input port is in error. Each of the n virtual channels has a tri-state gate and a truncation recovery unit TRU; the tri-state gate gates the head flit of the packet into the TRU for backup. Each virtual channel and its corresponding truncation recovery unit TRU feed the multiplexer through a 2-to-1 multiplexer. A retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multiplexer. Together these constitute the highly reliable link fault-tolerant module;
When the i-th routing node receives over the link a packet encoded by the second error detection unit ECC2 of the (i-1)-th routing node, the first error detection unit ECC1 of the i-th routing node checks whether the data bits in the packet are in error. If there is no error, the packet enters the n virtual channels through the demultiplexer of the i-th input port module for normal transmission. If there is an error, ECC1 of the i-th routing node determines whether the error can be correctly corrected. If it can, the data is corrected automatically by the ECC and then transmitted. Otherwise, ECC1 notifies the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node to retransmit the erroneous data, and the counters of the i-th and (i-1)-th routing nodes are each incremented by one; this indicates a transient fault on the link between the (i-1)-th and i-th routing nodes;
When a counter has been incremented consecutively until it reaches the fault threshold "3", an intermittent fault is indicated on the link between the (i-1)-th and i-th routing nodes. The 2-to-1 multiplexer of the i-th routing node then selects the truncation recovery unit TRU corresponding to the intermittently faulty link in the i-th routing node: because the packet in the i-th routing node lacks a tail flit to release the resources it occupies, resources are released by selecting the TRU, which relieves overall network congestion. The 2-to-1 multiplexer of the (i-1)-th routing node selects the truncation recovery unit TRU corresponding to the intermittently faulty link in the (i-1)-th routing node, which reselects a path and transmits to the crossbar switch, mitigating the impact of the fault on packet transmission.
In a specific implementation, the retransmission recovery unit RRU of the input port module in the (i-1)-th routing node is shown in Fig. 2 and comprises: a retransmission buffer with a capacity of two flits, a 2-to-1 multiplexer, a counter, an RRU controller and a VC tracking table. The VC tracking table stores the virtual channel IDs of the data held in the retransmission buffer. The RRU controller controls the gating output of the MUX and the sending of control signals; the counter counts the number of NACK signals that the RRU controller receives consecutively; the VC tracking table records the ID of the virtual channel in which the data held in the retransmission buffer originally resided. The ECC counter of the downstream router counts the number of consecutive times the ECC detects a data error that cannot be correctly corrected.
When a transient fault exists on the link between the (i-1)-th and i-th routing nodes, the first error detection unit ECC1 of the i-th routing node sends a NACK signal to the RRU controller of the (i-1)-th routing node;
The RRU controller of the (i-1)-th routing node increments the counter by one and controls the 2-to-1 multiplexer to select the data in the two-flit retransmission buffer for retransmission;
When an intermittent fault exists on the link between the (i-1)-th and i-th routing nodes, the RRU controller of the (i-1)-th routing node sends an RX signal to the truncation recovery unit TRU of the (i-1)-th routing node to reselect a path; the first error detection unit ECC1 of the i-th routing node sends a TX signal to the truncation recovery unit TRU of the i-th routing node to release resources.
In a specific implementation, the truncation recovery unit TRU in the i-th routing node is shown in Fig. 3 and comprises: a buffer with a capacity of one flit, a 2-to-1 multiplexer, a 2-way demultiplexer, a pseudo-head-flit modification path Head, a pseudo-tail-flit modification path Tail, and a TRU controller. The buffer stores the head flit of the packet;
When an intermittent fault exists on the link between the (i-1)-th and i-th routing nodes, the truncation recovery unit TRU of the i-th routing node receives the TX signal sent by the first error detection unit ECC1 of the i-th routing node and selects the pseudo-tail-flit modification path Tail to release resources;
The truncation recovery unit TRU of the (i-1)-th routing node receives the RX signal sent by the RRU controller of the (i-1)-th routing node and selects the pseudo-head-flit modification path Head to reroute;
After the packet has been transmitted, the TRU controller deletes the head flit stored in the buffer.
In this embodiment, a highly reliable link fault-tolerance method for transient and intermittent faults in a network-on-chip is applied in a router composed of an input port module, a route computation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module. The input port module contains n virtual channels (VCs), a demultiplexer and a multiplexer;
A first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node to detect whether data transmitted over the link to the router input port is in error. Each of the n virtual channels has a tri-state gate and a truncation recovery unit TRU; the tri-state gate gates the head flit of the packet into the TRU for backup. Each virtual channel and its corresponding truncation recovery unit TRU feed the multiplexer through a 2-to-1 multiplexer. A retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multiplexer. Together these constitute the highly reliable link fault-tolerant module;
The router pipeline comprises four stages: route computation, virtual channel allocation, crossbar switch allocation and crossbar switch traversal. While the first data flit is sent through the crossbar switch to the downstream router, is detected there as erroneous and uncorrectable, and the upstream router is notified to retransmit, the second data flit, having won crossbar switch arbitration, completes its crossbar switch traversal stage; the current router can therefore receive one more data flit before it receives the retransmitted data. The fault type definitions and corresponding operations for this case are shown in Fig. 4, where 1 indicates that a data error was detected and could not be corrected, and 0 indicates that the data was error-free or could be correctly corrected. When one or two flits are in error, as in cases 1, 2 and 3 of the table, the data can be retransmitted from the retransmission buffer to ensure normal transmission; the RRU controller sends a delete-retransmission-data signal to delete the correctly transmitted data from the retransmission buffer and clears the counter to 0. When three consecutive flits are transmitted in error and cannot be correctly corrected, as in case 4 of the table, an intermittent fault is deemed to exist on the link: the counter reaches the threshold of 3 and the ECC sends a Tx signal to the local TRU; the RRU controller uses its counter to count the number of Nack signals received consecutively, and when the threshold of 3 is reached the RRU controller sends an Rx signal to the TRU.
In a specific implementation, the highly reliable link fault-tolerance method uses a separated ECC coding scheme to detect in real time whether data in the network is erroneous, which allows transient faults and intermittent faults to be distinguished. It uses a retransmission buffer provided inside the router: when a transient fault on the link causes a data error that cannot be correctly corrected, the data backed up in the retransmission buffer is transmitted again. It uses the head flit backed up in the virtual channel: when an intermittent fault on the link causes a data error that cannot be correctly corrected, transmission of the data packet is truncated, and a head flit or tail flit is added to the truncated data so that it can be rerouted or its resources released. Specifically, the method proceeds as follows:
Step 1: when the i-th routing node receives over the link a data packet encoded by the second error detection unit ECC2 of the (i-1)-th routing node, the first error detection unit ECC1 of the i-th routing node checks whether the data bits in the packet are erroneous. If they are not, the packet enters the n virtual channels through the data distributor of the i-th input port module and is transmitted; if they are, step 2 is executed.
Step 2: the first error detection unit ECC1 of the i-th routing node determines whether the error can be correctly corrected. If it can, the data is corrected automatically and then transmitted; otherwise, step 3 is executed.
Step 3: the first error detection unit ECC1 of the i-th routing node informs the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node, which retransmits the erroneous data; at the same time the counters of the i-th routing node and the (i-1)-th routing node are each incremented by one.
Step 4: the first error detection unit ECC1 of the i-th routing node and the retransmission recovery unit RRU of the (i-1)-th routing node each determine whether their own counter has been incremented consecutively and has reached the fault threshold 3. If the consecutive increments have not reached the fault threshold, there is a transient fault in the link between the (i-1)-th routing node and the i-th routing node, and step 1 is executed; otherwise, there is an intermittent fault in the link between the (i-1)-th routing node and the i-th routing node, and step 5 is executed.
Step 5: the 2-to-1 multiplexer of the i-th routing node gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the i-th routing node to release resources; the 2-to-1 multiplexer of the (i-1)-th routing node gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the (i-1)-th routing node, which re-selects a route and transmits to the crossbar switch.
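Steps 1 to 5 can be read as a small decision procedure driven by the result of the ECC1 check. The sketch below is a hedged illustration only: `Check`, `LinkMonitor` and the action strings are hypothetical names, and clearing both counters after step 5 is an assumption, since the text does not say what happens to them once an intermittent fault has been declared.

```python
from enum import Enum

FAULT_THRESHOLD = 3  # consecutive uncorrectable errors (from the text)


class Check(Enum):
    CLEAN = 0          # no error detected
    CORRECTABLE = 1    # error detected and corrected by ECC1
    UNCORRECTABLE = 2  # error detected, cannot be corrected


class LinkMonitor:
    """Hypothetical model of one link between node i-1 and node i."""

    def __init__(self):
        self.downstream_count = 0  # counter beside ECC1 in node i
        self.upstream_count = 0    # counter inside the RRU of node i-1

    def _reset(self):
        self.downstream_count = 0
        self.upstream_count = 0

    def on_flit(self, check):
        if check is Check.CLEAN:            # step 1: pass into the VCs
            self._reset()
            return ["forward"]
        if check is Check.CORRECTABLE:      # step 2: correct, then pass on
            self._reset()
            return ["correct", "forward"]
        self.downstream_count += 1          # step 3: both counters add one
        self.upstream_count += 1
        if self.downstream_count < FAULT_THRESHOLD:
            return ["retransmit"]           # step 4: transient fault
        self._reset()                       # assumption: start afresh
        return ["release_downstream", "reroute_upstream"]  # step 5
```

Because any clean or corrected flit clears the counters, only three uncorrectable flits in a row reach step 5.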
The fault-tolerance method for transient faults and intermittent faults is now analysed and illustrated for faults occurring on a link. In the 4 × 4 mesh network of Fig. 5a, source node 9 sends a data packet to destination node 4; the solid black line in the figure is its routing path. When the data reaches node 11 and a data error is detected, the link has a transient fault, i.e. case ① of Fig. 4. The data is transmitted again from the retransmission buffer of the RRU in node 10, and if the retransmitted data is found to be fault-free, normal transmission is restored. In Fig. 5b, when data errors are detected three times consecutively at node 11, the link from node 10 to node 11 has an intermittent fault and operation ④ of Fig. 4 is performed: the TRU in node 11 modifies the head flit into a pseudo tail flit to release resources, the TRU in node 10 modifies the head flit into a pseudo head flit to reroute, and the data not yet transmitted follows the path rerouted by the pseudo head flit. The two parts of the data packet are reassembled into one data packet at the destination node.
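The truncation in Fig. 5b can be pictured with a packet modelled as a list of flits. The following is a rough sketch under stated assumptions: the `Flit` type, the `kind` labels and the `reassemble` rule are hypothetical, and the text does not describe how the destination actually merges the two parts. It also assumes that at least the head flit crossed the link before the fault was declared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flit:
    kind: str     # "head", "body", "tail", "pseudo_head" or "pseudo_tail"
    payload: str  # routing information for heads, data otherwise


def truncate(packet, delivered):
    """Split a packet after `delivered` flits have crossed the faulty link.

    Both TRUs hold a backup of the head flit, so each side can derive its
    pseudo flit from that backup.
    """
    head = packet[0]
    # Downstream node: close the partial packet so its resources are released.
    downstream = packet[:delivered] + [Flit("pseudo_tail", head.payload)]
    # Upstream node: open a new partial packet that can be routed again.
    upstream = [Flit("pseudo_head", head.payload)] + packet[delivered:]
    return downstream, upstream


def reassemble(downstream, upstream):
    # Assumed destination behaviour: drop the two pseudo flits and join.
    return downstream[:-1] + upstream[1:]
```

For a five-flit packet truncated after two flits, the downstream part carries three flits including the pseudo tail, the upstream part carries four including the pseudo head, and joining them restores the original five.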
The second error detection unit ECC2 encodes the data packet using interleaved coding, namely: any flit in the data packet is evenly divided into m groups of k data bits each; the bits at the same position in every group are recombined into a new group, forming k groups of data; the k groups of data are encoded separately to form a new flit.
The ECC data coding format is shown in Fig. 6. With interleaved coding, data bits of the packet whose index k gives the same value of [k % 4] belong to the same coding group, which forms 4 groups of data. Because consecutive erroneous bits then fall into different groups, the scheme can tolerate and correct errors in four consecutive data bits. Here the 128-bit data field is interleaved into four groups of 32 bits; each 32-bit group needs 6 check bits and can correct a 1-bit error, so data errors in 4 different groups can be corrected simultaneously.
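The grouping rule and the check-bit count can be verified with a short sketch. The function names are hypothetical, and the check-bit calculation assumes a Hamming-style single-error-correcting code (the bound 2^r ≥ k + r + 1), which the text does not name but which agrees with its figure of 6 check bits for 32 data bits.

```python
def interleave(bits, groups=4):
    """Put bit j into group j % groups, as in the [k % 4] rule."""
    return [bits[g::groups] for g in range(groups)]


def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def burst_is_correctable(start, length, groups=4):
    """A burst is correctable if no group receives more than one bad bit."""
    hits = [0] * groups
    for position in range(start, start + length):
        hits[position % groups] += 1
    return max(hits) <= 1


flit = list(range(128))          # bit indices stand in for the data bits
coding_groups = interleave(flit)
print(len(coding_groups), len(coding_groups[0]))  # 4 groups of 32 bits
print(sec_check_bits(32))                         # check bits per group
print(burst_is_correctable(10, 4))                # burst of four adjacent bits
```

A burst of four adjacent bits always touches each group once, so every group sees at most a single-bit error; a burst of five necessarily puts two errors in one group, which a single-error-correcting code cannot repair.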

Claims (5)

1. A highly reliable link fault-tolerant module for transient faults and intermittent faults in a network-on-chip, applied to a router composed of an input port module, a route computation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module; the input port module comprises n virtual channels VC, a data distributor and a multiplexer; a data packet transmitted over the link enters the n virtual channels VC through the data distributor of the input port module and is selected for transmission by the multiplexer;
the data packet is divided into several flits for transmission; according to the order of the routing nodes through which the data packet passes, any routing node on the path is defined as the current routing node, with the preceding routing node as its upstream node and the next routing node as its downstream node; the current routing node is denoted the i-th routing node, so that the upstream node is the (i-1)-th routing node and the downstream node is the (i+1)-th routing node; characterized in that:
a first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node; each of the n virtual channels has a tri-state gate and a truncation recovery unit TRU; each virtual channel and its corresponding truncation recovery unit TRU transmit the data packet to the multiplexer through a 2-to-1 multiplexer; a retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multiplexer; thereby constituting the highly reliable link fault-tolerant module;
when the i-th routing node receives over the link a data packet encoded by the second error detection unit ECC2 of the (i-1)-th routing node, the first error detection unit ECC1 of the i-th routing node checks whether the data bits in the data packet are erroneous; if they are not, the data packet enters the n virtual channels through the data distributor of the i-th input port module and is transmitted; if they are, the first error detection unit ECC1 of the i-th routing node determines whether the error can be correctly corrected; if it can, the data is corrected automatically and then transmitted; otherwise, the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node is informed and retransmits the erroneous data, and the counters of the i-th routing node and the (i-1)-th routing node are each incremented by one; this indicates a transient fault in the link between the (i-1)-th routing node and the i-th routing node;
when the counters are incremented consecutively and reach the fault threshold, this indicates an intermittent fault in the link between the (i-1)-th routing node and the i-th routing node; the 2-to-1 multiplexer of the i-th routing node then gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the i-th routing node to release resources; the 2-to-1 multiplexer of the (i-1)-th routing node gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the (i-1)-th routing node, which re-selects a route and transmits the data packet to the crossbar switch.
2. The highly reliable link fault-tolerant module for transient faults and intermittent faults in a network-on-chip according to claim 1, characterized in that the retransmission recovery unit RRU of the input port module in the (i-1)-th routing node comprises: a retransmission buffer with storage space for two flits, a 2-to-1 multiplexer, a counter, an RRU controller and a VC tracking table; the VC tracking table stores the virtual channel IDs of the data held in the retransmission buffer;
when there is a transient fault in the link between the (i-1)-th routing node and the i-th routing node, the first error detection unit ECC1 of the i-th routing node sends a NACK signal to the RRU controller of the (i-1)-th routing node;
the RRU controller of the (i-1)-th routing node causes the counter to be incremented by one, and causes the 2-to-1 multiplexer to gate the data in the two-flit retransmission buffer for retransmission;
when there is an intermittent fault in the link between the (i-1)-th routing node and the i-th routing node, the RRU controller of the (i-1)-th routing node sends an RX signal to the truncation recovery unit TRU of the (i-1)-th routing node in order to re-select a routing path; the first error detection unit ECC1 of the i-th routing node sends a TX signal to the truncation recovery unit TRU of the i-th routing node in order to release resources.
3. The highly reliable link fault-tolerant module for transient faults and intermittent faults in a network-on-chip according to claim 1, characterized in that the truncation recovery unit TRU in the i-th routing node comprises: a buffer with storage space for one flit, a 2-to-1 multiplexer, a 2-way data distributor, a pseudo head flit modification path Head, a pseudo tail flit modification path Tail, and a TRU controller; the head flit of the data packet is stored in the buffer;
when there is an intermittent fault in the link between the (i-1)-th routing node and the i-th routing node, the truncation recovery unit TRU of the i-th routing node receives the TX signal sent by the first error detection unit ECC1 of the i-th routing node and gates the pseudo tail flit modification path Tail to release resources;
the truncation recovery unit TRU of the (i-1)-th routing node receives the RX signal sent by the RRU controller of the (i-1)-th routing node and gates the pseudo head flit modification path Head to reroute;
after transmission of the data packet is completed, the TRU controller deletes the head flit stored in the buffer.
4. A highly reliable link fault-tolerance method for transient faults and intermittent faults in a network-on-chip, applied to a router composed of an input port module, a route computation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module; the input port module comprises n virtual channels VC, a data distributor and a multiplexer; a data packet is divided into several flits for transmission; according to the order of the routing nodes through which the data packet passes, any routing node on the path is defined as the current routing node, with the preceding routing node as its upstream node and the next routing node as its downstream node; the current routing node is denoted the i-th routing node, so that the upstream node is the (i-1)-th routing node and the downstream node is the (i+1)-th routing node; characterized in that:
a first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node; each of the n virtual channels has a tri-state gate and a truncation recovery unit TRU; each virtual channel and its corresponding truncation recovery unit TRU transmit the data packet to the multiplexer through a 2-to-1 multiplexer; a retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multiplexer; thereby constituting the highly reliable link fault-tolerant module;
the highly reliable link fault-tolerance method proceeds as follows:
Step 1: when the i-th routing node receives over the link a data packet encoded by the second error detection unit ECC2 of the (i-1)-th routing node, the first error detection unit ECC1 of the i-th routing node checks whether the data bits in the data packet are erroneous; if they are not, the data packet enters the n virtual channels through the data distributor of the i-th input port module and is transmitted; if they are, step 2 is executed;
Step 2: the first error detection unit ECC1 of the i-th routing node determines whether the error can be correctly corrected; if it can, the data is corrected automatically and then transmitted; otherwise, step 3 is executed;
Step 3: the first error detection unit ECC1 of the i-th routing node informs the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node, which retransmits the erroneous data; at the same time the counters of the i-th routing node and the (i-1)-th routing node are each incremented by one;
Step 4: the first error detection unit ECC1 of the i-th routing node and the retransmission recovery unit RRU of the (i-1)-th routing node each determine whether their own counter has been incremented consecutively and has reached the fault threshold 3; if the consecutive increments have not reached the fault threshold, there is a transient fault in the link between the (i-1)-th routing node and the i-th routing node, and step 1 is executed; otherwise, there is an intermittent fault in the link between the (i-1)-th routing node and the i-th routing node, and step 5 is executed;
Step 5: the 2-to-1 multiplexer of the i-th routing node gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the i-th routing node to release resources; the 2-to-1 multiplexer of the (i-1)-th routing node gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the (i-1)-th routing node, which re-selects a route and transmits the data packet to the crossbar switch.
5. The highly reliable link fault-tolerance method for transient faults and intermittent faults in a network-on-chip according to claim 4, characterized in that the second error detection unit ECC2 encodes the data packet using interleaved coding, namely:
any flit in the data packet is evenly divided into m groups of k data bits each; the bits at the same position in every group are recombined into a new group, forming k groups of data; the k groups of data are encoded separately to form a new flit.
CN201610184999.9A 2016-03-24 2016-03-24 Highly reliable link fault-tolerant module and method for transient faults and intermittent faults in a network-on-chip Expired - Fee Related CN105656773B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610184999.9A CN105656773B (en) 2016-03-24 2016-03-24 Highly reliable link fault-tolerant module and method for transient faults and intermittent faults in a network-on-chip

Publications (2)

Publication Number Publication Date
CN105656773A CN105656773A (en) 2016-06-08
CN105656773B true CN105656773B (en) 2018-10-02

Family

ID=56495756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610184999.9A Expired - Fee Related CN105656773B (en) 2016-03-24 2016-03-24 Highly reliable link fault-tolerant module and method for transient faults and intermittent faults in a network-on-chip

Country Status (1)

Country Link
CN (1) CN105656773B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181002