CN105656773B - Highly reliable link fault-tolerance module and method for transient and intermittent faults in a network-on-chip - Google Patents
- Publication number
- CN105656773B (application CN201610184999.9A)
- Authority
- CN
- China
- Prior art keywords
- routing node
- data
- link
- node
- fault
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/28—Routing or path finding of packets in data switching networks using route fault recovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0668—Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/60—Router architectures
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Multi Processors (AREA)
Abstract
The invention discloses a highly reliable link fault-tolerance module and method for transient and intermittent faults in a network-on-chip, characterized in that: a separate-type ECC code is used to detect in real time whether data is corrupted in the network, which makes it possible to distinguish transient faults from intermittent faults; a retransmission buffer inside the router is used so that, when a transient fault on the link corrupts data beyond correction, the data backed up in the retransmission buffer is retransmitted; and a head flit backed up in the virtual channel is used so that, when an intermittent fault on the link corrupts data beyond correction and packet transmission is truncated, a head flit or tail flit is re-attached to the truncated data, which is then re-routed or has its resources released. At a small hardware cost, the invention effectively improves network reliability and preserves system performance when faults occur.
Description
Technical field
The invention belongs to the field of fault-tolerant integrated circuit design, and in particular relates to a highly reliable link fault-tolerance module and method for transient and intermittent faults in a network-on-chip.
Background art
With the development of semiconductor technology, more and more cores are integrated on a single chip. Compared with the traditional bus-based system-on-chip (System-on-Chip, SoC), the network-on-chip (Network-on-Chip, NoC) has been proposed as a new interconnect architecture for multiprocessor systems-on-chip because of its high scalability, low latency and high bandwidth.
The main function of an NoC is to ensure that packets are delivered correctly and without loss from the source node to the destination node through the routers. Links, as the critical data paths between routers, play a crucial role. However, soft errors, crosstalk between wires, temperature and aging seriously challenge the reliability of link transmission. When a link fails, even a fault-free router cannot perform its normal routing function, and overall network performance drops sharply. Fault-tolerant link design is therefore particularly important.
Faults on a link can be classified as permanent, transient or intermittent. Once a permanent fault occurs, it persists and does not disappear; it is easy to control and is generally tolerated by re-routing or hardware redundancy.
Transient faults occur randomly and without regularity; they are generally momentary and recoverable. About 80% of communication faults are transient. Fault tolerance for transient faults generally falls into two classes. The first class is based on random communication, such as flooding: through broadcasting and spreading, the destination node receives many redundant copies of a packet, which incurs a large power overhead. The second class is request retransmission based on error-detecting and error-correcting codes, mainly end-to-end (e2e) retransmission and hop-to-hop (switch-to-switch, s2s) retransmission. The e2e mechanism performs ECC encoding and decoding in the network interfaces of the sender and the receiver; it checks for errors only at the destination node, so a retransmission doubles the latency. The s2s mechanism places a retransmission buffer in each router to hold transmitted data, but the ECC can cover only single-bit errors, so multi-bit errors trigger retransmission and also increase network latency.
Intermittent faults are caused by factors such as temperature and voltage; they appear intermittently, last for multiple clock cycles, and are hard to control. They can neither be resolved by retransmission nor be treated as permanent faults. When an intermittent fault occurs, the transmission path of a packet is cut by the faulty link. Data that has already passed the faulty link lacks a tail flit to release the resources it occupies, and the prolonged occupation causes network congestion and lowers network performance. Similarly, data that has not yet passed the faulty link lacks the routing guidance of a head flit, and its long-term occupation of buffer resources causes congestion and may even lead to deadlock. In summary, tolerating both transient and intermittent faults is necessary.
Summary of the invention
To overcome the above shortcomings of the prior art, the present invention provides a highly reliable link fault-tolerance module and method for transient and intermittent faults in a network-on-chip. Transient and intermittent faults are analysed separately and corresponding fault-tolerance modules are added, so that, at a small hardware cost, network reliability is guaranteed and system performance is improved when transient or intermittent faults occur.
The technical solution adopted by the invention is as follows:
The highly reliable link fault-tolerance module for transient and intermittent faults in a network-on-chip is applied to a router composed of an input port module, a route computation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module. The input port module contains n virtual channels (VCs), a multi-way data distributor (demultiplexer) and a multi-way data selector (multiplexer). A packet arriving over the link enters the n virtual channels through the data distributor of the input port module and is selected for transmission by the data selector.
The packet is divided into several flits for transmission. Following the order of the routing nodes along the packet's path, any routing node on the path is defined as the current routing node, with the preceding routing node as its upstream node and the next routing node as its downstream node. The current routing node is denoted the i-th routing node; the upstream node is then the (i-1)-th routing node and the downstream node is the (i+1)-th routing node. The module is characterized in that:
A first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node. Each of the n virtual channels has a tri-state gate and a truncation recovery unit TRU; each virtual channel and its corresponding TRU feed the multi-way data selector through a 2-to-1 data selector. A retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multi-way data selector. Together these form the highly reliable link fault-tolerance module.
When the i-th routing node receives over the link a packet that the (i-1)-th routing node has encoded with its second error detection unit ECC2, the first error detection unit ECC1 of the i-th routing node checks whether the data bits of the packet are in error. If they are not, the packet enters the n virtual channels through the data distributor of the i-th input port module for transmission. If they are, ECC1 of the i-th routing node determines whether the error can be corrected. If it can, the data is corrected automatically and then transmitted; otherwise, ECC1 notifies the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node, which retransmits the erroneous data, and the counters of the i-th and (i-1)-th routing nodes are each incremented by one. This indicates a transient fault on the link between the (i-1)-th and i-th routing nodes.
When a counter has been incremented consecutively until it reaches the fault threshold, an intermittent fault exists on the link between the (i-1)-th and i-th routing nodes. The 2-to-1 data selector of the i-th routing node then selects the TRU corresponding to the faulty link in the i-th routing node to release resources, and the 2-to-1 data selector of the (i-1)-th routing node selects the TRU corresponding to the faulty link in the (i-1)-th routing node to re-route the data and pass it to the crossbar switch.
The highly reliable link fault-tolerance module of the invention is further characterized in that:
The retransmission recovery unit RRU of the input port module in the (i-1)-th routing node comprises a retransmission buffer with a capacity of two flits, a 2-to-1 multiplexer, a counter, an RRU controller and a VC tracking table. The VC tracking table stores the virtual channel IDs of the data held in the retransmission buffer.
When a transient fault exists on the link between the (i-1)-th and i-th routing nodes, the first error detection unit ECC1 of the i-th routing node sends a NACK signal to the RRU controller of the (i-1)-th routing node.
The RRU controller of the (i-1)-th routing node increments the counter and directs the 2-to-1 multiplexer to select the data in the two-flit retransmission buffer for retransmission.
When an intermittent fault exists on the link between the (i-1)-th and i-th routing nodes, the RRU controller of the (i-1)-th routing node sends an RX signal to the TRU of the (i-1)-th routing node so that a new path is selected, and ECC1 of the i-th routing node sends a TX signal to the TRU of the i-th routing node so that resources are released.
The truncation recovery unit TRU in the i-th routing node comprises a buffer with a capacity of one flit, a 2-to-1 multiplexer, a 2-way data distributor, a pseudo head flit modification path Head, a pseudo tail flit modification path Tail, and a TRU controller. The buffer stores the head flit of the packet.
When an intermittent fault exists on the link between the (i-1)-th and i-th routing nodes, the TRU of the i-th routing node receives the TX signal sent by ECC1 of the i-th routing node and selects the pseudo tail flit modification path Tail to release resources.
The TRU of the (i-1)-th routing node receives the RX signal sent by the RRU controller of the (i-1)-th routing node and selects the pseudo head flit modification path Head to re-route the data.
After the packet has been transmitted, the TRU controller deletes the head flit stored in the buffer.
The highly reliable link fault-tolerance method for transient and intermittent faults in a network-on-chip is applied to a router composed of an input port module, a route computation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module. The input port module contains n virtual channels (VCs), a multi-way data distributor and a multi-way data selector. The method is characterized in that:
A first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node. Each of the n virtual channels has a tri-state gate and a truncation recovery unit TRU; each virtual channel and its corresponding TRU feed the multi-way data selector through a 2-to-1 data selector. A retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multi-way data selector. Together these form the highly reliable link fault-tolerance module.
The highly reliable link fault-tolerance method proceeds as follows:
Step 1: When the i-th routing node receives over the link a packet that the (i-1)-th routing node has encoded with its second error detection unit ECC2, the first error detection unit ECC1 of the i-th routing node checks whether the data bits of the packet are in error. If they are not, the packet enters the n virtual channels through the data distributor of the i-th input port module for transmission; if they are, go to step 2.
Step 2: ECC1 of the i-th routing node determines whether the error can be corrected. If it can, the data is corrected automatically and then transmitted; otherwise, go to step 3.
Step 3: ECC1 of the i-th routing node notifies the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node, which retransmits the erroneous data. At the same time, the counters of the i-th and (i-1)-th routing nodes are each incremented by one.
Step 4: ECC1 of the i-th routing node and the RRU of the (i-1)-th routing node each determine whether their counter has been incremented consecutively and has reached the fault threshold of 3. If not, a transient fault exists on the link between the (i-1)-th and i-th routing nodes, and the method returns to step 1. Otherwise, an intermittent fault exists on that link, and the method goes to step 5.
Step 5: The 2-to-1 data selector of the i-th routing node selects the TRU corresponding to the faulty link in the i-th routing node to release resources. The 2-to-1 data selector of the (i-1)-th routing node selects the TRU corresponding to the faulty link in the (i-1)-th routing node to re-route the data and pass it to the crossbar switch.
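Steps 1 to 5 can be sketched as a small per-link state machine. This is an illustrative Python model, not the patented hardware: the names `LinkMonitor`, `Action` and `on_flit` are hypothetical, and resetting the counter on any clean or corrected flit is an assumption drawn from the requirement that the increments be consecutive.

```python
from enum import Enum


class Action(Enum):
    FORWARD = "forward"                # step 1: no error detected
    CORRECT_AND_FORWARD = "correct"    # step 2: ECC corrects the error
    RETRANSMIT = "retransmit"          # steps 3-4: treated as a transient fault
    TRUNCATE = "truncate"              # step 5: treated as an intermittent fault


class LinkMonitor:
    """Hypothetical model of the ECC1-side decision for one link."""

    def __init__(self, threshold=3):
        self.threshold = threshold     # fault threshold of 3, as stated in step 4
        self.counter = 0               # consecutive uncorrectable errors seen so far

    def on_flit(self, has_error, correctable):
        if not has_error:
            self.counter = 0           # assumption: a clean flit breaks the run
            return Action.FORWARD
        if correctable:
            self.counter = 0           # assumption: a corrected flit also breaks the run
            return Action.CORRECT_AND_FORWARD
        self.counter += 1
        if self.counter >= self.threshold:
            return Action.TRUNCATE
        return Action.RETRANSMIT
```

Under these assumptions, three uncorrectable flits in a row yield RETRANSMIT, RETRANSMIT, TRUNCATE, while any clean flit in between restarts the count.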
The highly reliable link fault-tolerance method of the invention is further characterized in that the second error detection unit ECC2 encodes packets using interleaved coding, namely:
Each flit of the packet is divided evenly into m groups of k bits each. The bits at the same position in every group are recombined into a new group, giving k groups of data. The k groups are each encoded separately and together form a new flit.
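The regrouping described above can be illustrated with a short Python sketch. It covers only the bit interleaving, not the ECC encoding of each group. The function names and the example sizes (m = 8, k = 4, a 32-bit flit) are hypothetical, and the burst-error check assumes that physically adjacent wires carry adjacent bits of the original flit, which the text does not state.

```python
def interleave(flit_bits, m, k):
    """Split a flit into m groups of k bits, then regroup same-position bits.

    Returns k new groups, each holding m bits (one from every original group).
    """
    if len(flit_bits) != m * k:
        raise ValueError("flit length must equal m * k")
    groups = [flit_bits[g * k:(g + 1) * k] for g in range(m)]
    return [[groups[g][pos] for g in range(m)] for pos in range(k)]


def deinterleave(new_groups, m, k):
    """Inverse of interleave: rebuild the original bit order."""
    return [new_groups[pos][g] for g in range(m) for pos in range(k)]


def errors_per_group(error_positions, k):
    """Count how many erroneous original bit positions land in each new group."""
    counts = [0] * k
    for p in error_positions:
        counts[p % k] += 1   # original index p = g * k + pos, so pos = p % k
    return counts
```

With k = 4, a burst hitting four adjacent original bit positions places exactly one error in each of the four new groups, so a code that corrects one error per group would be enough to recover the flit under the stated assumption.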
Compared with the prior art, the beneficial effects of the invention are:
1. The invention proposes a highly reliable link fault-tolerance module and method for transient and intermittent faults in a network-on-chip. Fault-tolerance modules are added at the input port to tolerate the transient and intermittent faults that may occur on the link. This is done at a small hardware cost, overcoming the excessive hardware overhead of prior fault-tolerance schemes, so that network reliability is effectively improved and network performance is preserved when faults occur.
2. The invention uses a separate-type ECC coding strategy and adds ECC detection and encoding modules at the router input port, so that data errors occurring on the link can be detected. Because separate-type ECC coding is used, errors in four different groups can be tolerated simultaneously, which maximizes the fault-tolerance capability of the ECC and clearly improves reliability during data transmission. Compared with a conventional router mechanism, it effectively reduces network latency and increases network throughput.
3. The invention uses a link transmission fault-tolerance method for transient faults. A retransmission buffer and the corresponding control logic and signals are added inside the RRU. When the ECC detects a data error that cannot be corrected, the correct data backed up in the retransmission buffer is retransmitted, restoring the erroneous data and tolerating the transient fault.
4. The invention uses a truncation and retransmission fault-tolerance method for links with intermittent faults. The TRU holds a backup of the packet's head flit, two data paths and the corresponding control logic signals. When the counter records that the number of consecutive data errors has reached the threshold of 3, the link is considered to have an intermittent fault, and the packet's transmission path is cut by the faulty link. A pseudo tail flit is added to the data that has already passed the faulty link, releasing the resources occupied by the packet. A pseudo head flit is added to the data that has not passed the faulty link, which is then re-routed over a newly selected path. This mitigates the impact of the faulty link on system performance and effectively ensures system reliability at a low hardware cost.
Description of the drawings
Fig. 1 is the overall architecture diagram of the router in the present invention;
Fig. 2 is the internal structure and retransmission logic diagram of the RRU in the present invention;
Fig. 3 is the internal structure and logic diagram of the TRU in the present invention;
Fig. 4 shows the fault type definitions and the corresponding operations in the present invention;
Fig. 5a is a schematic diagram of the fault-tolerance scheme when a transient fault occurs in the present invention;
Fig. 5b is a schematic diagram of the fault-tolerance scheme when an intermittent fault occurs in the present invention;
Fig. 6 shows the separate-type ECC data encoding format in the present invention.
Detailed description of the embodiments
In this embodiment, the highly reliable link fault-tolerance module for transient and intermittent faults in a network-on-chip is applied to a router composed of an input port module, a route computation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module. As shown in Fig. 1, the router input port module contains n virtual channels (VCs), a multi-way data distributor and a multi-way data selector. A packet arriving over the link passes through the data distributor of the input port module into the buffers of the n virtual channels, where it is stored. If the packet wins arbitration in the crossbar switch allocation module, it is selected by the data selector and passed to the crossbar switch, and from there to the downstream router.
A packet is divided into several flits for transmission. Following the order of the routing nodes along the packet's path, any routing node on the path is defined as the current routing node, with the preceding routing node as its upstream node and the next routing node as its downstream node. The current routing node is denoted the i-th routing node; the upstream node is then the (i-1)-th routing node and the downstream node is the (i+1)-th routing node.
In this embodiment, a first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node to detect whether data arriving at the router input port over the link is in error. Each of the n virtual channels has a tri-state gate and a truncation recovery unit TRU; the tri-state gate passes the packet's head flit into the TRU, where it is backed up. Each virtual channel and its corresponding TRU feed the multi-way data selector through a 2-to-1 data selector. A retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multi-way data selector. Together these form the highly reliable link fault-tolerance module.
When the i-th routing node receives over the link a packet that the (i-1)-th routing node has encoded with its second error detection unit ECC2, the first error detection unit ECC1 of the i-th routing node checks whether the data bits of the packet are in error. If they are not, the packet enters the n virtual channels through the data distributor of the i-th input port module for normal transmission. If they are, ECC1 of the i-th routing node determines whether the error can be corrected. If it can, the data is corrected automatically by the ECC and then transmitted; otherwise, ECC1 notifies the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node, which retransmits the erroneous data, and the counters of the i-th and (i-1)-th routing nodes are each incremented by one. This indicates a transient fault on the link between the (i-1)-th and i-th routing nodes.
When a counter has been incremented consecutively until it reaches the fault threshold of 3, an intermittent fault exists on the link between the (i-1)-th and i-th routing nodes. The 2-to-1 data selector of the i-th routing node then selects the TRU corresponding to the faulty link in the i-th routing node. Because the packet in the i-th routing node lacks a tail flit to release the resources it occupies, selecting the TRU releases those resources and relieves congestion in the network as a whole. The 2-to-1 data selector of the (i-1)-th routing node selects the TRU corresponding to the faulty link in the (i-1)-th routing node, which re-routes the data and passes it to the crossbar switch, reducing the impact of the fault on packet transmission.
In a specific implementation, the retransmission recovery unit RRU of the input port module in the (i-1)-th routing node is shown in Fig. 2 and comprises a retransmission buffer with a capacity of two flits, a 2-to-1 multiplexer, a counter, an RRU controller and a VC tracking table. The VC tracking table stores the virtual channel IDs of the data held in the retransmission buffer. The RRU controller controls the select output of the MUX and the sending of control signals. The counter counts the number of NACK signals that the RRU controller receives consecutively. The VC tracking table records the ID of the virtual channel from which each item of data in the retransmission buffer originated. The ECC counter in the downstream router counts the number of consecutive times the ECC detects a data error that cannot be corrected.
When a transient fault exists on the link between the (i-1)-th and i-th routing nodes, the first error detection unit ECC1 of the i-th routing node sends a NACK signal to the RRU controller of the (i-1)-th routing node.
The RRU controller of the (i-1)-th routing node increments the counter and directs the 2-to-1 multiplexer to select the data in the two-flit retransmission buffer for retransmission.
When an intermittent fault exists on the link between the (i-1)-th and i-th routing nodes, the RRU controller of the (i-1)-th routing node sends an RX signal to the TRU of the (i-1)-th routing node so that a new path is selected, and ECC1 of the i-th routing node sends a TX signal to the TRU of the i-th routing node so that resources are released.
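The RRU behaviour described above can be modelled roughly as follows. This is a hedged Python sketch rather than the circuit of Fig. 2: the class and method names are hypothetical, retransmitting the oldest buffered flit is an assumption, and `on_success` stands in for whatever signal tells the RRU that data was delivered correctly.

```python
from collections import deque


class RetransmissionRecoveryUnit:
    """Hypothetical software model of the upstream RRU."""

    def __init__(self, threshold=3):
        self.buffer = deque(maxlen=2)  # two-flit retransmission buffer of (flit, vc_id)
        self.nack_count = 0            # consecutive NACK signals received
        self.threshold = threshold     # fault threshold of 3 from the text

    def send(self, flit, vc_id):
        """Back up an outgoing flit together with its virtual channel ID."""
        self.buffer.append((flit, vc_id))
        return flit

    def on_success(self):
        """Assumed delivery confirmation: drop the oldest backup, clear the counter."""
        if self.buffer:
            self.buffer.popleft()
        self.nack_count = 0

    def on_nack(self):
        """Handle a NACK from the downstream ECC1."""
        if not self.buffer:
            raise RuntimeError("NACK received with an empty retransmission buffer")
        self.nack_count += 1
        flit, vc_id = self.buffer[0]   # assumption: the oldest backup is the failed flit
        if self.nack_count >= self.threshold:
            return ("RX", vc_id)       # intermittent fault: tell the local TRU to re-route
        return ("RETRANSMIT", flit, vc_id)
```

The pair stored in the buffer plays the role of the VC tracking table, since it keeps the virtual channel ID next to each backed-up flit.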
In a specific implementation, the truncation recovery unit TRU in the i-th routing node is shown in Fig. 3 and comprises a buffer with a capacity of one flit, a 2-to-1 multiplexer, a 2-way data distributor, a pseudo head flit modification path Head, a pseudo tail flit modification path Tail, and a TRU controller. The buffer stores the head flit of the packet.
When an intermittent fault exists on the link between the (i-1)-th and i-th routing nodes, the TRU of the i-th routing node receives the TX signal sent by ECC1 of the i-th routing node and selects the pseudo tail flit modification path Tail to release resources.
The TRU of the (i-1)-th routing node receives the RX signal sent by the RRU controller of the (i-1)-th routing node and selects the pseudo head flit modification path Head to re-route the data.
After the packet has been transmitted, the TRU controller deletes the head flit stored in the buffer.
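A minimal Python sketch of the TRU idea follows, assuming a simple flit record. The `Flit` fields, the class name and the method names are all hypothetical, and the way the pseudo tail flit is derived from the stored head flit is an assumption; the text specifies only that the head flit is backed up and that pseudo head and tail flits are produced on the RX and TX signals.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Flit:
    kind: str                  # "head", "body" or "tail"
    packet_id: int
    dest: Optional[int] = None # routing information carried by the head flit
    payload: int = 0


class TruncationRecoveryUnit:
    """Hypothetical model of one TRU attached to a virtual channel."""

    def __init__(self):
        self.saved_head = None # one-flit buffer holding the backed-up head flit

    def observe(self, flit):
        """Back up head flits; forget them once the packet's tail has passed."""
        if flit.kind == "head":
            self.saved_head = flit
        elif flit.kind == "tail":
            self.saved_head = None

    def on_tx(self):
        """Downstream side: emit a pseudo tail flit so held resources are released."""
        if self.saved_head is None:
            raise RuntimeError("no head flit backed up")
        tail = replace(self.saved_head, kind="tail", payload=0)
        self.saved_head = None # the truncated packet is now closed
        return tail

    def on_rx(self):
        """Upstream side: emit a pseudo head flit so remaining flits can be re-routed."""
        if self.saved_head is None:
            raise RuntimeError("no head flit backed up")
        return self.saved_head
```

Reusing the stored head flit for both cases keeps the sketch small: the pseudo tail inherits the packet identity, and the pseudo head carries the destination needed for a fresh route computation.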
In this embodiment, the highly reliable link fault-tolerance method for transient and intermittent faults in a network-on-chip is likewise applied to a router composed of an input port module, a route computation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module. The input port module contains n virtual channels (VCs), a multi-way data distributor and a multi-way data selector.
A first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node to detect whether data arriving at the router input port over the link is in error. Each of the n virtual channels has a tri-state gate and a truncation recovery unit TRU; the tri-state gate passes the packet's head flit into the TRU, where it is backed up. Each virtual channel and its corresponding TRU feed the multi-way data selector through a 2-to-1 data selector. A retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multi-way data selector. Together these form the highly reliable link fault-tolerance module.
The router pipeline comprises four stages: route computation, virtual channel allocation, crossbar switch allocation and crossbar switch traversal. Suppose the first data flit is passed through the crossbar switch to the downstream router, which detects an error that cannot be corrected and notifies the upstream router to retransmit. In the meantime, the second data flit, having won crossbar switch arbitration, completes the crossbar switch traversal stage. The current router may therefore receive another flit before it receives the retransmitted data. For this reason, the fault types and the corresponding operations are defined as shown in Fig. 4, where 1 indicates that a data error was detected and could not be corrected, and 0 indicates that the data was error-free or was corrected. When one or two flits are in error, as in cases ①, ② and ③ of the table, the data is retransmitted from the retransmission buffer to ensure normal transmission; the RRU controller sends a delete signal to remove the correctly transmitted data from the retransmission buffer and clears the counter to 0. When three consecutive flits are transmitted in error and cannot be corrected, as in case ④ of the table, the link is considered to have an intermittent fault: the counter reaches the threshold of 3 and the ECC sends a TX signal to the local TRU. The RRU controller uses its counter to count the NACK signals received consecutively and, when the threshold of 3 is reached, sends an RX signal to the TRU.
In a specific implementation, the highly reliable link fault-tolerance method uses a separate-type ECC code to detect in real time whether data is corrupted in the network, thereby distinguishing transient faults from intermittent faults. The retransmission buffer inside the router is used so that, when a transient fault on the link corrupts data beyond correction, the data backed up in the retransmission buffer is retransmitted. The head flit backed up in the virtual channel is used so that, when an intermittent fault on the link corrupts data beyond correction and packet transmission is truncated, a head flit or tail flit is re-attached to the truncated data, which is then re-routed or has its resources released. Specifically, the method proceeds as follows:
Step 1: When the i-th routing node receives over the link a data packet that the (i-1)-th routing node has encoded with its second error detection unit ECC2, the first error detection unit ECC1 of the i-th routing node checks whether the data bits in the packet are in error. If there is no error, the packet enters the n virtual channels through the data distributor of the i-th input port module and is transmitted; if there is an error, step 2 is executed.
Step 2: The first error detection unit ECC1 of the i-th routing node determines whether the error can be corrected. If it can, the data are corrected automatically and then transmitted; otherwise, step 3 is executed.
Step 3: The first error detection unit ECC1 of the i-th routing node notifies the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node to retransmit the erroneous data; at the same time, the counters of the i-th routing node and the (i-1)-th routing node are each incremented by one.
Step 4: The first error detection unit ECC1 of the i-th routing node and the retransmission recovery unit RRU of the (i-1)-th routing node each determine whether their own counter has been incremented consecutively and has reached the fault threshold of 3. If the consecutive increments have not reached the fault threshold, a transient fault exists in the link between the (i-1)-th routing node and the i-th routing node, and step 1 is executed; otherwise, an intermittent fault exists in the link between the (i-1)-th routing node and the i-th routing node, and step 5 is executed.
Step 5: The 2-to-1 multiplexer of the i-th routing node gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the i-th routing node, which releases resources; the 2-to-1 multiplexer of the (i-1)-th routing node gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the (i-1)-th routing node, which reroutes the data and transmits it to the crossbar switch.
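As a rough illustration of steps 1 to 5 from the point of view of the i-th routing node, the decision for each received flit can be written as a small function. This is a sketch under assumptions: the enum `Check`, the function `handle_flit` and the action strings are hypothetical names, and resetting the counter on a clean or corrected flit is an assumed reading of "incremented consecutively". The threshold of 3 is taken from step 4.

```python
from enum import Enum

FAULT_THRESHOLD = 3  # fault threshold given in step 4


class Check(Enum):
    CLEAN = 0          # ECC1 finds no error
    CORRECTABLE = 1    # ECC1 finds an error it can correct
    UNCORRECTABLE = 2  # ECC1 finds an error it cannot correct


def handle_flit(check, counter):
    """Apply steps 1-5 to one received flit.

    Returns (action, new_counter). Action names are illustrative only.
    """
    if check is Check.CLEAN:
        # Step 1: no error, pass the flit on to the virtual channels.
        return "forward_to_vc", 0
    if check is Check.CORRECTABLE:
        # Step 2: correct automatically, then transmit.
        return "correct_then_forward", 0
    # Step 3: request retransmission and increment the counter.
    counter += 1
    if counter < FAULT_THRESHOLD:
        # Step 4, first branch: treated as a transient fault.
        return "nack_retransmit", counter
    # Step 4, second branch, and step 5: intermittent fault, so the local TRU
    # releases resources while the upstream TRU reroutes.
    return "truncate_release_and_reroute", counter
```

Feeding three uncorrectable results in a row walks the counter from 1 to 3 and ends in the step 5 action, while any clean flit in between returns the counter to 0.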
For the fault conditions that can occur on a link, the fault-tolerance methods for transient and intermittent faults are analysed and illustrated below. In the 4 × 4 mesh network of Fig. 5a, source node 9 sends a data packet to destination node 4; the solid black line in the figure is its routing path. When the data reach node 11 and an error is detected, the link has a transient fault, that is, case 1 of Fig. 4: the data are sent again from the retransmission buffer of the RRU in node 10, and if the retransmitted data are found to be fault-free, normal transmission resumes. In Fig. 5b, when node 11 detects erroneous data three times in a row, the link from node 10 to node 11 has an intermittent fault and operation 4 of Fig. 4 is carried out: the TRU in node 11 modifies the head flit into a pseudo tail flit to release resources, the TRU in node 10 modifies the head flit into a pseudo head flit and reroutes it, the data not yet transmitted follow the path on which the pseudo head flit is rerouted, and the two parts of the data packet are reassembled into one data packet at the destination node.
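The truncation and reassembly described in this example can be sketched with a toy packet model. The dictionary layout of a flit, the field names `kind`, `pkt` and `dest`, and the functions `truncate` and `reassemble` are hypothetical; the sketch only shows that closing the first part with a pseudo tail flit and opening the second part with a pseudo head flit lets the destination rebuild the original packet.

```python
def truncate(packet, sent):
    """Split a packet after `sent` flits have crossed the faulty link.

    `packet` is a list of flit dicts whose first element is the head flit.
    Requires 1 <= sent < len(packet).
    """
    if not 1 <= sent < len(packet):
        raise ValueError("sent must leave flits on both sides of the cut")
    head = packet[0]
    # Downstream side: a pseudo tail derived from the backed-up head flit
    # closes the partial packet so its resources can be released.
    pseudo_tail = {"kind": "pseudo_tail", "pkt": head["pkt"]}
    # Upstream side: a pseudo head carrying the same destination lets the
    # remaining flits be routed along a new path.
    pseudo_head = {"kind": "pseudo_head", "pkt": head["pkt"], "dest": head["dest"]}
    first_part = packet[:sent] + [pseudo_tail]
    second_part = [pseudo_head] + packet[sent:]
    return first_part, second_part


def reassemble(first_part, second_part):
    """Rebuild the original packet at the destination node."""
    if first_part[-1]["kind"] != "pseudo_tail" or second_part[0]["kind"] != "pseudo_head":
        raise ValueError("parts are not a truncated pair")
    if first_part[-1]["pkt"] != second_part[0]["pkt"]:
        raise ValueError("parts belong to different packets")
    # Drop the two pseudo flits and join the payload in order.
    return first_part[:-1] + second_part[1:]
```

For a five-flit packet cut after two flits, the first part carries the head, one body flit and the pseudo tail, and the second part carries the pseudo head, the remaining body flits and the real tail.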
The second error detection unit ECC2 encodes the data packet using interleaved coding. That is, any one flit in the data packet is divided evenly into m groups, each containing k data bits; the bits at the same position in every group are regrouped into one new group, forming k groups of data; the k groups of data are encoded separately to form a new flit.
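The regrouping step of the interleaved coding can be illustrated with a short function. This is a sketch only: the name `interleave` and the list-of-bits representation are assumptions, and the per-group encoding that follows regrouping is left out.

```python
def interleave(bits, m):
    """Regroup a flit's bits for interleaved coding.

    The flit is split into m consecutive groups of k bits each; the bits that
    occupy the same position in every group are collected into one new group,
    giving k new groups of m bits.
    """
    k, remainder = divmod(len(bits), m)
    if remainder:
        raise ValueError("flit length must be divisible by m")
    groups = [bits[g * k:(g + 1) * k] for g in range(m)]
    return [[groups[g][p] for g in range(m)] for p in range(k)]
```

With m consecutive groups of k bits, neighbouring bits of the original flit always land in different new groups, which is what allows a short burst of adjacent bit errors to be spread across separately encoded groups.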
The ECC data coding format is shown in Fig. 6. With interleaved coding, for the k-th data bit in the data packet, bits with the same value of [k%4] belong to the same coding group, which forms 4 groups of data. Because adjacent erroneous bits fall into different groups, the code can tolerate and correct four consecutive erroneous data bits at the same time. Here the 128-bit data are interleaved into four groups of 32-bit data; each 32-bit group needs 6 check bits and can correct a 1-bit error, so errors in 4 different groups can be corrected simultaneously.
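The numbers in this paragraph can be checked with a small sketch. It assumes that each group uses a Hamming-type single-error-correcting code, for which the smallest number of check bits r satisfies 2^r >= data bits + r + 1; the text gives the figure of 6 check bits but does not name the code. The function names are hypothetical.

```python
GROUPS = 4  # number of coding groups given in the text


def group_of(bit_index):
    """Coding group of a data bit: bits with equal index % 4 share a group."""
    return bit_index % GROUPS


def sec_check_bits(data_bits):
    """Check bits needed by a Hamming-type single-error-correcting code.

    Assumption: smallest r with 2**r >= data_bits + r + 1.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def errors_per_group(error_positions):
    """Count how many erroneous bits fall into each coding group."""
    counts = [0] * GROUPS
    for position in error_positions:
        counts[group_of(position)] += 1
    return counts


def correctable(error_positions):
    """True when no group holds more than the one error it can correct."""
    return max(errors_per_group(error_positions)) <= 1
```

Under that assumption, `sec_check_bits(32)` gives 6, matching the text, and any burst of four consecutive bit positions places exactly one error in each group, whereas two errors four positions apart share a group and exceed its correction capability.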
Claims (5)
1. A highly reliable link fault-tolerant module for transient faults and intermittent faults in a network-on-chip, applied in a router composed of an input port module, a routing calculation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module; the input port module comprises n virtual channels VC, a multi-way data distributor and a multi-way data selector; a data packet transmitted over the link enters the n virtual channels VC through the data distributor of the input port module and is selected for transmission by the data selector;
the data packet is divided into several flits for transmission, and, following the order of the routing nodes along the path of the data packet, any routing node on the path is defined as the current routing node, for which the preceding routing node is the upstream node and the next routing node is the downstream node; the current routing node is denoted the i-th routing node; the upstream node is then the (i-1)-th routing node and the downstream node is the (i+1)-th routing node; characterized in that:
a first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node; a tri-state gate and a truncation recovery unit TRU are provided on each of the n virtual channels; each virtual channel and its corresponding truncation recovery unit TRU transmit the data packet to the multi-way data selector through a 2-to-1 multiplexer; a retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multi-way data selector; thereby constituting the highly reliable link fault-tolerant module;
when the i-th routing node receives over the link a data packet that the (i-1)-th routing node has encoded with its second error detection unit ECC2, the first error detection unit ECC1 of the i-th routing node detects whether the data bits in the data packet are in error; if there is no error, the data packet enters the n virtual channels through the data distributor of the i-th input port module and is transmitted; if there is an error, the first error detection unit ECC1 of the i-th routing node determines whether the error can be corrected; if it can, the data are corrected automatically and then transmitted; otherwise, the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node is notified to retransmit the erroneous data, and at the same time the counters of the i-th routing node and the (i-1)-th routing node are each incremented by one; this indicates that a transient fault exists in the link between the (i-1)-th routing node and the i-th routing node;
when the counter has been incremented consecutively until it reaches the fault threshold, this indicates that an intermittent fault exists in the link between the (i-1)-th routing node and the i-th routing node; the 2-to-1 multiplexer of the i-th routing node then gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the i-th routing node to release resources; the 2-to-1 multiplexer of the (i-1)-th routing node gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the (i-1)-th routing node to reroute and transmit the data packet to the crossbar switch.
2. The highly reliable link fault-tolerant module for transient faults and intermittent faults in a network-on-chip according to claim 1, characterized in that the retransmission recovery unit RRU of the input port module in the (i-1)-th routing node comprises: a retransmission buffer with a storage capacity of two flits, a 2-to-1 multiplexer, a counter, an RRU controller and a VC tracking table; the VC tracking table stores the virtual channel IDs of the data held in the retransmission buffer;
when a transient fault exists in the link between the (i-1)-th routing node and the i-th routing node, the first error detection unit ECC1 of the i-th routing node sends a NACK signal to the RRU controller of the (i-1)-th routing node;
the RRU controller of the (i-1)-th routing node causes the counter to be incremented by one and controls the 2-to-1 multiplexer to gate the data in the two-flit retransmission buffer for retransmission;
when an intermittent fault exists in the link between the (i-1)-th routing node and the i-th routing node, the RRU controller of the (i-1)-th routing node sends an RX signal to the truncation recovery unit TRU of the (i-1)-th routing node for reselecting the routing path; the first error detection unit ECC1 of the i-th routing node sends a TX signal to the truncation recovery unit TRU of the i-th routing node for resource release.
3. The highly reliable link fault-tolerant module for transient faults and intermittent faults in a network-on-chip according to claim 1, characterized in that the truncation recovery unit TRU in the i-th routing node comprises: a buffer with a storage capacity of one flit, a 2-to-1 multiplexer, a 2-way data distributor, a pseudo head flit modification path Head, a pseudo tail flit modification path Tail, and a TRU controller; the buffer stores the head flit of the data packet;
when an intermittent fault exists in the link between the (i-1)-th routing node and the i-th routing node, the truncation recovery unit TRU of the i-th routing node receives the TX signal sent by the first error detection unit ECC1 of the i-th routing node and gates the pseudo tail flit modification path Tail to release resources;
the truncation recovery unit TRU of the (i-1)-th routing node receives the RX signal sent by the RRU controller of the (i-1)-th routing node and gates the pseudo head flit modification path Head to reroute;
the TRU controller deletes the head flit stored in the buffer after transmission of the data packet has been completed.
4. A highly reliable link fault-tolerance method for transient faults and intermittent faults in a network-on-chip, applied in a router composed of an input port module, a routing calculation module, a crossbar switch, a crossbar switch allocation module, a virtual channel arbitration module and an output port module; the input port module comprises n virtual channels VC, a multi-way data distributor and a multi-way data selector; the data packet is divided into several flits for transmission, and, following the order of the routing nodes along the path of the data packet, any routing node on the path is defined as the current routing node, for which the preceding routing node is the upstream node and the next routing node is the downstream node; the current routing node is denoted the i-th routing node; the upstream node is then the (i-1)-th routing node and the downstream node is the (i+1)-th routing node; characterized in that:
a first error detection unit ECC1 is provided at the input of the input port module of the i-th routing node; a tri-state gate and a truncation recovery unit TRU are provided on each of the n virtual channels; each virtual channel and its corresponding truncation recovery unit TRU transmit the data packet to the multi-way data selector through a 2-to-1 multiplexer; a retransmission recovery unit RRU and a second error detection unit ECC2 are provided at the output of the multi-way data selector; thereby constituting the highly reliable link fault-tolerant module;
the highly reliable link fault-tolerance method is carried out as follows:
Step 1: when the i-th routing node receives over the link a data packet that the (i-1)-th routing node has encoded with its second error detection unit ECC2, the first error detection unit ECC1 of the i-th routing node detects whether the data bits in the data packet are in error; if there is no error, the data packet enters the n virtual channels through the data distributor of the i-th input port module and is transmitted; if there is an error, step 2 is executed;
Step 2: the first error detection unit ECC1 of the i-th routing node determines whether the error can be corrected; if it can, the data are corrected automatically and then transmitted; otherwise, step 3 is executed;
Step 3: the first error detection unit ECC1 of the i-th routing node notifies the retransmission recovery unit RRU of the input port module of the (i-1)-th routing node to retransmit the erroneous data; at the same time, the counters of the i-th routing node and the (i-1)-th routing node are each incremented by one;
Step 4: the first error detection unit ECC1 of the i-th routing node and the retransmission recovery unit RRU of the (i-1)-th routing node each determine whether their own counter has been incremented consecutively and has reached the fault threshold of 3; if the consecutive increments have not reached the fault threshold, a transient fault exists in the link between the (i-1)-th routing node and the i-th routing node, and step 1 is executed; otherwise, an intermittent fault exists in the link between the (i-1)-th routing node and the i-th routing node, and step 5 is executed;
Step 5: the 2-to-1 multiplexer of the i-th routing node gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the i-th routing node to release resources; the 2-to-1 multiplexer of the (i-1)-th routing node gates the truncation recovery unit TRU corresponding to the intermittently faulty link in the (i-1)-th routing node to reroute and transmit the data packet to the crossbar switch.
5. The highly reliable link fault-tolerance method for transient faults and intermittent faults in a network-on-chip according to claim 4, characterized in that the second error detection unit ECC2 encodes the data packet using interleaved coding; that is:
any one flit in the data packet is divided evenly into m groups, each containing k data bits; the bits at the same position in every group are regrouped into one new group, forming k groups of data; the k groups of data are encoded separately to form a new flit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610184999.9A CN105656773B (en) | 2016-03-24 | 2016-03-24 | Highly reliable link fault-tolerant module and method for transient faults and intermittent faults in a network-on-chip |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105656773A CN105656773A (en) | 2016-06-08 |
CN105656773B true CN105656773B (en) | 2018-10-02 |
Family
ID=56495756
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610184999.9A Expired - Fee Related CN105656773B (en) | 2016-03-24 | 2016-03-24 | Highly reliable link fault-tolerant module and method for transient faults and intermittent faults in a network-on-chip |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105656773B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106804048B (en) * | 2017-02-17 | 2019-06-18 | 合肥工业大学 | A kind of communication mechanism of the wireless network-on-chip based on two-dimensional grid |
CN108900284B (en) * | 2018-07-12 | 2020-11-06 | 合肥工业大学 | High-efficiency fault-tolerant wireless interface in on-chip wireless network |
CN115190069B (en) * | 2022-04-26 | 2023-12-05 | 中国人民解放军国防科技大学 | High-performance network-on-chip fault-tolerant router device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103973482A (en) * | 2014-04-22 | 2014-08-06 | 南京航空航天大学 | Fault-tolerant on-chip network system with global communication service management capability and method |
CN104052622A (en) * | 2014-06-23 | 2014-09-17 | 合肥工业大学 | Router fault-tolerant method based on fault channel separation detection in NoC |
CN104579951A (en) * | 2014-12-29 | 2015-04-29 | 合肥工业大学 | Fault-tolerance method in on-chip network under novel fault and congestion model |
Non-Patent Citations (2)
Title |
---|
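TM: A New Network-on-Chip Topology; Wang Xinyu et al.; Chinese Journal of Computers (《计算机学报》); 20141130; Vol. 37 (No. 11); full text * |
An Adaptive Fault-Tolerant Method for NoC Links Based on Fault Granularity Division; Ouyang Yiming et al.; Journal of Electronic Measurement and Instrumentation (《电子测量与仪器学报》); 20150831; Vol. 29 (No. 8); full text * |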
Also Published As
Publication number | Publication date |
---|---|
CN105656773A (en) | 2016-06-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Feng et al. | Addressing transient and permanent faults in NoC with efficient fault-tolerant deflection router | |
US5768300A (en) | Interconnect fault detection and localization method and apparatus | |
CN103973482A (en) | Fault-tolerant on-chip network system with global communication service management capability and method | |
CN105656773B (en) | The fault-tolerant module of highly reliable link and its method of transient fault and intermittent defect are directed in network-on-chip | |
CN103124224B (en) | Multiple faults for Industry Control allows Ethernet | |
CN102629912B (en) | Fault-tolerant deflection routing method and device for bufferless network-on-chip | |
CN105359468A (en) | Link transfer, bit error detection and link retry using flit bundles asynchronous to link fabric packets | |
CN106487673B (en) | A kind of error detection re-transmission fault tolerance rout ing unit based on triplication redundancy | |
CN104052622B (en) | Router fault-tolerance approach based on faulty channel isolation detection in network-on-chip | |
US20240283746A1 (en) | Method and system for robust streaming of data | |
Fochi et al. | An integrated method for implementing online fault detection in NoC-based MPSoCs | |
Yu et al. | Error control integration scheme for reliable NoC | |
US6999411B1 (en) | System and method for router arbiter protection switching | |
CN102710530B (en) | Configurable network-on-chip fault tolerance method | |
Castro et al. | A fault tolerant NoC architecture based upon external router backup paths | |
Ghiribaldi et al. | System-level infrastructure for boot-time testing and configuration of networks-on-chip with programmable routing logic | |
CN111726288A (en) | Real-time data transmission and recovery method and system for power secondary equipment | |
Št’áva | Efficient error recovery scheme in fault-tolerant NoC architectures | |
CN102904807A (en) | Method for realizing fault-tolerant reconfigurable network on chip through split data transmission | |
CN103346862B (en) | A kind of network-on-chip data transmission device of cascade protection and method | |
Boraten et al. | Energy-efficient runtime adaptive scrubbing in fault-tolerant network-on-chips (nocs) architectures | |
Lucas et al. | Crosstalk fault tolerant NoC: design and evaluation | |
CN107682118A (en) | A kind of NoC error correction and detections based on duplication redundancy retransmit fault-tolerance approach | |
Stava | On precise fault localization and identification in NoC architectures | |
Ghiribaldi et al. | Power efficiency of switch architecture extensions for fault tolerant NoC design |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20181002 |