WO2013041752A1

WO2013041752A1 - Method, system and router for avoiding blockages in an interconnection network

Info

Publication number: WO2013041752A1
Application number: PCT/ES2012/070656
Authority: WO
Inventors: Gonzalo Zarza; Daniel Franco Puntes; Diego Lugones; Emilio LUQUE FADÓN
Original assignee: Universitat Autonoma De Barcelona
Priority date: 2011-09-22
Filing date: 2012-09-19
Publication date: 2013-03-28
Also published as: ES2377087A1; ES2377087B1

Abstract

The method comprises: a) detecting a situation prone to a blockage; and b) identifying a routing cycle involved in the detected situation prone to a blockage by virtue of a router (rⁱ) carrying out the following substeps by means of an asynchronous intra-router and inter-router search mechanism which does not require the use of timers: b1) composing and sending an identification message from an input buffer (aⁱj) of the router (rⁱ) to an output buffer (b^hk) of another router (r^h); and b2) receiving the identification message in the output buffer (bⁱk) associated with said input buffer (aⁱj) of the router (rⁱ) which composed said message, following the retransmission thereof by at least another router (r^h) from an input buffer (a^hj) thereof. The system and the router are adapted to implement the method proposed by the invention.

Description

Method, system and router for avoiding blockages in an interconnection network

Technical sector

The present invention generally concerns, in a first aspect, a method for avoiding blockages in an interconnection network based on the identification of routing cycles involved in situations prone to a blockage, and more particularly a method comprising carrying out said identification by means of a mechanism that does not require the use of timers.

The invention belongs to the field of parallel computer interconnection networks and focuses on the area of fault tolerance for high speed interconnection networks.

The object of the present invention is to identify cycles of resource dependencies in high-speed interconnection networks in an asynchronous intra-router and inter-router manner. The purpose of the identification is to prevent failures in the network devices from generating blocking situations in the interconnection network, thus allowing the applications running on the parallel computer to terminate.

A second aspect of the invention concerns a block avoidance system in an interconnection network adapted to implement the method of the first aspect.

A third aspect of the invention concerns a router intended for the implementation of the method of the first aspect of the invention.

Prior art

The performance of the latest generation parallel computing systems is closely related to the performance of the interconnection network that communicates its computing elements. This interrelation gives vital importance to the fault tolerance mechanisms of the interconnection network since, ultimately, it is the network that allows the operation of such computers as coherent and cohesive entities.

Under these circumstances, a single network failure is capable of generating potentially harmful anomalies in the computer system and of preventing the correct termination of the applications running on it. Among these anomalies, blocking scenarios, or "deadlocks", are perhaps the most likely to occur, because most current computer systems use routing algorithms that were not designed to tolerate failures (a situation that increases the probability of blockages in these systems). This problem becomes critical in situations where the computer system must deal with failures that appear randomly, that is, with dynamic failures. Examples include bank transaction systems and the systems used for geological studies and natural disasters, among others.

In the context of interconnection networks, blockages occur when at least one message cannot reach its destination because along its path it requests resources that are not available, more specifically, storage space in the queue of a network device. This situation results in total system blocking because the messages do not release the resources that other messages need while they request resources held by other messages. This leads to a group of messages being permanently blocked, causing the entire computer system to be out of order. Unfortunately, the appearance of blockages is frequent in situations in which one or more network devices fail. This is because even a relatively low number of failures can generate cyclic resource dependencies, derived from changes in the characteristics of the network, mainly in the topology.
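
The circular wait described above can be made concrete with a small sketch. It is only an illustration and not part of the disclosed method: the `waits_for` mapping and the function name are hypothetical, and each blocked message is assumed to wait for exactly one other message to release a queue slot.

```python
def find_blocked_cycle(waits_for, start):
    """Follow the 'waits for' relation starting at `start`.

    Returns the list of messages forming a cycle through `start`,
    or None if the chain ends or never comes back to `start`.
    """
    path = [start]
    current = waits_for.get(start)
    while current is not None and current not in path:
        path.append(current)
        current = waits_for.get(current)
    return path if current == start else None


# m1 needs the slot held by m2, m2 the one held by m3, m3 the one held by m1.
deadlocked = {"m1": "m2", "m2": "m3", "m3": "m1"}
# Here m3 can still advance, so the situation is congestion, not deadlock.
congested = {"m1": "m2", "m2": "m3"}

print(find_blocked_cycle(deadlocked, "m1"))  # ['m1', 'm2', 'm3']
print(find_blocked_cycle(congested, "m1"))   # None
```

In the first mapping no message can ever release its slot, which is the permanent blocking the paragraph refers to; in the second the chain ends at a message that can still move.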

During the last decades, several authors have proposed solutions to the problem of blockages in interconnection networks, among them: limiting the injection of messages into the network [1], [2]; using virtual channels [3], [4]; and modifying the network devices and routing algorithms [5], [6], [7]. Unfortunately, all the solutions proposed so far have been designed to avoid blockages in fault-free networks, so their applicability in the field of fault tolerance is very low.

The presence of network failures seriously affects and limits the vast majority, if not all, of the current blockage avoidance mechanisms, to the point of rendering them ineffective. These limitations relate to important aspects of the general theory of fault tolerance, such as the timing and manner of occurrence of failures, the number of failures to tolerate, the duration of failures, the characteristics of the systems used, etc.

The way in which failures occur is the most important aspect to consider, because it is closely related to the applicability and suitability of blockage avoidance techniques. If a static failure model is assumed, the location of the failures in the network must be known in advance. In contrast, under a dynamic failure model, failures appear at random times and locations during system execution. The static failure model allows simpler solutions to be implemented, but it is the dynamic failure model that reliably captures the behavior of failures in real computer systems.

If the simplest routing algorithms are used, namely those based on a static failure model, it is impossible to avoid dynamic failures because such algorithms do not allow (by themselves) the use of alternative and adaptive paths. In contrast, the more complex routing algorithms, the adaptive ones, allow alternative paths to be used to avoid failures. Even so, despite emerging as a very promising solution from the point of view of fault tolerance, the adaptive approach presents a new and important problem: the appearance of deadlocks. This problem becomes critical in situations where the computer system must deal with failures that appear randomly throughout the execution of the system, that is, with dynamic failures.

Currently, the vast majority of block avoidance techniques capable of tolerating failures in interconnection networks use virtual channels to break the cyclic dependencies of resources between messages. The use of virtual channels limits the scalability of these techniques because the number of virtual channels needed is directly proportional to the number of failures to tolerate. For this reason, most of the proposed techniques are based on static failure models, in order to reduce the number of virtual channels required. Another group of commonly used techniques is based on limiting the injection of packets, avoiding the hoarding of network resources in order to avoid indefinite blocking situations. However, these techniques are useful for routing algorithms based on the use of minimal paths, which makes it difficult or simply impossible to tolerate a medium or high number of failures. Virtually all of these methods and techniques to avoid blockages are based on different types of cycle identification mechanisms. Unfortunately, most of these methods were designed to operate in the absence of failures, so they become inapplicable when network failures occur.

From the point of view of the state of the art, the closest antecedent to the invention proposed here is a three-stage blockage avoidance mechanism called "Non-blocking Adaptive Cycles" (NAC), proposed by the authors of the present invention in [8] and [9]. From the explanation of the method and its application environment, it can be directly inferred that the NAC method presents a series of synchronization problems that limit its application in high-speed interconnection networks. More specifically, NAC uses a "best effort" cycle identification mechanism, which is applied serially, step by step, and relies on the control of router clock cycles to try to avoid the situations known as "false negatives". That is, the method uses the clocks of the routers to avoid situations in which a routing cycle prone to generate blockages is not correctly identified despite existing. In this method, the identification is performed serially, starting from the first router that detects the preset or trigger conditions. During this process, the identification is applied to one input buffer; then, when a time predefined by the clock expires, the identification is applied to the next input buffer, and so on. This makes it impossible to apply the mechanism proposed in [8] and [9] in high-speed interconnection networks, because its application is restricted to cases in which the synchronization of router clocks is not lost. This is because in NAC situations can occur in which more than one router starts the identification process on the same dependency cycle. In such a situation each router handles requests serially, so that responses to identification requests may suffer delays of varying duration.

In cases where the accumulation of delays causes the identification time to exceed the threshold value of the clock, at least one router will reach the wrong conclusion that there is no dependency cycle, so the corresponding actions will not be applied and a permanent blocking or "deadlock" situation will be generated in the network. In summary, the identification mechanism proposed in [8] and [9] can lead both to blocking situations and to "false negative" situations. Additionally, the block avoidance mechanism proposed in [9] has other restrictions that make its application very expensive, although not impossible. These limitations stem from the logical arrangement of the block avoidance elements, more specifically of the block avoidance buffers, which are in the critical path of the network devices. This considerably slows down the operation of the router, since it is necessary to constantly check which of the two physical paths connected to the input buffer should be used, both in the presence and in the absence of situations prone to generate blockages in the interconnection network.
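
The false negative described above can be illustrated with a toy calculation. This is a hedged sketch, not code from [8] or [9]: the function name, the per-hop delays and the threshold are all hypothetical numbers chosen only to show how accumulated delays can exceed a clock threshold even though the dependency cycle exists.

```python
def timed_identification(hop_delays, timeout):
    """Return True if the notification arrives before the timer expires.

    `hop_delays` are the delays (in clock cycles) added by each router that
    handles the identification request; `timeout` is the clock threshold.
    """
    elapsed = 0
    for delay in hop_delays:
        elapsed += delay
        if elapsed > timeout:
            # Timer expired: the router wrongly concludes "no cycle".
            return False
    return True


# The same four-router cycle exists in both cases.
unloaded = [2, 2, 2, 2]    # every router answers immediately
contended = [2, 5, 2, 4]   # two routers were busy serving other requests

print(timed_identification(unloaded, timeout=10))   # True
print(timed_identification(contended, timeout=10))  # False: a false negative
```

The second call models several routers serving identification requests serially: the cycle is real, but the answer arrives after the threshold, so no corrective action is taken.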

Document [12] gives a detailed explanation of the block avoidance method proposed in documents [8] and [9], with the depth and level of detail of a doctoral thesis, together with a modification at the hardware level (DAB buffers) described in the similarities section of this document.

In the method detailed in the thesis [12], the identification of routing cycles prone to generate blockages is done through a series of special flow control packets, which are sent along the same lines of communication as the data packets. This implies the mandatory existence of tacit intra-router and inter-router synchronism, so that in this method it is necessary to carry out an implicit control of the cycle identification time. This control, referred to in equations 4.8 and 4.9 of [12] (section 4.2.3, page 63), is what defines the time when the "gradual recovery of packet shipment" protocol is initiated in the method proposed in [12] and [9]. In this method, if a notification of cycle identification is not received after the waiting time defined in equation 4.9, it is assumed that there are no routing cycles prone to generate blockages and, therefore, the first packet of the corresponding input buffer (Be) is moved, if possible, to the appropriate output buffer (Bs) without previously passing through the block avoidance buffer (DAB). The problem with this method is that its operation allows blocking situations to be generated, as explained in the description of the state of the art of the present patent application with regard to methods [8] and [9], thereby invalidating its universal application. More specifically, what is explained in the thesis [12] does not contradict the explanation of the limitations set forth above with reference to documents [8] and [9].

The only modification / improvement at the hardware level that has been introduced in [12] with respect to the method presented in [9], is the relocation of block avoidance buffers (DABs) out of the critical path of network devices.

Explanation of the invention

It seems necessary to offer an alternative to the state of the art that covers the gaps found therein, and that in particular overcomes the aforementioned limitations of those proposed in [8], [9] and [12].

To this end, the present invention provides, in a first aspect, a method of preventing blockages in an interconnection network, which comprises sequentially performing the following steps:

a) detecting at least one situation prone to blockage; and

b) identifying at least one routing cycle, also known as a resource dependency cycle, involved in said detected situation prone to blockage, by a router of said interconnection network performing the following sub-steps:

b1) composing and sending an identification message from an input buffer of said router to at least one output buffer, of at least one other router, connected thereto; and

b2) receiving said identification message in the output buffer associated with said input buffer of said router that composed it, after its retransmission by at least said other router from an input buffer thereof.

Unlike the method disclosed in [9], which requires the use of timers, the method proposed by the present invention comprises performing said step b) by means of an asynchronous intra-router and inter-router search mechanism that does not require the use of timers.
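
Sub-steps b1) and b2) can be pictured with a minimal sketch. The text does not specify the layout of the identification message, so the `IdMessage` fields and the two helper functions below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdMessage:
    origin_router: str   # router that composed the message
    origin_input: int    # index of the input buffer that triggered it
    target_output: int   # index of the output buffer associated with that input


def compose(router, input_idx, requested_output):
    """Sub-step b1: build the message that will be sent to the neighbouring router."""
    return IdMessage(router, input_idx, requested_output)


def closes_cycle(msg, router, output_idx):
    """Sub-step b2: True when the message is back at the output buffer
    associated with the input buffer that composed it."""
    return msg.origin_router == router and msg.target_output == output_idx


msg = compose("r1", input_idx=0, requested_output=2)
print(closes_cycle(msg, "r2", 2))  # False: another router, keep retransmitting
print(closes_cycle(msg, "r1", 2))  # True: the routing cycle is identified
```

Note that nothing in the sketch measures time: the cycle is declared only when the message actually returns, which is the point of dispensing with timers.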

The term buffer should be understood herein as that corresponding to a storage device used to compensate for differences that may occur in the rate of data transmission between devices, as well as in the time of occurrence of such transmission events.

The method proposed by the first aspect of the invention is an asynchronous and scalable method, compatible with current network technologies, capable of identifying and tracing cyclic resource dependencies in an asynchronous intra-router and inter-router manner, so as to ensure the avoidance of blockages in interconnection networks with multiple dynamic failures. According to an exemplary embodiment, in order to identify a plurality of routing cycles involved in situations prone to blockage, the method comprises said router carrying out a plurality of steps b) in parallel, initiated by a corresponding plurality of sub-steps b1) of composing and sending, in parallel, a plurality of said identification messages from respective input buffers, and concluded by a corresponding plurality of sub-steps b2) of receiving said identification messages in the output buffers associated with said input buffers of said router. This exemplary embodiment is possible thanks to the intra-router asynchrony of the search mechanism used by the method of the invention: for an input buffer of a router to initiate an identification process, it is not necessary for another identification process initiated by another input buffer to have ended, and both can be performed in parallel.

For an exemplary embodiment, said asynchronous intra-router and inter-router search mechanism is a breadth-first search mechanism, or BFS ("Breadth First Search").
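
A breadth-first search of this kind can be sketched in software as follows. This is a centralized simulation under stated assumptions, not the distributed hardware mechanism of the invention: the `requests` mapping (blocked input buffer to the output buffer it is waiting for, within the same router), the `feeds` mapping (input buffer to the neighbouring output buffer whose link fills it) and the buffer names are all hypothetical.

```python
from collections import deque


def identify_cycle(origin_in, requests, feeds):
    """Breadth-first propagation of one identification message.

    Returns True if the message, sent against the direction of the data
    flow, reaches the output buffer that `origin_in` is waiting for.
    """
    target_out = requests[origin_in]

    # Group the blocked input buffers by the output buffer they wait for.
    waiting_on = {}
    for inp, out in requests.items():
        waiting_on.setdefault(out, []).append(inp)

    pending = deque([feeds[origin_in]])  # b1: send to the neighbouring output buffer
    visited = set()
    while pending:
        out = pending.popleft()
        if out == target_out:
            return True                  # b2: message is back at the origin router
        if out in visited:
            continue
        visited.add(out)
        # Retransmit from every blocked input buffer waiting on this output.
        for inp in waiting_on.get(out, []):
            if inp in feeds:
                pending.append(feeds[inp])
    return False


# Three routers in a ring: r1.out0 -> r2.in0, r2.out0 -> r3.in0, r3.out0 -> r1.in0.
requests = {"r1.in0": "r1.out0", "r2.in0": "r2.out0", "r3.in0": "r3.out0"}
feeds = {"r2.in0": "r1.out0", "r3.in0": "r2.out0", "r1.in0": "r3.out0"}
print(identify_cycle("r1.in0", requests, feeds))  # True
```

A `False` result only means that the simulated message died out. In the mechanism described here no negative conclusion is ever drawn from elapsed time.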

Said step a) comprises, according to one embodiment, locally assessing in each input buffer of said router operating conditions whose compliance establishes that the router is in a state prior to a blocking situation, which is interpreted as said detection of a situation prone to blockage.
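
The local evaluation of step a) might look like the sketch below. The actual operating conditions are not spelled out in this passage, so the predicate is an assumption that loosely mirrors the scenario of full, interconnected buffers mentioned for Figure 7A; the `Buffer` class and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    capacity: int
    occupied: int

    def is_full(self):
        return self.occupied >= self.capacity


def prone_to_blockage(input_buf, requested_output, downstream_input):
    """Assumed local condition: the input buffer, the output buffer its head
    message requests, and the neighbouring input buffer fed by that output
    are all full."""
    return (input_buf.is_full()
            and requested_output.is_full()
            and downstream_input.is_full())


full = Buffer(capacity=4, occupied=4)
draining = Buffer(capacity=4, occupied=3)
print(prone_to_blockage(full, full, full))      # True
print(prone_to_blockage(full, draining, full))  # False
```

The check uses only state visible to one input buffer, which is what allows every input buffer to evaluate it independently.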

The method comprises at least said other router, and in general several other routers, carrying out step a) for their input buffers and retransmitting the received identification message or messages through the input buffer or buffers where said operating conditions are met, before the message or messages are received, in said sub-step b2), in the output buffer or buffers of the router that composed them.

Also, for another embodiment, the method comprises said other router, or at least one router of said plurality of routers, initiating a corresponding sub-step b1) while step b) is being carried out by said router. That is, a router involved in an identification process initiated by another router can initiate its own identification process before the one in which it is involved has ended. Although this implies an accumulation of delays, thanks to the inter-router asynchrony of the search mechanism used by the method of the invention no router will reach the wrong conclusion that there is no dependency cycle, even if the reception of the message in sub-step b2) is delayed, since, unlike what is proposed in [9], said reception is not subject to any time threshold. For an exemplary embodiment, the method comprises, after identifying in step b) a routing cycle associated with an input buffer and an output buffer of said router, freeing a space in said input buffer by moving a first message of said input buffer to a block avoidance buffer attached to it outside its critical path.

The method comprises, after said release of a space in the input buffer of said router, initiating a protocol for gradual recovery of message movement that guarantees the movement of messages in said routing cycle identified in step b).
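
The release of a space in the input buffer can be sketched as a simple queue operation. This is an illustrative model only: the use of `deque` objects for the buffers, the `free_slot` name and the `dab_capacity` parameter are assumptions, and the recovery protocol itself is not modelled.

```python
from collections import deque


def free_slot(input_buffer, dab, dab_capacity=1):
    """Move the first message of the input buffer into the block avoidance
    buffer (DAB). Returns True if a slot was freed."""
    if not input_buffer or len(dab) >= dab_capacity:
        return False
    dab.append(input_buffer.popleft())
    return True


input_buffer = deque(["p1", "p2", "p3"])
dab = deque()

print(free_slot(input_buffer, dab))  # True
print(list(input_buffer))            # ['p2', 'p3']
print(list(dab))                     # ['p1']
```

Once the head message sits in the side buffer, the input buffer has room to accept a message from the neighbouring router, which is what lets movement around the identified cycle resume.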

According to an exemplary embodiment, the method comprises initiating a plurality of said gradual message movement recovery protocols, in parallel, for a corresponding plurality of routing cycles identified in step b).

The method proposed by the first aspect of the invention constitutes a fault-tolerant method for identifying and tracing resource dependency cycles in high-speed interconnection networks, based on a novel technique that allows parallel and distributed breadth-first searches. The proposed method consists in asynchronously identifying the cycles of resource dependencies through an identification process applied in the opposite direction to the one used in the process of allocating resources for routing messages through the network.

The method aims to provide interconnection networks with an asynchronous cycle identification mechanism that is scalable and easy to implement in hardware in current network devices, in order to avoid blocking situations in the event of one or more failures in the network devices. In summary, the new method proposed here is a non-blocking asynchronous cycle identification mechanism that, unlike previous proposals, dispenses with the use and control of router clocks and allows breadth-first searches to be applied in parallel.

By dispensing with these controls, the new mechanism can be applied asynchronously to any circumstance and network topology. In addition, this new mechanism allows both identification processes and recovery processes to be applied in parallel, which represents a novel advance in terms of resource optimization and makes it possible to improve total system performance under all operating conditions of the interconnection network.

The method proposed by the first aspect of the invention allows searches to be carried out:

1. Asynchronously: since the clocks of the routers are not used during the process of identifying resource dependency cycles, the method does not depend on any type of inter-router synchronization. Unlike the previous methods, the invention proposed here does not present problems in situations where more than one router initiates the cycle identification process simultaneously or almost simultaneously on the same cycle of resource dependencies;

2. In parallel: since the intra-router replication of the cycle identification packets can be performed in parallel and simultaneously on all the input buffers that meet the pre-established conditions when two or more cycles that share or overlap part of their path occur simultaneously. Unlike the previous methods, which apply the identification process serially, the proposal presented here allows parallel identification, independently of the use of clocks in the routers;

3. Non-blocking: because it differentiates blocking situations from congestion situations in the interconnection network without the need to use router clocks as time limits. This contribution of the invention prevents the cycle identification process from being blocked under any circumstances (unlike the previously published methods). In addition, the structural modification consisting in relocating the block avoidance buffers to the side of the critical path of the router, carried out for an embodiment, prevents the operation of the router from blocking and/or slowing down during the identification and treatment of resource dependency cycles.

A second aspect of the invention concerns a block avoidance system in an interconnection network, comprising two or more routers provided for the implementation of the method of the first aspect, with input buffers, output buffers and a control unit suitable for performing said detection of said step a) and said identification of said step b).

For an exemplary embodiment, at least one of said two routers comprises a block avoidance buffer attached to an input buffer thereof outside its critical path, its control unit being provided to implement the method of the first aspect according to the example of embodiment explained above with reference to the release of a space in the router's input buffer.

Also, according to an embodiment of the system of the second aspect, said control unit is intended to execute at least one protocol for gradual recovery of message movement by implementing the method of the first aspect.

A third aspect of the invention concerns a block avoidance router in an interconnection network, intended for the implementation of the method of the first aspect, comprising input buffers, output buffers and a control unit capable of performing the detection of step a) and the identification of step b). According to an embodiment, the router proposed by the third aspect of the invention comprises a block avoidance buffer attached to an input buffer thereof outside its critical path, said control unit being provided to implement the method of the first aspect according to the exemplary embodiment explained above with reference to the release of a space in the router's input buffer.

Said control unit is provided, according to an exemplary embodiment, to execute at least one protocol for gradual recovery of message movement by implementing the method of the first aspect.

Comparison between the present invention and the state of the art

As described above, the method proposed by the present invention provides an asynchronous and parallel identification of cycles, both intra-router and inter-router, based on the sending of special flow control signals through dedicated communication lines (ACKe and ACKs in Figure 2, which will be described in detail in a later section), instead of using the same communication lines as the data packets, as is done in [12] and [9]. This is one of the key features of the invention, since it allows the routers to carry out the identification independently both of the data communication process and of the mechanisms for controlling the waiting times for the receipt of notifications identifying cycles prone to generate blockages.

This allows more than one process of identification of (inter-router) routing cycles likely to generate blockages to be carried out in parallel. This is the case detailed in Figure 4 (described in the detailed description of some embodiments), which exemplifies the simultaneous, parallel identification of two routing cycles. In addition, and unlike methods [12], [8] and [9], the method of the present invention allows parallel processes of identification of cycles belonging to independent routing functions and of non-overlapping resource dependency cycles. Note the difference with the example in Figure 4.11 on page 65 of the description of the method in the doctoral thesis [12], in which parallel identifications of routing cycles are not represented. This is due to the fact that the method proposed in [12] does not provide a solution that allows simultaneous identification of parallel routing cycles.

In the method of the present invention, once the cycle identification process has begun, the packets of the input buffer (Be) are only moved to the output buffer (Bs) if the blocking conditions disappear; they are never moved because a maximum identification time has expired, unlike in the methods proposed in [12] and [9]. In the method detailed in the thesis [12], packets always move from the input buffer (Be) to the output buffer (Bs) at the expiration of the time defined by equations 4.8 and 4.9 of [12], which enables the appearance of blockages when the network is operating in specific situations, as described previously in the prior art section.

The method of the present invention is not based on an explicit notification of the absence of routing cycles (which in [12] is generated at the expiration of the maximum time defined by equations 4.8 and 4.9), but continues to operate asynchronously, waiting indefinitely for one of two situations: that the notification of a correct identification of the cycle is received; or, otherwise, that the conditions of existence of potential blockages, generated by congestion situations on the network, disappear. Therefore, in the proposal of the present invention, the packets of the input buffers (Be) involved in possible routing cycles are only moved to the block avoidance buffers (DAB) when the corresponding cycle identification signal is received.
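
The timer-free waiting behaviour can be summarized as a small event-driven decision. The sketch below is an assumption-laden illustration: the `Signal` values and the `wait_for_outcome` function are hypothetical names, and an iterable of signals stands in for the dedicated lines of the router.

```python
from enum import Enum, auto


class Signal(Enum):
    CYCLE_IDENTIFIED = auto()     # identification notification received
    CONDITIONS_CLEARED = auto()   # congestion dissolved, no potential blockage
    STILL_BLOCKED = auto()        # nothing new: keep waiting


def wait_for_outcome(signals):
    """Consume signals until one of the two terminating situations occurs.

    There is no deadline: the number of STILL_BLOCKED signals seen before a
    decision is irrelevant.
    """
    for signal in signals:
        if signal is Signal.CYCLE_IDENTIFIED:
            return "DAB"   # move the head packet to the block avoidance buffer
        if signal is Signal.CONDITIONS_CLEARED:
            return "Bs"    # move the head packet to the output buffer
    return None            # only reachable because the simulated stream is finite


late = [Signal.STILL_BLOCKED] * 1000 + [Signal.CYCLE_IDENTIFIED]
print(wait_for_outcome(late))                          # DAB
print(wait_for_outcome([Signal.CONDITIONS_CLEARED]))   # Bs
```

However late the identification signal arrives, the outcome is the same, so a delayed notification cannot be mistaken for the absence of a cycle.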

Additionally, the status of the input (Be), block avoidance (DAB) and output (Bs) buffers is controlled explicitly, by means of a series of independent communication paths connected directly to the routing and arbitration unit (EA). These communication paths are now explicitly shown in Figure 1 (a more detailed definition of Figure 2). This is another of the differences with respect to the method proposed in [9] and explained in detail in the doctoral thesis [12].

Another difference between the present invention and [12] is that [12] explains that the injection of flow control packets is asynchronous with respect to routing (page 57, section 4.2). However, this does not imply that communication lines are not shared (in fact they are shared). In the method proposed in [12], the composition of the flow control packets is indeed asynchronous, but the processing and transmission of these packets take place on the same communication channel, so the operations are undoubtedly serial, with a synchronism imposed by the use of the same communication channels for data transmission and for flow control packets.

In [12] reference is made to the application of a cycle identification method that can be represented as a BFS, but the reference corresponds only to the proof of applicability of the method (page 64, section 4.2.4). This does not imply that the identification can be performed asynchronously and in parallel (as proposed in the present patent application). Moreover, on the same page 64 it is explicitly stated that the BFS is used to find "a cycle in the graph", without referring to the parallel and asynchronous identification of more than one cycle, as in the method proposed in the present patent application. In the method proposed by the present invention, by contrast, it is proposed, for some embodiments, to apply the BFS in a parallel and distributed manner in each of the network devices. Additionally, in [12] it is mentioned that the proposed technique is scalable, referring to the fact that the use of virtual channels is not required (page 53, chapter 4). However, this statement has no relation to the synchronism or the parallelism of the method.

Nowhere in [12] is it mentioned that the proposed cycle identification method is parallel, nor that it can be performed in parallel in any way. Specifically, the method detailed in [12] does not allow the identification process or the recovery process to be carried out in parallel, which is expressed in the upper limit of the worst-case time of equation 4.9 (page 64, section 4.2.3) of [12], as well as in Figures 4.9 and 4.11 of [12]. In the method of the present invention, on the other hand, the identification processes can overlap, thus bringing forward the movement recovery process and reducing the recovery time thanks to the asynchronism characteristic of the proposed identification process, as shown in Figures 3 and 4.

Most apparent similarities between the method proposed by the present invention and that of [12] are based on generic definitions and standards of the general theory of blockage avoidance, such as: the (generic) requirement of identification of potential blocking situations; the requirement (also generic) to identify the routing cycles involved; and the conditions that must be reached to ensure that these cycles do not represent potential blocking situations. These definitions and requirements are well known and have been present in the specialized literature, such as [3], [4] and [10], for more than 30 years. Note that the contribution of the present invention lies in the definition of how said generic processes materialize in real physical devices, and in the mechanisms that allow such actions to be performed in an asynchronous, parallel and distributed (as well as scalable) manner, in order to provide a solution to a greater number of situations prone to generate blockages in high-speed interconnection networks.

Brief description of the drawings

The foregoing and other advantages and features will be more fully understood from the following detailed description of some embodiments with reference to the attached drawings, which should be taken by way of illustration and not limitation, in which:

Figure 1 is a summary table of the notation and the operators used in the description of the present invention;

Figure 2 is a diagram of the simplified architecture of the router or network device proposed by the third aspect of the invention;

Figure 3 illustrates the detail of the activation and deactivation process of the cycle identifiers in a router, according to an embodiment of the method proposed by the first aspect of the invention;

Figure 4 is a graph representation of two resource dependency cycle identifications in the interconnection network based on the network device of Figure 2, for an exemplary embodiment of the method proposed by the first aspect of the invention;

Figure 5 is a graphic representation of the three stages that make up the method proposed by the first aspect of the invention, for an exemplary embodiment;

Figure 6 illustrates two interconnected routers that include the proposed notation for the description of some embodiments that will be performed in the following section with respect to the method proposed by the first aspect of the invention;

Figure 7A illustrates the interconnection network of Figure 6, where a router has three full input buffers and one full output buffer that is connected to a full input buffer of a second router, and Figure 7B is a graph representation of said interconnection network;

Figure 8 is a graphic representation of the conditions for detecting situations prone to generate blockages, to be detected according to step a) of the method proposed by the first aspect of the invention, illustrated on the scheme of Figure 6; and

Figure 9 is a graphical representation, on the scheme of Figure 6, of the recovery conditions in which, by applying the method proposed by the first aspect of the invention for an embodiment, a space has been released in the input buffer of one of the routers illustrated, as a step prior to the start of a protocol for gradual recovery of message movement.

Detailed description of some embodiments

The notation used in this section for the description of the invention is detailed in the table of Figure 1, as well as in the interconnection network illustrated in Figure 6.

Likewise, Figure 2 shows a simplified diagram of the architecture of a network device or router based on the principles of the invention presented here, including the incoming (ACKe) and outgoing (ACKs) Acknowledge lines. For an exemplary embodiment, this diagram represents the router of the third aspect of the invention, the routers included in the system of the second aspect, and the router used by the method of the first aspect.

In said Figure 2 the input buffers have been generically referenced as Be and the output buffers as Bs, a switch as C, a series of input link controllers as CEE and a series of output link controllers as CES, all of them conventional elements in the kind of network device illustrated.

Likewise, in Figure 2 the block avoidance buffers have been indicated as DAB and the routing and arbitration unit as EA. The function of these elements will be described later.

The exemplary embodiments presented and described in detail in this section are based on the context of the method proposed in [12] and [9], but introduce the modifications of the present invention that allow asynchronous and parallel intra-router and inter-router communication. The novelty of the proposal made by the present invention lies in the use of an asynchronous intra-router and inter-router search mechanism that makes it possible to identify in real time the appearance of cyclic dependencies in the allocation of routing resources, and to perform said identification in a parallel and distributed manner.

The cycle identification mechanism proposed in the invention is, for an exemplary embodiment, an original extension of breadth-first search (BFS) processes, based on the parallel application of distributed searches in each of the network devices that are part of the resource dependency cycle. During the search, a series of actions are applied in parallel and asynchronously, starting from the network device, or router, in charge of starting the cycle identification process.

Said network device, indicated in Figures 6 to 9 as node r^i, constitutes the starting point of the identification process and is the device that sends the identification messages through at least one input buffer a^i_j which has no available storage space and, in turn, needs to access an output buffer b^i_k of the same device that also has no available space; that is, it meets the aforementioned operating conditions, whose fulfillment establishes that router r^i is in a state prior to a blocking situation. These actions are part of the intra-router identification process.

The network devices r^h (see Figure 9) that are physically connected to the input buffers a^i_j of the network device r^i through which the identification messages are sent repeat the same procedure applied in the device r^i; that is, they replicate the identification message to the network devices connected to those of their input buffers a^h_j that meet the conditions detailed above, extending the identification process to the inter-router search space. Conversely, network devices r^h that have no input buffers a^h_j meeting these conditions drain (eliminate) the identification messages instead of replicating them. In other words, the network devices that replicate the identification messages activate the cycle existence indicators asynchronously, while the network devices that drain the messages deactivate them. This process is exemplified in Figure 3, where the gray circles correspond to the activated indicators and the white circles to the deactivated indicators.
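The replicate-or-drain rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware mechanism of the invention: the `Router` class, its field names and the buffer labels are hypothetical, and the evaluation of the operating conditions is reduced to one precomputed flag per input buffer.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical router model with one condition flag per input buffer."""
    name: str
    # True where the input buffer meets the operating (trigger) conditions.
    meets_conditions: dict[str, bool]
    # Cycle existence indicators, one per input buffer (empty until first use).
    indicators: dict[str, bool] = field(default_factory=dict)

    def on_identification_message(self) -> list[str]:
        """Return the input buffers through which the message is replicated.

        An empty list means that the message is drained.
        """
        targets = [buf for buf, ok in self.meets_conditions.items() if ok]
        for buf in self.meets_conditions:
            # Replicating activates the indicator; draining deactivates it.
            self.indicators[buf] = buf in targets
        return targets
```

For example, a router whose only qualifying input buffer is `a1` replicates the message through `a1` and activates that indicator, whereas a router with no qualifying input buffer returns an empty list and leaves all its indicators deactivated.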

In said Figure 3 five nodes r^y, r^h, r^z, r^i, r^m are illustrated. It can be seen that the central node r^i has four input buffers a^i_1, a^i_2, a^i_3, a^i_4 and four output buffers b^i_1, b^i_2, b^i_3, b^i_4, and that it has initiated three parallel cycle identifications through three respective input buffers a^i_1, a^i_3, a^i_4 in which the three previously mentioned operating conditions, indicative of a state prior to a blocking situation, were fulfilled. These input buffers are respectively connected to an output buffer b^y_4 of node r^y, to an output buffer b^h_3 of node r^h and to an output buffer b^z_1 of node r^z for the transmission of the respective cycle identification messages.

The cycles are identified when the network device r^i that initiated the identification and tracking process receives the corresponding cycle identification message through the correct output buffer b^i_k, which in the case illustrated by Figure 3 is buffer b^i_3, which has received said identification message through the input buffer a^m_3 of node r^m.

This last process is exemplified in the diagram of Figure 4, which shows two cycle identification processes applied asynchronously on the same resource dependency cycle, from the network devices r^i and r^h, respectively. Explained colloquially, the method of identifying and monitoring cycles consists of lighting up the path (the dependency cycle) step by step, switching on (activating) bulbs (cycle indicators) along it. As the network devices themselves, both those that replicate and those that drain the messages, are responsible for activating and deactivating the cycle indicators, the invention proposed here allows the cycle identification method to be applied in a distributed and asynchronous manner throughout the resource dependency cycle.

This asynchronous, distributed breadth-first search approach has not been previously proposed or applied by any method of blocking avoidance.

The new identification method proposed by the present invention in its first aspect is composed, for an exemplary embodiment, of three stages, represented graphically in Figure 5:

1. The first identification of the resource dependency cycle, initiated by the first router r^i (shown as node 1 in Figure 5) that detects the buffer occupation conditions explained above. This stage ends when the router r^i receives the corresponding cycle identification message through the correct output buffer b^i_k. These actions correspond to stage 1 of Figure 5;

2. Secondary identifications, which are made on the same resource dependency cycle but are initiated by at least one other router r^h from an input buffer a^h_j thereof. These actions correspond to stage 2 of Figure 5, illustrated for four of said secondary identifications initiated from four respective routers (nodes 2, 3, 4 and 5 in Figure 5);

3. The propagation of the "recovery" signal, initiated by the router r^i immediately after the completion of stage 1. The use of the recovery signal is explained in detail below, for an example of application of the invention. These actions (R1 to R5) correspond to stage 3 of Figure 5.

The demonstration of the applicability and suitability of the cycle identification mechanism of the invention presented here rests on the fact that one of the purposes of BFS searches is to "find cycles in a graph or prove that such cycles do not exist". Said BFS-type search approach is based on the mathematical concepts by which the architecture of a network can be represented as a graph (an abstract representation of a set of vertices, or nodes, and edges) where the vertices represent the network or processing devices, as appropriate, and the edges represent the links that connect the devices [11]. If the input buffers (a^i_j) and output buffers (b^i_k) that make up the logical buffers (c^{i,m}_{k,j}) in Figure 6 are considered as nodes, and the connections of the flow control and ACK signalling system of the network device, with origin and termination in the routing and arbitration unit EA, are considered as edges, an equivalent representation of the network can be obtained as a graph, as exemplified in Figures 7A and 7B, respectively. From this graph, asynchronous and parallel intra-router searches and asynchronous and distributed inter-router searches of the BFS type can be performed, according to the method proposed in the invention presented herein.
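To make the graph view concrete, the following Python sketch runs a plain, centralized breadth-first search over a small buffer dependency graph and returns a cycle through a chosen starting buffer. It is an illustration of the underlying graph concept only: the invention distributes the search across routers and runs it asynchronously, which this sketch does not attempt to model. The function name, the node labels and the example graph are assumptions made for illustration.

```python
from collections import deque


def find_cycle_bfs(graph, start):
    """Breadth-first search for a shortest cycle passing through `start`.

    `graph` maps each buffer to the buffers it depends on. The result is the
    cycle as a list that begins and ends at `start`, or None if no cycle
    through `start` exists.
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == start:
                # Walk the parent links back to `start` to rebuild the cycle.
                path = [start]
                cur = node
                while cur is not None:
                    path.append(cur)
                    cur = parent[cur]
                path.reverse()
                return path
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical dependency graph: input and output buffers of two routers.
dependencies = {
    "a_i": ["b_i"],
    "b_i": ["a_m"],
    "a_m": ["b_m"],
    "b_m": ["a_i"],
}
cycle = find_cycle_bfs(dependencies, "a_i")
```

With the example graph, `cycle` is `["a_i", "b_i", "a_m", "b_m", "a_i"]`; removing the edge from `b_m` back to `a_i` makes the function return `None`, which corresponds to proving that no cycle through that buffer exists.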

In order to improve the understanding of the invention, an example of application of the method of identification and monitoring of cycles proposed herein is described below. The application example includes the detection of situations prone to generate blockages, and the identification and monitoring of the routing cycles involved in said situations, which is the object of the present invention.

Conceptually, the proposal is based on preventing the fulfillment of two of the four main conditions that cause blockages: the "hold-and-wait" condition and the "circular wait" condition [10]. To achieve this objective and thus avoid the appearance of blockages, a block avoidance buffer, or DAB ("Deadlock Avoidance Buffer"), is attached to each input buffer as exemplified in Figure 2, and a group of simple actions is applied when accessing and using output buffers that have no free space available. It is important to clarify that these actions only apply under specific circumstances directly related to the space available in the buffers of certain network devices, as explained above.

In order to detect which situations have the ability to generate blockages in the network, three operating conditions are proposed (illustrated in Figure 8, and indicated as (1), (2) and (3)) that act as triggers for the blockage avoidance mechanisms. The situation described by these conditions is that of an input buffer of any node that has no available storage space (Condition (3)) and tries to choose as its next step an output buffer (within the same network node) that has no available space either (Condition (2)) and is physically linked to another network node whose corresponding input buffer also has no available space (Condition (1)).

These three conditions are evaluated locally on each network device r^i before each message routing cycle. If all three of the above conditions are met, the network device is in a state prior to a blocking situation (since there are still resources available in the DAB buffers). In these circumstances, the network device in which these conditions arise (which is the node r^i in Figure 8) applies the following actions:

1. It stops the injection of messages into the logical buffer (c^{i,m}_{k,j});

2. It stops the injection of new messages from the input buffer connected to the local processing node (if necessary), i.e. from the input buffer of node r^i marked in gray.
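As a minimal sketch, the three trigger conditions can be written as a single local predicate. The `Buffer` class, its `capacity` and `occupied` fields, and the function name are hypothetical; the source only specifies that each of the three buffers has no available space.

```python
from dataclasses import dataclass


@dataclass
class Buffer:
    """Hypothetical buffer model that counts stored messages."""
    capacity: int
    occupied: int

    @property
    def full(self) -> bool:
        return self.occupied >= self.capacity


def prone_to_blockage(input_buf: Buffer, output_buf: Buffer,
                      next_input_buf: Buffer) -> bool:
    """Evaluate the three operating conditions for one routing request.

    Condition (3): the local input buffer has no free space.
    Condition (2): the requested local output buffer has no free space.
    Condition (1): the input buffer of the neighbouring router linked to that
                   output buffer has no free space either.
    """
    return input_buf.full and output_buf.full and next_input_buf.full
```

If the predicate holds for some input buffer, the router would apply the two actions listed above before starting the cycle identification; as soon as any one of the three buffers has free space, the predicate is false and normal routing continues.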

After applying these two actions, the network device r^i must identify and follow the routing cycles between origin-destination pairs prone to generate blockages. This process is responsible for identifying the start and end of the routing cycles between origin-destination pairs in each network device involved in a possible blocking situation. Note that this process corresponds to stages 1 and 2 of Figure 5. The identification of resource dependency cycles can be performed in an asynchronous and parallel manner intra-router and in an asynchronous and distributed manner inter-router, carrying out the identification in the direction opposite to the one used in the process of allocating resources for routing messages through the network. This process is intended to ensure that the last free resources in a routing cycle are accessed by the correct input buffer.

The necessary steps for the identification and monitoring of routing cycles applied by a network device are:

1. Compose a new identification message that includes as information: the identifier of the network device r^i that created it; the identifier of the input buffer a^i_j where the trigger conditions were met; and the identifier of the output buffer b^i_k through which the message should be received if it is part of the routing cycle;

2. Send the identification message in the direction opposite to the message routing direction, through the point-to-point flow control mechanism of the network device r^i. The message is sent to the output buffers (b^h_k ∈ c^{h,i}_{k,j}) of the network devices r^h (see Figure 9) physically connected to the input buffers a^i_j that meet the trigger conditions. Upon receiving one of these identification messages, each network device r^h must verify the conditions and, if at least one of its input buffers a^h_j meets those conditions, the network device replicates (retransmits) this identification message to the appropriate buffers. Otherwise, the network device drains (deletes) the message. The first stage of the cycle identification of Figure 5 concludes when the identification message arrives at the network device r^i that composed it through the correct output buffer b^i_k, as illustrated by the arrow on the gray line of Figure 4 for node r^i, or by that on the black line for node r^h.
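The two steps above can be sketched as a small simulation in which an identification message travels against the routing direction until it either returns to its composer or is drained. The sketch is deliberately simplified and rests on assumptions that are not in the source: each router has a single candidate input/output buffer pair, the `upstream` map stands in for the physical links, and all class, function and router names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentificationMessage:
    router_id: str      # router that composed the message
    input_buffer: str   # input buffer where the trigger conditions were met
    output_buffer: str  # output buffer expected to receive the message back


def identify_cycle(upstream, meets_conditions, origin):
    """Follow one identification message backwards along the routing path.

    `upstream` maps each router to the router that feeds its full input
    buffer. Returns True if the message comes back to `origin` (cycle
    identified) and False if some router drains it (congestion only).
    """
    msg = IdentificationMessage(origin, "a_j", "b_k")
    current = upstream[origin]
    # The hop bound only guarantees that this sketch terminates.
    for _ in range(len(upstream)):
        if current == msg.router_id:
            return True
        if not meets_conditions[current]:
            return False  # drained by a router that does not meet the conditions
        current = upstream[current]
    return False


ring = {"r_i": "r_h", "r_h": "r_m", "r_m": "r_i"}
blocked = identify_cycle(ring, {"r_i": True, "r_h": True, "r_m": True}, "r_i")
congested = identify_cycle(ring, {"r_i": True, "r_h": True, "r_m": False}, "r_i")
```

In the first call every router on the ring meets the conditions, so the message returns to `r_i` and `blocked` is `True`. In the second call `r_m` drains the message, so `congested` is `False`, matching the congested-but-not-blocked case.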

Once the input and output buffers a^i_j, b^i_k that are part of the cycle have been identified, a space in the input buffer a^i_j is released by the move(a^i_j, d^i) operation. This operation is the one that finally introduces a new free space in the routing cycle by moving the first message from the input buffer a^i_j to the DAB buffer d^i (see Figure 9), thereby resolving the semi-blocked state of the input buffer a^i_j and allowing the messages of the cycle to move towards their respective destinations. The cases in which the identification does not conclude are those in which the identification message has been sent through the correct input buffer a^i_j but there is at least one network device r^* along the cycle in which the conditions are not met, so the identification message is deleted. In this case, the routing cycle is not blocked but congested.
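The move operation can be illustrated with two FIFO queues. This is a sketch under the assumption that both the input buffer and the DAB behave as simple first-in, first-out queues; the function signature and message labels are hypothetical.

```python
from collections import deque


def move(input_buffer: deque, dab: deque) -> None:
    """Move the first message of an input buffer into its attached DAB.

    This frees exactly one slot in the input buffer, which is the new free
    space introduced into the routing cycle.
    """
    dab.append(input_buffer.popleft())


a_ij = deque(["m1", "m2", "m3"])  # full input buffer (capacity 3 assumed)
d_i = deque()                     # empty block avoidance buffer
move(a_ij, d_i)
```

After the call, `a_ij` holds `m2` and `m3` and has one free slot, while `d_i` holds `m1`.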

As a first step, the network device r^i in which the risk situation was identified must compose and send (in the direction opposite to the routing of application messages) the recovery signal corresponding to stage 3 of Figure 5, so that the network device r^h that precedes it in the routing cycle initiates the protocol for gradual recovery of message movement. This signal is used to alert the preceding network devices that the new space generated in their output buffer b^h_k is actually a space generated by the use of the DAB buffer. This signal allows the network device r^h to distinguish between the following two situations:

• The new space was generated by the disappearance of the situation prone to generate a blockage;

• The new space was generated by the use of the DAB buffer in the network device connected to its output buffer b^h_k.

After receiving the recovery signal, the network device r^h uses the information obtained from its own identification process (one of the secondary identifications in stage 2 of Figure 5) to identify locally which of its input buffers (a^h_j) belongs to the routing cycle. This action is critical, since each network device that is part of the routing cycle must apply the move(a^h_j, d^h_j) operation on the correct input buffer a^h_j.

Once the routing cycles involved in situations prone to generating blockages have been identified, a set of actions is applied that guarantees the movement of messages in said cycles without blocking situations occurring. This set of actions has as its starting point the secondary identification stage of Figure 5, in which at least one router r^h starts its own identification of the dependency cycle in order to know which of its input buffers a^h_j and output buffers b^h_k are involved in the cycle, and thus be able to ensure the gradual recovery of message movement until the normal operating conditions of the network are reached, that is, until at least one of the conditions of Figure 9 (indicated as (4), (5), (6) and (7)) is no longer met. If any of these four conditions is not met, the protocol for gradual recovery of message movement must be terminated, since this implies that the network has returned to its normal operating conditions. As a first step, the injection of messages into the logical buffer (c^{h,i}_{k,j}) is restarted by the start_st(b^h_k) action (giving priority to the messages stored in the DAB buffers). Finally, the injection of messages from the local processing node is allowed again by the start_fw(a^h_j) action.

Possible industrial application and production of the invention:

The goal of the invention is to achieve its incorporation, use and implementation - at industrial and production level - in high performance interconnection networks. It is for this reason that the invention seeks to provide a method of blocking avoidance that allows a high-speed interconnection network to tolerate one or more faults in its components. The scheme of Figure 2 represents a possible starting point for the design of devices compatible with the invention, although it is only illustrative and not limiting.

A person skilled in the art could introduce changes and modifications in the described embodiments without departing from the scope of the invention as defined in the appended claims.

References

[1] A. W. Roscoe, "Routing messages through networks: an exercise in deadlock avoidance," in Programming of Transputer Based Machines: Proceedings of 7th occam User Group Technical Meeting, 1987.

[2] V. Puente, C. Izu, R. Beivide, J. A. Gregorio, F. Vallejo, and J. M. Prellezo, "The adaptive bubble router," J. Parallel Distrib. Comput., vol. 61, no. 9, pp. 1180-1208, 2001.

[3] J. Duato, "A theory of deadlock-free adaptive multicast routing in wormhole networks," Parallel and Distributed Systems, IEEE Transactions on, vol. 6, pp. 976-987, Sep 1995.

[4] J. Duato, "A necessary and sufficient condition for deadlock-free routing in cut-through and store-and-forward networks," IEEE Trans. Parallel Distrib. Syst., vol. 7, no. 8, pp. 841-854.

[5] S. Konstantinidou and L. Snyder, "The chaos router: a practical application of randomization in network routing," in SPAA '90: Proc. of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, (New York, USA), pp. 21-30, 1990.

[6] S. Konstantinidou and L. Snyder, "Chaos router: architecture and performance," SIGARCH Comput. Archit. News, vol. 19, no. 3, pp. 212-221, 1991.

[7] S. Konstantinidou and L. Snyder, "The chaos router," Computers, IEEE Transactions on, vol. 43, pp. 1386-1397, Dec 1994.

[8] G. Zarza, D. Lugones, D. Franco, and E. Luque, "Deadlock avoidance for interconnection networks with multiple dynamic faults," in 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, pp. 276-280, Feb 2010.

[9] G. Zarza, D. Lugones, D. Franco, and E. Luque, "Non-blocking adaptive cycles: Deadlock avoidance for fault-tolerant interconnection networks," in Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010 IEEE International Conference on, pp. 1-4, Sep 20th 2010.

[10] E. G. Coffman, M. Elphick, and A. Shoshani, "System deadlocks," ACM Comput. Surv., vol. 3, no. 2, pp. 67-78, 1971.

[11] L.-H. Hsu and C.-K. Lin, Graph Theory and Interconnection Networks. CRC Press, 2008.

[12] G. Zarza, "Multipath Fault-tolerant Routing Policies to deal with Dynamic Link Failures in High Speed Interconnection Networks" - Autonomous University of Barcelona, School of Engineering - Dept. of Computer Architecture and Operating Systems. 2011, Chapter 4, pages 53-66.

Claims

1. - Method of avoiding blockages in an interconnection network, which comprises sequentially performing the following steps:

a) detect at least one situation prone to blockage; and

b) identify at least one routing cycle involved in said detected situation prone to a blockage, by performing the following sub-steps by a router (r^i) of said interconnection network:

b1) composing and sending an identification message from an input buffer (a^i_j) of said router (r^i) to at least one output buffer (b^h_k) of at least one other router (r^h) connected thereto; and

b2) receiving said identification message in the output buffer (b^i_k) associated with said input buffer (a^i_j) of said router (r^i) that composed it, after its retransmission by at least said other router (r^h) from an input buffer (a^h_j) thereof;

the method being characterized in that it comprises performing said step b) by means of an asynchronous intra-router and inter-router search mechanism that does not require the use of timers.

2. - Method according to claim 1, characterized in that, to identify a plurality of routing cycles involved in situations prone to a blockage, the method comprises carrying out, in parallel, a plurality of steps b) by said router (r^i), initiated by a corresponding plurality of sub-steps b1) of composing and sending, in parallel, a plurality of said identification messages from respective input buffers (a^i_1, a^i_2, a^i_3, a^i_4), and terminated by a corresponding plurality of sub-steps b2) of receiving said identification messages in the output buffers (b^i_1, b^i_2, b^i_3, b^i_4) associated with said input buffers (a^i_1, a^i_2, a^i_3, a^i_4) of said router (r^i).

3. - Method according to any one of the preceding claims, characterized in that said asynchronous intra-router and inter-router search mechanism is a breadth-first search mechanism.

4. - Method according to any one of the preceding claims, characterized in that said identification message includes at least information referring to the identifier of the router (r^i) that created it, to the identifier of the input buffer (a^i_j) through which it was sent, and to the identifier of the output buffer (b^i_k) through which the identification message should be received if it is part of the routing cycle.

5. - Method according to any one of the preceding claims, characterized in that said step a) comprises evaluating locally, at each input buffer (a^i_1, a^i_2, a^i_3, a^i_4) of said router (r^i), operating conditions whose fulfillment establishes that the router (r^i) is in a state prior to a blocking situation, which is interpreted as said detection of a situation prone to a blockage.

6. - Method according to claim 5, characterized in that it comprises, by at least said other router (r^h), carrying out said step a) for its input buffers (a^h_1, a^h_2, a^h_3, a^h_4) and carrying out said retransmission of the received identification message or messages through the input buffer or buffers (a^h_1, a^h_2, a^h_3, a^h_4) where said operating conditions are met.

7. - Method according to claim 6, characterized in that it comprises carrying out said retransmission of the identification message or messages through a plurality of routers (r^h, r^y, r^m), through those of their input buffers that meet said operating conditions, before the message or messages are received, in said sub-step b2), in the output buffer (b^i_k) of said router (r^i) that composed the identification message or messages sent.

8. - Method according to claim 5, characterized in that said operating conditions describe a situation in which an input buffer (a^i_j) of a router (r^i) has no available storage space and tries to choose as its next step an output buffer (b^i_k) of the same router (r^i), which also has no available space and is physically linked to the input buffer (a^m_j) of another router (r^m) that also has no available space.

9. - Method according to any one of the preceding claims, characterized in that it comprises at least initiating a corresponding sub-step b1) by at least said other router (r^h) or at least one router of said plurality of routers (r^h, r^m), while step b) is being carried out by said router (r^i).

10. - Method according to any one of the preceding claims, characterized in that it comprises, after identifying in said step b) a routing cycle associated with an input buffer (a^i_j) and an output buffer (b^i_k) of said router (r^i), freeing a space in said input buffer (a^i_j) by moving a first message of said input buffer (a^i_j) to a block avoidance buffer (d^i) attached thereto outside its critical path.

11. - Method according to claim 10, characterized in that it comprises, after said release of a space in the input buffer (a^i_j) of said router (r^i), initiating a protocol for gradual recovery of message movement that guarantees the movement of messages in said routing cycle identified in step b).

12. - Method according to claim 11, characterized in that it comprises initiating a plurality of said protocols for gradual recovery of message movement, in parallel, for a corresponding plurality of routing cycles identified in step b).

13. - Block avoidance system in an interconnection network, characterized in that it comprises at least two routers (r^i, r^h) provided for the implementation of the method according to any one of the preceding claims, with input buffers (a^i_j, a^h_j), output buffers (b^i_k, b^h_k) and a control unit suitable for performing said detection of said step a) and said identification of said step b).

14. - System according to claim 13, characterized in that at least one (r^i) of said two routers (r^i, r^h) comprises a block avoidance buffer (d^i) attached to an input buffer (a^i_j) thereof outside its critical path, its control unit being provided to implement the method according to claim 9 to perform said release of a space in said input buffer (a^i_j).

15. - System according to claim 14, characterized in that said control unit is provided to execute at least one protocol for gradual recovery of message movement by implementing the method according to claim 11 or 12.

16. - Block avoidance router in an interconnection network, characterized in that it is intended for the implementation of the method according to any one of claims 1 to 12, comprising input buffers (a^i_j), output buffers (b^i_k) and a control unit capable of performing said detection of said step a) and said identification of said step b).

17. - Router according to claim 16, characterized in that it comprises a block avoidance buffer (d^i) attached to an input buffer (a^i_j) thereof outside its critical path, said control unit being provided to implement the method according to claim 10 for carrying out said release of a space in said input buffer (a^i_j).

18. - Router according to claim 17, characterized in that said control unit is intended to execute at least one protocol for gradual recovery of message movement by implementing the method according to claim 11 or 12.