WO2007139542A1 - Uninterrupted network control message generation during local node outages - Google Patents

Uninterrupted network control message generation during local node outages

Info

Publication number
WO2007139542A1
WO2007139542A1 (PCT/US2006/020681, US2006020681W)
Authority
WO
WIPO (PCT)
Prior art keywords
state machine
messages
cache
network
nodes
Prior art date
Application number
PCT/US2006/020681
Other languages
French (fr)
Inventor
Dieter Stoll
Georg Wenzel
Wolfgang Thomas
Original Assignee
Lucent Technologies Inc.
Lucent Technologies Network Systems Gmbh
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lucent Technologies Inc., Lucent Technologies Network Systems Gmbh filed Critical Lucent Technologies Inc.
Priority to KR1020087029207A priority Critical patent/KR101017540B1/en
Priority to CNA2006800547591A priority patent/CN101461196A/en
Priority to PCT/US2006/020681 priority patent/WO2007139542A1/en
Priority to EP06771449A priority patent/EP2030378A4/en
Priority to JP2009513106A priority patent/JP2009539305A/en
Publication of WO2007139542A1 publication Critical patent/WO2007139542A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/28 Routing or path finding of packets in data switching networks using route fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/58 Association of routers

Definitions

  • the present invention generally relates to computer networks.
  • the present invention relates to packet switching and control plane protocols.
  • Packet switching networks include control plane protocols, such as the spanning tree protocol (STP), the generic attribute registration protocol (GARP) and its version for virtual local area networks, the VLAN registration protocol (GVRP), the link aggregation control protocol (LACP), Y.1711 fast failure detection (FFD), and reservation protocol (RSVP) refresh.
  • Control protocols have the responsibility to, for example, control the topology and distribution of how layer 2 (L2) traffic flows through the network. These protocols are realized in the state machines running on each participating network element. Once a stable network configuration has been reached, the protocols tend to repeat the same messages they send to the network. Different messages usually result from an operator or defect driven change in the network.
  • a failure in participating in the protocol by a network element leads to traffic rearrangements once a timeout period ranging from a few milliseconds to a few seconds is exceeded.
  • traffic rearrangements involve the entire network.
  • the packet control protocols fall into one of three categories. They are (1) unprotected; (2) protected via proprietary communication with the neighbor network elements prior to control plane outages; or (3) protected by standardized graceful restart technology, which requires interaction with neighbor network elements shortly before or after a protocol outage.
  • In the unprotected case, the result will, in general, be that the traffic flow through the network is reconfigured. During the time of reconfiguration, traffic loss will occur in parts of the network that can be as large as the entire network domain.
  • Exemplary embodiments of the present invention prevent packet network reconfiguration and associated traffic loss by providing uninterrupted network control message generation during local node outages.
  • a message cache receives a number of sent messages from a protocol state machine for a local node and forwards them to other nodes in the network.
  • the message cache also receives messages from the nodes.
  • the message cache stores both the sent and received messages in a buffer.
  • Upon failure of the protocol state machine, the message cache sends messages to and receives messages from the nodes, so long as the buffer remains valid.
  • the messages may be sent periodically to the nodes.
  • the message cache may determine whether the buffer is valid based on the messages in the buffer and messages received from the nodes after the failure.
  • the method may also include switching to a standby protocol state machine, upon failure of the active protocol state machine, where the standby protocol state machine includes another buffer replicating the first buffer.
  • Another embodiment is a computer readable medium storing instructions for performing this method for providing uninterrupted network control message generation during local node outages.
  • Yet another embodiment is a system for providing uninterrupted network control message generation during local node outages, including a protocol state machine and a message cache.
  • the protocol state machine generates messages.
  • the message cache receives the messages from the protocol state machine and forwards them to nodes in the network.
  • the message cache stores both the sent and received messages in one or more buffers.
  • Upon failure of the protocol state machine, the message cache sends messages to and receives messages from the nodes, so long as the message cache remains valid.
  • the message cache may include a timer for sending periodic messages to the nodes and a status control determining whether the message cache is valid.
  • the system may include a worker node and a protection node, each having protocol state machines and message caches so that the protection node is able to become active when the worker node fails.
  • the protection message cache may replicate the worker message cache, while the worker protocol state machine is active.
  • Figure 1 is a block diagram illustrating an exemplary embodiment of a cache concept for a default case, when a state machine for a control plane protocol is active;
  • Figure 2 is a block diagram illustrating the exemplary embodiment of the cache concept of Figure 1 for a control plane failure case, when the protocol state machine is unavailable and the network state is stable;
  • Figure 3 is a block diagram illustrating the exemplary embodiment of the cache concept of Figure 1 for a control plane failure case, when the protocol state machine is unavailable and the network state is unstable;
  • Figure 4 is a block diagram illustrating an exemplary embodiment of a cache concept for a default case, when two instances of a state machine exist (worker and protection), the worker state machine being active, the protection state machine being standby, and each being associated with a cache;
  • Figure 5 is a block diagram illustrating the exemplary embodiment of the cache concept of Figure 4 for an intermediate state when the worker state machine was active and failed, the protection state machine in standby state is recovering (from standby to full operation), but the network state is stable;
  • Figure 6 is a block diagram illustrating the exemplary embodiment of the cache concept of Figure 4 when the protection state machine is active and the worker state machine is standby (after a switch over from worker to protection);
  • Figure 7 is a chart showing selected state transitions and events on a time line for the exemplary embodiment of the cache concept of Figure 4;
  • Figure 8 is a block diagram illustrating an exemplary embodiment of a distributed cache.
  • the network element should maintain a stable network if the only cause of instability is the equipment protection switch, i.e., for the case of a single failure (e.g., circuit pack defect) but also for the case of operator driven events such as manual switches.
  • the network element should minimize network impact in case a network is already undergoing a reconfiguration, e.g., due to a remote network element failure, while simultaneously the protection switch is required due to local defect (double failure) or operator commands.
  • Exemplary embodiments of the present invention achieve these goals not only for this L2 Ethernet example, but more broadly for any failure (e.g., hardware defect) causing a temporary unavailability of the local control plane of any network for many protocols.
  • the network element behavior may be described by three states.
  • In the first state, the state machine is fully operable and reacting to all requests.
  • In the second state, the state machine is not available, but the cache maintains PDU sending until a change in the network happens, which invalidates the cache, or the state machine becomes operable.
  • In the third state, both the state machine and the cache are not available, e.g., due to an ongoing reconfiguration in the network while the state machine is inoperable, or due to the protocol state machine and cache not being synchronized.
  • Exemplary embodiments of the caching concept are derived from the observation that in a stable network, the spanning tree protocol nodes distribute identical PDUs to their neighbors repeatedly. A network defect or network change is detected, if no PDUs have been received by a spanning tree node during three consecutive sending periods or the content of a PDU is different from the preceding PDU. Thus, in an otherwise stable network topology, the activity of a spanning tree protocol machine can be suspended for an indefinite amount of time, as long as the periodic sending of PDUs is maintained. Thus, the caching concept uses this fact so that the network demands for PDUs are satisfied from the cache, without the need for all of the configuration, protocol state machines, and the like being started and synchronized.
  • the caching concept relieves the demand regarding recovery speed of all software components, except the one operating the cache (which is in hot standby). There are certain times when the cache can be considered valid for PDU sending and other times when the cache needs to be invalidated. Note that within a stable network topology, to some extent, even new services can be established (e.g., forwarding traffic can be modified in terms of new quality of service (QoS) parameters, new customers (distinguished by C-VLANs) can be added to a service provider (802.1 ad) network, etc.).
  • QoS quality of service
  • a packet switched network is a network in which messages or fragments of messages (packets) are sent to their destination through the most expedient route, as determined by a routing algorithm.
  • a control plane is a virtual network function used to set up, maintain, and terminate data plane connections. It is virtual in the sense that it is distributed over network nodes that need to interoperate to realize the function.
  • a data plane is a virtual network path used to distribute data between nodes. Some networks may disaggregate control and forwarding planes as well.
  • the term cache refers to any storage managed to take advantage of locality of access.
  • a message cache stores messages. The message cache is instantiated and its messages are kept in a synchronous state with the messages that the control plane sends/receives to/from the network.
  • the cache satisfies the demands of the network by sending the cached messages. Once the control plane recovers, the cache again follows the control operation and keeps in sync.
  • Unstable networks are those where the traffic flow distribution has not reached a stable state, such as power on scenarios of a network element. Double failures are those scenarios where, in addition to a control plane outage in one network element, other network elements experience defects or operator driven reconfigurations.
  • FIG. 1 illustrates an exemplary embodiment of a cache concept 100 for a default case, when a state machine 102 for a control plane protocol is active.
  • the control plane protocol may be any kind of protocol, e.g., STP, VLAN registration protocol, LACP, Y.1711 FFD, or RSVP refresh.
  • the protocol state machine 102 communicates (via intermediate hardware layers) with the neighboring nodes 106 and the rest of the network 108.
  • this embodiment includes a message cache 104 interposed between the protocol state machine 102 and the network 108.
  • the protocol state machine 102 sends messages to the message cache 104, which then forwards those messages to the network 108.
  • the message cache 104 captures communication between the protocol state machine 102 and the network by storing both sent messages 110 and received messages 112 in buffers.
  • the message cache 104 also includes a timer 114 and a status control 116.
  • the state machine 102 may convey additional state information to the status control 116 (i.e., in addition to the messages exchanged), depending on the particular protocol to be supported.
  • the contents of the message cache 104 vary depending on the control plane protocol implemented.
  • the message cache 104 stores what is needed to temporarily serve the needs of the network 108 in the case of a failure of the state machine 102.
  • Figure 2 illustrates the exemplary embodiment of the cache concept 100 of Figure 1 for a control plane failure case, when the protocol state machine 102 is unavailable and the network state is stable.
  • the message cache 104 protects against situations where the protocol state machine is unavailable for any reason, by temporarily continuing to serve the network. For example, the processor holding the protocol state machine 102 may be rebooting.
  • the message cache 104 generally continues to send messages from the buffers so that neighboring nodes 106 in the network 108 do not become aware that the protocol state machine 102 is unavailable. Communication to the neighboring nodes 106 is mimicked based on information stored in the message cache 104.
  • the message cache 104 bridges at least a portion of the time that the protocol state machine 102 is unavailable.
  • Protocols that periodically send the same message (e.g., hello message, update message) to the neighboring nodes 106 can easily be mimicked.
  • the message cache 104 uses the timer 114 to send messages stored in the sent messages buffer 110 periodically in the same manner as the protocol state machine 102. As a result, the neighboring nodes 106 do not detect any change in the protocol state machine 102.
  • the message cache 104 receives messages from neighboring nodes 106 and stores them in the received message buffer 112.
  • the message cache 104 is able to detect any event or change (e.g., state change) in the network 108 that would make the message cache 104 invalid by examining the status control 116 and the received messages.
  • the status control 116 determines whether the message cache 104 is valid or invalid. When the message cache 104 becomes invalid, it ceases sending messages because it cannot properly react to the event or change in the network 108.
  • the message cache 104 is a simplified component to simulate at least a portion of the protocol state machine 102. An efficient implementation of the message cache 104 probably does not simulate the complete behavior of the control plane protocol.
  • the degree of simplicity or complexity of the message cache 104 may vary depending on the control plane protocol implemented.
  • the message cache may simulate transition between two or more states of the protocol state machine 102 with logic in the status control 116.
  • the message cache may be implemented in hardware, firmware, or software (e.g., field-programmable gate array (FPGA) or application-specific integrated circuit (ASIC)).
  • FPGA field-programmable gate array
  • ASIC application-specific integrated circuit
  • the message cache 104 continues to mimic the protocol state machine so long as it remains valid, which may be a short time or the entire time the protocol state machine is unavailable, depending on circumstances. Some protocols require updates in the milliseconds range, while others require updates in the seconds range. This embodiment is not limited to any particular protocol or degree of complexity of the status control logic 116.
  • Figure 3 illustrates the exemplary embodiment of the cache concept 100 of Figure 1 for a control plane failure case, when the protocol state machine 102 is unavailable and the network state is unstable.
  • the message cache 104 transitions into an invalid state.
  • the status control 116 determines that some event occurred, making the network state unstable so that simulation of the protocol state machine 102 by the message cache 104 must stop according to the particular protocol implemented.
  • the neighboring nodes 106 may become aware that the protocol state machine 102 is failed or otherwise unavailable, as if no message cache 104 were present.
  • Figure 4 illustrates an exemplary embodiment of a cache concept 400 for a default case, when two instances of a state machine exist (worker and protection), the worker state machine being active, the protection state machine being standby, and each being associated with a cache.
  • This embodiment is a particular realization of a control plane protocol in a particular context; however, the invention is not limited to any particular implementation. In this embodiment, network availability is improved by caching messages.
  • a blade server is a server chassis housing multiple thin, modular electronic circuit boards,
  • Each blade is a server on a card, containing processors, memory, integrated network controllers, and input/output (I/O) ports. Blade servers increasingly allow the inclusion of functions, such as network switches and routers as individual blades.
  • the state machines (SMs) for two such blades are shown in Figure 4: a worker state machine 406 for a worker packet switch (PS) 402 and a protection state machine 408 for a protection PS 404.
  • the worker state machine 406 is initially active and the protection state machine 408 is initially standby and soon to become active.
  • the two instances (active/standby) of the protocol state machine are located on different hardware (e.g., CPUs) but still within the same network node.
  • This embodiment illustrates the worker state machine 406 and the protection state machine 408 for a spanning tree protocol (STP); however, the invention is not limited to any particular protocol.
  • a spanning tree protocol provides a loop free topology for any bridged network.
  • the IEEE standard 802.1D defines STP.
  • the worker PS 402 and protection PS 404 each include a STP state machine 406, 408 for a specific independent bridge partition (IBP) (e.g., one Ethernet switch instance) and timers 416, 412.
  • IBP independent bridge partition
  • a network bridge (a/k/a network switch) connects multiple network segments (e.g., partitions, domains) and forwards traffic from one segment to another.
  • These state machines 406, 408 are in a control plane and create messages for sending to neighboring nodes 106 in the rest of the network 108.
  • a worker cache 410 is interposed between the worker state machine 406 and the network 108.
  • Figure 4 illustrates an initial state where the worker state machine 406 is active, sending/receiving messages to/from the network 108 and storing messages in the worker cache 410.
  • the worker cache 410 stores both the messages sent out 412 and the messages received 414.
  • Bridge protocol data units (BPDUs) are the frames that carry the STP information.
  • a switch sends a BPDU frame using a unique MAC address of a port itself as a source address and a destination address of the STP multicast address.
  • a protection cache 418 is synchronized with the worker cache 410 by cache replication for the protection state machine 408, which is in a warm standby state, waiting to be started.
  • Figure 5 illustrates the exemplary embodiment of the cache concept 400 of Figure 4 for an intermediate state when the worker state machine 406 was active and failed (e.g., software crash), the protection state machine 408 in standby state is recovering (from standby to full operation), but the network state is stable.
  • This intermediate state occurs because there is a delay between the time when the worker state machine 406 fails and the time when the protection state machine 408 is ready (i.e., started after boot-up) to serve the network 108.
  • the protection cache 418 is now the active cache and operates as described for Figure 2.
  • Figure 6 illustrates the exemplary embodiment of the cache concept of
  • Figure 4 when the protection state machine 408 is active and the worker state machine is standby (after a switch over from worker to protection). Comparing Figures 4 and 6, the protection state machine 408 in the scenario illustrated by Figure 6 behaves similarly to the worker state machine 406 in the scenario illustrated by Figure 4, i.e., behaving as the active state machine.
  • the protection cache 418 stores both the messages sent out 420 and the messages received 422 and, thus, operates in the same way as in Figure 4. While the protection state machine 408 is active, messages in the protection cache 418 are replicated to the worker cache 410.
  • Figure 7 is a chart showing selected state transitions and events on a time line for the worker state machine 406, protection state machine 408, and protection cache 418 of Figure 4.
  • Figure 7 illustrates various combinations of states when the protection cache 418 is valid and can be used temporarily to serve the needs of the network 108 and when the protection cache 418 is invalid and cannot be used.
  • Figure 7 illustrates several scenarios. The first scenario is from T1 to T5, the second is from T5 to T9, and the third is from T9 to T12.
  • the first scenario starts at T1.
  • At T1, when the worker state machine 406 is in an active state and the protection state machine 408 is in a synchronizing state, the protection cache 418 is invalid and replicates the worker cache 410.
  • the protection state machine 408 is initially in the synchronizing state, because the protection PS 404 blade has been added to the network element.
  • When synchronization is completed at T2, the protection state machine 408 transitions from synchronizing to standby and the protection cache 418 is ready and inactive.
  • At T4, the protection state machine 408 transitions from starting-up to active and the protection cache 418 is updating (i.e., taking a passive role by continuing to synchronize with the active protocol state machine 408).
  • the worker state machine 406 transitions from synchronizing to standby. After this is done, at T5, the protection state machine 408 is active and the worker state machine 406 is standby.
  • the second scenario starts at T5.
  • At T5, the worker state machine 406 is active, the protection state machine 408 is synchronizing, and the protection cache 418 is invalid.
  • At T6, the protection state machine 408 transitions from synchronizing to standby and the protection cache 418 is ready and inactive.
  • when a network reconfiguration occurs at T7 (e.g., a network element fails), the worker state machine 406 transitions from active to reconfiguring and the protection cache 418 becomes invalid at T7.
  • the worker state machine 406 handles changing state in the network during the interval from T7 to T8. After the network has stabilized at T8, the worker state machine 406 transitions from reconfiguring to active and the protection cache 418 becomes ready and inactive again.
  • the third scenario starts at T9 and differs from the second scenario in the ordering of the events.
  • At T9, the worker state machine 406 is active, the protection state machine 408 is synchronizing, and the protection cache 418 is invalid.
  • a network reconfiguration occurs during the interval from T9 to T11.
  • At T10, the worker state machine 406 transitions from active to reconfiguring.
  • At T11, the protection state machine 408 transitions from synchronizing to standby.
  • the protection cache 418 does not transition from invalid to ready, inactive, until T12, when the worker state machine 406 transitions from reconfiguring to active.
  • each independent bridge partition has its own cache implementation to guarantee independent operations and reconfigurations.
  • each port has a certain port state. Depending on the state of the bridge, PDUs are sent, received, or both.
  • the cache not only remembers the PDUs that are sent or received, but also that no PDUs have to be sent or received. Note that on some ports, PDU sending/receiving will stop at some point during the network convergence process, i.e., the cache is filled only after the network converges.
  • caches are kept in hot-standby mode.
  • caches carry a flag indicating whether they are valid for PDU generation.
  • Various situations may lead to invalidating the cache, e.g., ongoing reconfigurations in the network, provisioning which demands calculation of the spanning tree and changes in BPDUs, etc.
  • the cache on the active PS is updated by incoming and outgoing PDUs.
  • the cache on the standby PS is immediately invalidated in the following conditions: when PDUs provided by the network differ from the cache content, and when PDUs provided by the state machine differ from the cache content. Note that both differences indicate a change in the network, which can only be handled by a working spanning tree state machine. Any replication of outdated PDUs may lead to serious impact on customer traffic and convergence of the spanning tree. For example, loops could be created. Note that it is the cache on the protection (standby) PS that is invalidated in case of an active worker PS. In the case where the worker PS is failing and the protection PS is in transition from standby to active, the protection PS's cache is invalidated. Note that it may be necessary to change all port states to discarding when the cache is invalidated on a just recovering PS.
  • the cache may be declared valid only when the topology has converged.
  • For this, an active state machine is required. Note that the end of the network convergence period can either be signaled by the protocol state machine or be derived from a sufficiently long stable network state. This may require tracking changes in PDUs over several seconds. This adds to the time the system (network) is vulnerable to equipment protection switches, but only after a possibly traffic-affecting network reconfiguration has already happened. Note that after a switch-over and in a stable network, the PDUs generated by the state machine after its recovery will be unchanged relative to those in the cache; i.e., in this situation, the topology can be considered converged when both of the following hold: the cache was active and has been set to inactive by the first PDU sent from the state machine, and all PDUs in the cache have at least once been updated by PDUs from the state machine since the time the cache was deactivated.
  • the cache may be declared valid only when the standby PS is fully synchronized.
  • there is timer triggering of PDU generation from the cache. In the event that the protection PS status changes to active, PDUs are sent from the cache if it is flagged valid. To this end, an appropriate repetition timer (and distribution over the allowed period) is started.
  • the state in which PDUs are created from the cache starts with the activation status, provided the cache is flagged valid. It ends either when different PDUs are received from the network or when the state machine has fully recovered. This can be recognized by the fact that the state machine starts sending PDUs to the network.
  • the first PDU can be used as a trigger to stop the cache activity, because the state machine is capable of sending out all remaining PDUs in the required time interval (a minimal illustrative sketch of this start/stop logic follows at the end of this list).
  • FIG. 8 illustrates an exemplary embodiment of a distributed cache. This example shows how the message cache may be distributed within a system as opposed to a single message cache for a system.
  • the periodic message cache 810 is distributed on two input/output (I/O) packs 802. The number of I/O packs is, of course, not limited to two.
  • Each I/O pack 802 includes packet forwarding hardware 810 and a board controller 808.
  • a local node 804 includes packet forwarding hardware 812 and one or more central packet control plane processors 814.
  • the central packet control plane processor 814 sends updates to the periodic message caches 810 on the board controllers 808 of the I/O packs 802.
  • the periodic message cache 810 sends outgoing periodic messages via packet forwarding hardware 810 in the I/O pack 802.
  • the periodic message caches 810 simulate a control plane protocol, when the control plane state machine is unavailable or fails.
  • Application protocols include any protocols that have periodic outgoing messages with constant contents, such as (R)STP, GVRP, RSVP, open shortest path first (OSPF), intermediate system-to-intermediate system (IS-IS or ISIS), Y.1711 FFD, etc.
  • message caches may be implemented broadly in many other ways for many different system architectures.
  • For example, message caches may be on several hardware blades, on several central processing units (CPUs), on several threads within one CPU, in FPGAs, ASICs, and the like.
  • Embodiments of the present invention may be implemented in one or more computers in a network system.
  • Each computer comprises a processor as well as memory for storing various programs and data.
  • the memory may also store an operating system supporting the programs.
  • the processor cooperates with conventional support circuitry such as power supplies, clock circuits, cache memory, and the like as well as circuits that assist in executing the software routines stored in the memory.
  • the computer also contains input/output (I/O) circuitry that forms an interface between the various functional elements communicating with the computer.
  • Embodiments of the present invention may also be implemented in hardware or firmware, e.g., in FPGAs or ASICs.
  • the present invention may be implemented as a computer program product wherein computer instructions, when processed by a computer, adapt the operation of the computer such that the methods and/or techniques of the present invention are invoked or otherwise provided.
  • Instructions for invoking the inventive methods may be stored in fixed or removable media, transmitted via a data stream in a broadcast media or other signal-bearing medium, and/or stored within a working memory within a computing device operating according to the instructions.
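The start and stop conditions for PDU generation from the cache, described in the items above, can be summarized in the following Python sketch. It is illustrative only: the class and method names are hypothetical and are not taken from the patent, and the logic is reduced to a single validity flag and a single sending flag.

```python
# Illustrative sketch (hypothetical names) of the cache activation rules described
# above: PDU generation from the cache starts when the protection PS becomes
# active and the cache is flagged valid; it stops when a differing PDU arrives
# from the network or when the recovered state machine sends its first PDU.

class CacheActivation:
    def __init__(self, cache_valid: bool) -> None:
        self.cache_valid = cache_valid
        self.sending_from_cache = False

    def on_protection_ps_becomes_active(self) -> None:
        # Start the repetition timer only if the cache may speak for the state machine.
        self.sending_from_cache = self.cache_valid

    def on_pdu_from_network(self, received_pdu: bytes, cached_pdu: bytes) -> None:
        if received_pdu != cached_pdu:
            # A change in the network can only be handled by a working state machine.
            self.cache_valid = False
            self.sending_from_cache = False

    def on_first_pdu_from_state_machine(self) -> None:
        # The recovered state machine takes over PDU generation; stop cache activity.
        self.sending_from_cache = False
```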

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A caching mechanism is provided to prevent packet network reconfiguration and associated traffic loss in case of temporary control plane outages.

Description

UNINTERRUPTED NETWORK CONTROL MESSAGE GENERATION DURING LOCAL NODE OUTAGES
FIELD OF THE INVENTION The present invention generally relates to computer networks. In particular, the present invention relates to packet switching and control plane protocols.
BACKGROUND OF THE INVENTION Packet switching networks include control plane protocols, such as the spanning tree protocol (STP), the generic attribute registration protocol (GARP) and its version for virtual local area networks, the VLAN registration protocol (GVRP), the link aggregation control protocol (LACP), Y.1711 fast failure detection (FFD), and reservation protocol (RSVP) refresh. Control protocols have the responsibility to, for example, control the topology and distribution of how layer 2 (L2) traffic flows through the network. These protocols are realized in the state machines running on each participating network element. Once a stable network configuration has been reached, the protocols tend to repeat the same messages they send to the network. Different messages usually result from an operator or defect driven change in the network. A failure in participating in the protocol by a network element leads to traffic rearrangements once a timeout period ranging from a few milliseconds to a few seconds is exceeded. In some cases, traffic rearrangements involve the entire network. In current network elements, the packet control protocols fall into one of three categories. They are (1 ) unprotected; (2) protected via proprietary communication with the neighbor network elements prior to control plane outages; or (3) protected by standardized graceful restart technology, which requires interaction with neighbor network elements shortly before or after a protocol outage. In the unprotected case, the result will, in general, be that the traffic flow through the network is reconfigured. During the time of reconfiguration, traffic loss will occur in parts of the network that can be as large as the entire network domain. When the failed network element recovers, a
second reconfiguration will occur to re-establish the traffic flow distribution prior to the failure. Again, traffic loss will occur in a similar order of magnitude as before. In the proprietary implementation, there are two disadvantages. First, it covers only part of the problem scenarios, namely those that are voluntarily entered (e.g., in case of an operator driven software upgrade in a network element) and which allows the failing network element to inform its neighbors of the control plane failure to come. Second, it is restricted to interacting network elements that possess these capabilities, i.e., it will not function in general interworking scenarios with other equipment vendors. In the standardized graceful restart case, only a small set of protocols are covered. If time constraints are small for telling neighbor elements after the failure that a graceful restart is to be applied, then the likelihood of missing the constraint for unintended failures is high. Missing the time limit will result in traffic loss, as the neighbor elements will detect the control plane outage and trigger network reconfiguration.
Accordingly, there is a need for a mechanism to prevent packet network reconfiguration and associated traffic loss in case of temporary packet control plane outages.
SUMMARY
Exemplary embodiments of the present invention prevent packet network reconfiguration and associated traffic loss by providing uninterrupted network control message generation during local node outages.
One embodiment is a method for providing uninterrupted network control message generation during local node outages. A message cache receives a number of sent messages from a protocol state machine for a local node and forwards them to other nodes in the network. The message cache also receives messages from the nodes. The message cache stores both the sent and received messages in a buffer. Upon failure of the protocol state machine, the message cache sends messages to and receives messages from the nodes, so long as the buffer remains valid. The messages may be sent periodically to the nodes. The message cache may determine whether the buffer is valid based on the messages in the buffer and messages received from the nodes after the
failure. The method may also include switching to a standby protocol state machine, upon failure of the active protocol state machine, where the standby protocol state machine includes another buffer replicating the first buffer.
Another embodiment is a computer readable medium storing instructions for performing this method for providing uninterrupted network control message generation during local node outages.
Yet another embodiment is a system for providing uninterrupted network control message generation during local node outages, including a protocol state machine and a message cache. The protocol state machine generates messages. The message cache receives the messages from the protocol state machine and forwards them to nodes in the network. The message cache stores both the sent and received messages in one or more buffers. Upon failure of the protocol state machine, the message cache sends messages to and receives messages from the nodes, so long as the message cache remains valid. The message cache may include a timer for sending periodic messages to the nodes and a status control determining whether the message cache is valid. The system may include a worker node and a protection node, each having protocol state machines and message caches so that the protection node is able to become active when the worker node fails. The protection message cache may replicate the worker message cache, while the worker protocol state machine is active.
BRIEF DESCRIPTION OF THE DRAWINGS
The teachings of the present invention can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
Figure 1 is a block diagram illustrating an exemplary embodiment of a cache concept for a default case, when a state machine for a control plane protocol is active; Figure 2 is a block diagram illustrating the exemplary embodiment of the cache concept of Figure 1 for a control plane failure case, when the protocol state machine is unavailable and the network state is stable;
Figure 3 is a block diagram illustrating the exemplary embodiment of the cache concept of Figure 1 for a control plane failure case, when the protocol state machine is unavailable and the network state is unstable;
Figure 4 is a block diagram illustrating an exemplary embodiment of a cache concept for a default case, when two instances of a state machine exist (worker and protection), the worker state machine being active, the protection state machine being standby, and each being associated with a cache;
Figure 5 is a block diagram illustrating the exemplary embodiment of the cache concept of Figure 4 for an intermediate state when the worker state machine was active and failed, the protection state machine in standby state is recovering (from standby to full operation), but the network state is stable;
Figure 6 is a block diagram illustrating the exemplary embodiment of the cache concept of Figure 4 when the protection state machine is active and the worker state machine is standby (after a switch over from worker to protection); Figure 7 is a chart showing selected state transitions and events on a time line for the exemplary embodiment of the cache concept of Figure 4; and
Figure 8 is a block diagram illustrating an exemplary embodiment of a distributed cache.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
DETAILED DESCRIPTION OF THE INVENTION
The description of the present invention is primarily within a general context of packet switched networks and control plane protocols. However, those skilled in the art and informed by the teachings herein will realize that the main concept of the invention is generally applicable to computer networks and may be broadly applied to any network architecture and design, communication protocols, network software, network technologies, network services and applications, and network operations management. Accordingly, the general concepts of the present invention are broadly applicable and are not limited to any particular implementation.
Introduction - L2 Ethernet Example in Conjunction with Equipment Protection
There is a need to maintain a stable network configuration for L2 Ethernet services under the condition of equipment protection switches that will affect the L2 control plane, i.e., spanning tree protocols and link aggregation control protocols, generic attribute registration protocol (GARP) and variants of it, and other protocols. It is possible for a local protection switch to lead to a reconfiguration of the entire spanning tree in the network, if protocol data unit (PDU) distribution is interrupted for about three seconds. This may cause traffic outages of several tens of seconds, until the network converges to a stable state again. Therefore, immediately after a protection switch, it is desirable for a network element to do the following. First, the network element should maintain a stable network if the only cause of instability is the equipment protection switch, i.e., for the case of a single failure (e.g., circuit pack defect) but also for the case of operator driven events such as manual switches. Second, the network element should minimize network impact in case a network is already undergoing a reconfiguration, e.g., due to a remote network element failure, while simultaneously the protection switch is required due to local defect (double failure) or operator commands. Exemplary embodiments of the present invention achieve these goals not only for this L2 Ethernet example, but more broadly for any failure (e.g., hardware defect) causing a temporary unavailability of the local control plane of any network for many protocols.
High-Level Description of the Network Element Behavior
The network element behavior may be described by three states. In the first state, the state machine is fully operable and reacting to all requests. In the second state, the state machine is not available but the cache maintains PDU sending until a change in the network happens, which invalidates the cache, or the state machine becomes operable. In the third state, both the state machine and the cache are not available, e.g., due to an ongoing reconfiguration in the network while the state machine is inoperable, or due to the protocol state machine and cache not being synchronized.
High-Level Cache Concept - STP Example
Exemplary embodiments of the caching concept are derived from the observation that in a stable network, the spanning tree protocol nodes distribute identical PDUs to their neighbors repeatedly. A network defect or network change is detected, if no PDUs have been received by a spanning tree node during three consecutive sending periods or the content of a PDU is different from the preceding PDU. Thus, in an otherwise stable network topology, the activity of a spanning tree protocol machine can be suspended for an indefinite amount of time, as long as the periodic sending of PDUs is maintained. Thus, the caching concept uses this fact so that the network demands for PDUs are satisfied from the cache, without the need for all of the configuration, protocol state machines, and the like being started and synchronized. Thus, the caching concept relieves the demand regarding recovery speed of all software components, except the one operating the cache (which is in hot standby). There are certain times when the cache can be considered valid for PDU sending and other times when the cache needs to be invalidated. Note that within a stable network topology, to some extent, even new services can be established (e.g., forwarding traffic can be modified in terms of new quality of service (QoS) parameters, new customers (distinguished by C-VLANs) can be added to a service provider (802.1 ad) network, etc.).
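As an illustration of the rule just described, the following Python sketch encodes the two conditions under which a spanning tree node would infer a network defect or change. The function and constant names are hypothetical and not taken from the patent; the sketch assumes one cached BPDU per neighbor.

```python
# Illustrative sketch (not part of the original text) of the change-detection rule
# described above: a change is inferred either when no PDU arrives for three
# consecutive sending periods or when a received PDU differs from the preceding one.
from typing import Optional

HELLO_PERIODS_BEFORE_TIMEOUT = 3  # three consecutive sending periods, per the text

def network_change_detected(previous_bpdu: bytes,
                            new_bpdu: Optional[bytes],
                            periods_since_last_bpdu: int) -> bool:
    """Return True if the PDU stream no longer indicates a stable network."""
    if new_bpdu is None:
        # Silence: a change is inferred after three missed sending periods.
        return periods_since_last_bpdu >= HELLO_PERIODS_BEFORE_TIMEOUT
    # A PDU arrived: any difference in content indicates a topology change.
    return new_bpdu != previous_bpdu
```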
High-Level Cache Concept - General
One embodiment includes a control plane and a message cache in a packet switched network. A packet switched network is a network in which messages or fragments of messages (packets) are sent to their destination through the most expedient route, as determined by a routing algorithm. A control plane is a virtual network function used to set up, maintain, and terminate data plane connections. It is virtual in the sense that it is distributed over network nodes that need to interoperate to realize the function. A data plane is a virtual network path used to distribute data between nodes. Some networks may disaggregate control and forwarding planes as well. The term cache refers to any storage managed to take advantage of locality of access. A message cache stores messages. The message cache is instantiated and its
messages are kept in a synchronous state with the messages that the control plane sends/receives to/from the network. In case the control plane fails, the cache satisfies the demands of the network by sending the cached messages. Once the control plane recovers, the cache again follows the control operation and keeps in sync. The cache allows instances of the control plane state machines to fail while still transmitting all the traffic in the network. This concept works in most situations, except for unstable networks, double failures, and systems where the forwarding plane is not independent from the control plane. Unstable networks are those where the traffic flow distribution has not reached a stable state, such as power on scenarios of a network element. Double failures are those scenarios where, in addition to a control plane outage in one network element, other network elements experience defects or operator driven reconfigurations.
The present invention has many advantages, including significantly minimizing traffic loss in failure and software upgrade scenarios affecting the control plane. This gain is achieved locally if the network element supports a cache operation as described. A caching feature in a network element may be added to an existing network. Interoperability with other equipment is possible without the need for the other equipment to support a cache operation. Figure 1 illustrates an exemplary embodiment of a cache concept 100 for a default case, when a state machine 102 for a control plane protocol is active. The control plane protocol may be any kind of protocol, e.g., STP, VLAN registration protocol, LACP, Y.1711 FFD, or RSVP refresh. In a traditional network, the protocol state machine 102 communicates (via intermediate hardware layers) with the neighboring nodes 106 and the rest of the network 108. By contrast, this embodiment includes a message cache 104 interposed between the protocol state machine 102 and the network 108. The protocol state machine 102 sends messages to the message cache 104, which then forwards those messages to the network 108. The message cache 104 captures communication between the protocol state machine 102 and the network by storing both sent messages 110 and received messages 112 in buffers. The message cache 104 also includes a timer 114 and a status control 116. Optionally, the state machine 102 may convey additional state information
to the status control 116 (i.e., in addition to the messages exchanged), depending on the particular protocol to be supported. The contents of the message cache 104 vary depending on the control plane protocol implemented. The message cache 104 stores what is needed to temporarily serve the needs of the network 108 in the case of a failure of the state machine 102.
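One possible organization of the message cache of Figure 1 is sketched below in Python. The sketch is illustrative only: the class, attribute, and method names are hypothetical, and the reference numerals in the comments merely point back to the elements of Figure 1; the patent does not prescribe an implementation.

```python
# Minimal sketch of the message cache of Figure 1 (hypothetical names). The cache
# sits between the protocol state machine and the network and mirrors traffic in
# both directions.
from typing import Callable, Dict

class MessageCache:
    def __init__(self, send_to_network: Callable[[str, bytes], None],
                 resend_interval_s: float = 2.0) -> None:
        self.sent_messages: Dict[str, bytes] = {}      # buffer 110: last PDU sent per port
        self.received_messages: Dict[str, bytes] = {}  # buffer 112: last PDU received per port
        self.valid: bool = True                        # status control 116
        self.resend_interval_s = resend_interval_s     # timer 114 period
        self.send_to_network = send_to_network         # lower-layer transmit hook

    def forward_from_state_machine(self, port: str, pdu: bytes) -> None:
        """Default case (Figure 1): record the PDU and pass it on to the network."""
        self.sent_messages[port] = pdu
        self.send_to_network(port, pdu)

    def record_from_network(self, port: str, pdu: bytes) -> None:
        """Store incoming PDUs so the status control can compare later arrivals."""
        self.received_messages[port] = pdu
```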
Figure 2 illustrates the exemplary embodiment of the cache concept 100 of Figure 1 for a control plane failure case, when the protocol state machine 102 is unavailable and the network state is stable. The message cache 104 protects against situations where the protocol state machine is unavailable for any reason, by temporarily continuing to serve the network. For example, the processor holding the protocol state machine 102 may be rebooting. The message cache 104 generally continues to send messages from the buffers so that neighboring nodes 106 in the network 108 do not become aware that the protocol state machine 102 is unavailable. Communication to the neighboring nodes 106 is mimicked based on information stored in the message cache 104. Thus, the message cache 104 bridges at least a portion of the time that the protocol state machine 102 is unavailable. Protocols that periodically send the same message (e.g., hello message, update message) to the neighboring nodes 106 can easily be mimicked. The message cache 104 uses the timer 114 to send messages stored in the sent messages buffer 110 periodically in the same manner as the protocol state machine 102. As a result, the neighboring nodes 106 do not detect any change in the protocol state machine 102. The message cache 104 receives messages from neighboring nodes 106 and stores them in the received message buffer 112. The message cache 104 is able to detect any event or change (e.g., state change) in the network 108 that would make the message cache 104 invalid by examining the status control 116 and the received messages. The status control 116 determines whether the message cache 104 is valid or invalid. When the message cache 104 becomes invalid, it ceases sending messages because it cannot properly react to the event or change in the network 108.
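Building on the hypothetical MessageCache sketch above, the following sketch illustrates the failure-time behavior described for Figures 2 and 3: while the state machine is unavailable, the cache replays the last sent PDUs on its timer, and a received PDU that differs from the cached one marks the cache invalid so that it ceases sending. Scheduling is reduced to a blocking loop; the names are again assumptions, not the patent's own.

```python
# Sketch of the failure-case behavior (Figures 2 and 3), reusing the hypothetical
# MessageCache above. A real implementation would use the element's timer
# infrastructure instead of a blocking loop.
import time

def serve_network_while_state_machine_down(cache: "MessageCache") -> None:
    """Replay cached PDUs on the hello timer until the cache becomes invalid."""
    while cache.valid:
        for port, pdu in cache.sent_messages.items():
            cache.send_to_network(port, pdu)   # neighbors see no change (Figure 2)
        time.sleep(cache.resend_interval_s)

def on_pdu_received_while_down(cache: "MessageCache", port: str, pdu: bytes) -> None:
    """Status control: a differing PDU means the network changed (Figure 3)."""
    if cache.received_messages.get(port) != pdu:
        cache.valid = False  # cease sending; only a real state machine can react
    cache.received_messages[port] = pdu
```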
The message cache 104 is a simplified component to simulate at least a portion of the protocol state machine 102. An efficient implementation of the message cache 104 probably does not simulate the complete behavior of the
control plane protocol. The degree of simplicity or complexity of the message cache 104 may vary depending on the control plane protocol implemented. For example, the message cache may simulate transition between two or more states of the protocol state machine 102 with logic in the status control 116. The message cache may be implemented in hardware, firmware, or software (e.g., field-programmable gate array (FPGA) or application-specific integrated circuit (ASIC)). The message cache 104 continues to mimic the protocol state machine so long as it remains valid, which may be a short time or the entire time the protocol state machine is unavailable, depending on circumstances. Some protocols require updates in the milliseconds range, while others require updates in the seconds range. This embodiment is not limited to any particular protocol or degree of complexity of the status control logic 116.
Figure 3 illustrates the exemplary embodiment of the cache concept 100 of Figure 1 for a control plane failure case, when the protocol state machine 102 is unavailable and the network state is unstable. In this case, the message cache 104 transitions into an invalid state. Based on the received messages 112, the status control 116 determines that some event occurred, making the network state unstable so that simulation of the protocol state machine 102 by the message cache 104 must stop according to the particular protocol implemented. Once the message cache 104 stops simulating the protocol state machine 102, the neighboring nodes 106 may become aware that the protocol state machine 102 is failed or otherwise unavailable, as if no message cache 104 were present.
Figure 4 illustrates an exemplary embodiment of a cache concept 400 for a default case, when two instances of a state machine exist (worker and protection), the worker state machine being active, the protection state machine being standby, and each being associated with a cache. This embodiment is a particular realization of a control plane protocol in a particular context; however, the invention is not limited to any particular implementation. In this embodiment, network availability is improved by caching messages.
This embodiment is in the context of a blade server (not shown); however, the invention is not limited to any particular hardware. A blade server is a server chassis housing multiple thin, modular electronic circuit boards,
known as server blades. Each blade is a server on a card, containing processors, memory, integrated network controllers, and input/output (I/O) ports. Blade servers increasingly allow the inclusion of functions, such as network switches and routers, as individual blades. The state machines (SMs) for two such blades are shown in Figure 4: a worker state machine 406 for a worker packet switch (PS) 402 and a protection state machine 408 for a protection PS 404. The worker state machine 406 is initially active and the protection state machine 408 is initially standby and soon to become active. The two instances (active/standby) of the protocol state machine are located on different hardware (e.g., CPUs) but still within the same network node.
This embodiment illustrates the worker state machine 406 and the protection state machine 408 for a spanning tree protocol (STP); however, the invention is not limited to any particular protocol. A spanning tree protocol provides a loop free topology for any bridged network. The IEEE standard 802.1D defines STP. The worker PS 402 and protection PS 404 each include an STP state machine 406, 408 for a specific independent bridge partition (IBP) (e.g., one Ethernet switch instance) and timers 416, 412. A network bridge (a/k/a network switch) connects multiple network segments (e.g., partitions, domains) and forwards traffic from one segment to another. These state machines 406, 408 are in a control plane and create messages for sending to neighboring nodes 106 in the rest of the network 108.
In this embodiment, a worker cache 410 is interposed between the worker state machine 406 and the network 108. Figure 4 illustrates an initial state where the worker state machine 406 is active, sending/receiving messages to/from the network 108 and storing messages in the worker cache 410. The worker cache 410 stores both the messages sent out 412 and the messages received 414. Bridge protocol data units (BPDUs) are the frames that carry the STP information. A switch sends a BPDU frame using a unique MAC address of a port itself as a source address and a destination address of the STP multicast address. A protection cache 418 is synchronized with the worker cache 410 by cache replication for the protection state machine 408, which is in a warm standby state, waiting to be started.
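The following sketch, again using the hypothetical MessageCache above, illustrates how the worker cache of Figure 4 could be kept replicated to the protection cache while the worker is active, and how the replicated protection cache would then serve the network during the intermediate state of Figure 5. The replication transport between the two blades is abstracted away; the function names are assumptions.

```python
# Sketch of worker-to-protection cache replication (Figure 4) and takeover
# (Figure 5), using the hypothetical MessageCache above. In practice replication
# would run over an internal channel between the worker and protection CPUs.

def replicate(worker_cache: "MessageCache", protection_cache: "MessageCache") -> None:
    """Keep the standby cache in step with the active one (warm standby)."""
    protection_cache.sent_messages = dict(worker_cache.sent_messages)
    protection_cache.received_messages = dict(worker_cache.received_messages)
    protection_cache.valid = worker_cache.valid

def on_worker_state_machine_failure(protection_cache: "MessageCache") -> None:
    """Intermediate state of Figure 5: the replicated protection cache serves the
    network until the protection state machine has fully started."""
    if protection_cache.valid:
        serve_network_while_state_machine_down(protection_cache)
```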
Figure 5 illustrates the exemplary embodiment of the cache concept 400 of Figure 4 for an intermediate state when the worker state machine 406 was active and failed (e.g., software crash), the protection state machine 408 in standby state is recovering (from standby to full operation), but the network state is stable. This intermediate state occurs because there is a delay between the time when the worker state machine 406 fails and the time when the protection state machine 408 is ready (i.e., started after boot-up) to serve the network 108. During this intermediate state, the protection cache 418 is now the active cache and operates as described for Figure 2. Figure 6 illustrates the exemplary embodiment of the cache concept of
Figure 4 when the protection state machine 408 is active and the worker state machine is standby (after a switch over from worker to protection). Comparing Figures 4 and 6, the protection state machine 408 in the scenario illustrated by Figure 6 behaves similarly to the worker state machine 406 in the scenario illustrated by Figure 4, i.e., behaving as the active state machine. The protection cache 418 stores both the messages sent out 420 and the messages received 422 and, thus, operates in the same way as in Figure 4. While the protection state machine 408 is active, messages in the protection cache 418 are replicated to the worker cache 410. Figure 7 is a chart showing selected state transitions and events on a time line for the worker state machine 406, protection state machine 408, and protection cache 418 of Figure 4. (Table 1 below describes Figure 7 in tabular form.) Figure 7 illustrates various combinations of states when the protection cache 418 is valid and can be used temporarily to serve the needs of the network 108 and when the protection cache 418 is invalid and cannot be used. Figure 7 illustrates several scenarios. The first scenario is from T1 to T5, the second is from T5 to T9, and the third is from T9 to T12.
The first scenario starts at T1. At T1, when the worker state machine 406 is in an active state and the protection state machine 408 is in a synchronizing state, the protection cache 418 is invalid and replicates the worker cache 410. For example, the protection state machine 408 is initially in the synchronizing state, because the protection PS 404 blade has been added to the network element. When synchronization is completed at T2, the protection state
machine 408 transitions from synchronizing to standby and the protection cache 418 is ready and inactive. When a failure occurs at T3, the worker state machine 406 transitions from active to failed, the protection state machine 408 transitions from standby to starting-up (i.e., preparing to take over the active role), and the protection cache 418 is ready and sending (i.e., temporarily serving the needs of the network 108). During the interval from T3 onwards, the worker state machine 406 transitions from failed to synchronizing (e.g., as a consequence of a reboot). The exact times do not matter for the anticipated behavior of the network element. They depend on the implementation and, thus, are not shown explicitly. At T4, the protection state machine 408 transitions from starting-up to active and the protection cache 418 is updating (i.e., taking a passive role by continuing to synchronize with the active protocol state machine 408). During the interval from T3 onwards, the worker state machine 406 transitions from synchronizing to standby. After this is done, at T5, the protection state machine 408 is active and the worker state machine 406 is standby.
The second scenario starts at T5. At T5, the worker state machine 406 is active, the protection state machine 408 is synchronizing, and the protection cache 418 is invalid. At T6, the protection state machine 408 transitions from synchronizing to standby and the protection cache 418 is ready and inactive. When a network reconfiguration occurs at T7 (e.g., a network element fails), the worker state machine 406 transitions from active to reconfiguring and the protection cache 418 becomes invalid at T7. During the interval from T7 to T8, the worker state machine 406 handles changing state in the network. After the network has stabilized at T8, the worker state machine 406 transitions from reconfiguring to active and the protection cache 418 becomes ready and inactive again.
The third scenario starts at T9 and differs from the second scenario in the ordering of the events. At T9, the worker state machine 406 is active, the protection state machine 408 is synchronizing, and the protection cache 418 is invalid. A network reconfiguration occurs during the interval from T9 to T11. At T10, the worker state machine 406 transitions from active to reconfiguring. At T11, the protection state machine 408 transitions from synchronizing to standby.
The protection cache 418 does not transition from invalid to ready and inactive until T12, when the worker state machine 406 transitions from reconfiguring to active.
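The cache states traversed in the three Figure 7 scenarios can be summarized in a short Python sketch. The state names and event strings below are assumptions chosen for readability, not terms defined by the specification; the transition table simply mirrors the narrative above.

```python
from enum import Enum, auto


class CacheState(Enum):
    INVALID = auto()          # replicating only; not usable for PDU generation
    READY_INACTIVE = auto()   # synchronized and usable, but not sending
    READY_SENDING = auto()    # temporarily serving the network after a failure
    UPDATING = auto()         # passive again, resynchronizing with the active PS


def on_event(state: CacheState, event: str) -> CacheState:
    """Illustrative transitions matching the Figure 7 narrative."""
    transitions = {
        (CacheState.INVALID, "sync_complete"): CacheState.READY_INACTIVE,         # T2, T6
        (CacheState.READY_INACTIVE, "worker_failed"): CacheState.READY_SENDING,   # T3
        (CacheState.READY_SENDING, "protection_active"): CacheState.UPDATING,     # T4
        (CacheState.READY_INACTIVE, "network_reconfiguration"): CacheState.INVALID,  # T7
        (CacheState.INVALID, "network_stable"): CacheState.READY_INACTIVE,        # T8, T12
    }
    return transitions.get((state, event), state)
```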
Table 1. Description of PS state machine and cache states
In one embodiment, there is one cache instance per independent bridge partition. Each independent bridge partition has its own cache implementation to guarantee independent operations and reconfigurations.
In one embodiment, there are two cache entries per port: one for the incoming PDU and one for the outgoing PDU. Each port has a certain port state. Depending on the state of the bridge, PDUs are sent, received, or both. The cache not only remembers the PDUs that are sent or received, but also that no PDUs have to be sent or received. Note that on some ports PDU
sending/receiving will stop at some point during the network convergence process, i.e., the cache is filled only after the network converges. In one embodiment, caches are kept in hot-standby mode. In one embodiment, caches carry a flag indicating whether they are valid for PDU generation. Various situations may lead to invalidating the cache, e.g., ongoing reconfigurations in the network, provisioning which demands calculation of the spanning tree and changes in BPDUs, etc.
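A point worth illustrating is that the per-port entries must distinguish "no PDU is exchanged on this port" from "nothing has been learned yet". The following minimal Python sketch is one way to encode that distinction; the sentinel value and the is_complete helper are assumptions made for the example, not elements of the specification.

```python
from dataclasses import dataclass
from typing import Optional

# Sentinel meaning "no PDU has to be sent/received on this port" -- deliberately
# distinct from None, which here means "nothing has been learned for this port yet".
NO_PDU = b""


@dataclass
class PortEntry:
    outgoing: Optional[bytes] = None   # BPDU bytes, NO_PDU, or not yet known
    incoming: Optional[bytes] = None

    def is_complete(self) -> bool:
        # The entry is usable once both directions have been decided, even if
        # the decision is that no PDU is sent or received on this port.
        return self.outgoing is not None and self.incoming is not None
```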
In one embodiment, the cache on the active PS is updated by incoming and outgoing PDUs. In one embodiment, the cache on the standby PS is immediately invalidated in the following conditions: when PDUs provided by the network differ from the cache content and when locally generated PDUs differ from the cache content. Note that both differences indicate a change in the network, which can only be handled by a working spanning tree state machine. Any replication of outdated PDUs may have a serious impact on customer traffic and on convergence of the spanning tree. For example, loops could be created. Note that it is the cache on the protection (standby) PS that is invalidated in the case of an active worker PS. In the case where the worker PS is failing and the protection PS is in transition from standby to active, the protection PS's cache is invalidated. Note that it may be necessary to change all port states to discarding when the cache is invalidated on a just-recovering PS.
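A minimal sketch of the invalidation rule, assuming a duck-typed cache object with a valid flag and a caller-supplied callback that forces a port to the discarding state; the function name and signature are illustrative only.

```python
def check_and_invalidate(cache, port, observed_pdu, cached_pdu, set_discarding):
    """Invalidate the cache when an observed PDU differs from its cached counterpart.

    Replaying outdated BPDUs could create loops, so any mismatch marks the
    cache invalid; on a just-recovering PS the port may also be blocked.
    """
    if cached_pdu is not None and observed_pdu != cached_pdu:
        cache.valid = False
        set_discarding(port)  # block traffic until the state machine reconverges
        return False
    return True
```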
In one embodiment, the cache may be declared valid only when the topology has converged. During the convergence process, an active state machine is required. Note that the end of the network convergence period can either be signaled by the protocol state machine or be derived from a sufficiently long stable network state. This may require tracking changes in PDUs over several seconds. This adds to the time the system (network) is vulnerable to equipment protection switches, but only after a possibly traffic-affecting network configuration has already happened. Note that after a switch-over and in a stable network, the PDUs generated by the state machine after its recovery will be unchanged from those in the cache; i.e., in this situation, the topology can be considered converged when both of the following hold: the cache was active and was set to inactive by the first PDU sent from the state machine, and all PDUs in the cache
have been updated at least once by PDUs from the state machine since the time the cache was deactivated.
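The "sufficiently long stable network state" criterion can be sketched as a small tracker. This is only an illustration: the 5-second hold time, the class name, and the per-port bookkeeping are assumptions, and in practice the end of convergence could instead be signaled directly by the protocol state machine.

```python
import time
from typing import Set


class ConvergenceTracker:
    """Derives the end of the convergence period from a stable network state.

    The cache is declared valid only once no PDU has changed for hold_time
    seconds and every port's cached PDU has been refreshed at least once.
    """

    def __init__(self, hold_time: float = 5.0):
        self.hold_time = hold_time
        self.last_change = time.monotonic()
        self.refreshed_ports: Set[int] = set()

    def pdu_observed(self, port: int, changed: bool) -> None:
        self.refreshed_ports.add(port)
        if changed:
            # Any change restarts the stability window and the refresh bookkeeping.
            self.last_change = time.monotonic()
            self.refreshed_ports.clear()

    def converged(self, all_ports: Set[int]) -> bool:
        stable = (time.monotonic() - self.last_change) >= self.hold_time
        refreshed = all_ports <= self.refreshed_ports
        return stable and refreshed
```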
In one embodiment, the cache may be declared valid only when the standby PS is fully synchronized. In one embodiment, there is timer triggering of PDU generation from the cache. In the event that the protection PS status changes to active, PDUs are sent from the cache if it is flagged valid. To this end, an appropriate repetition timer (and distribution over the allowed period) is started. The state in which PDUs are created from the cache starts with the activation status, provided the cache is flagged valid. It ends either when different PDUs are received from the network or when the state machine has fully recovered. This can be recognized by the fact that the state machine starts sending PDUs to the network. The first PDU can be used as a trigger to stop the cache activity, because the state machine is capable of sending out all remaining PDUs in the required time interval.
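The timer-triggered replay can be sketched as follows. This is a simplified illustration: the cache is assumed to expose valid and active flags and the per-port entries from the earlier sketch, the send callback stands in for the packet forwarding path, and the 2-second default interval merely reflects the common STP hello time rather than anything mandated by the specification.

```python
import threading


def start_cache_replay(cache, send, interval: float = 2.0) -> None:
    """Periodically resend cached outgoing PDUs while no state machine is active.

    Replay stops when the cache is invalidated (differing PDUs arrive from the
    network) or when the recovered state machine sends its first PDU, at which
    point the caller clears cache.active instead of rearming the timer.
    """
    def tick() -> None:
        if not (cache.valid and cache.active):
            return  # stop condition reached; do not rearm the timer
        for port, entry in cache.ports.items():
            if entry.sent_pdu is not None:
                send(port, entry.sent_pdu)
        threading.Timer(interval, tick).start()

    tick()
```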
Figure 8 illustrates an exemplary embodiment of a distributed cache. This example shows how the message cache may be distributed within a system as opposed to a single message cache for a system. In this example, the periodic message cache 810 is distributed on two input/output (I/O) packs 802. The number of I/O packs is, of course, not limited to two. Each I/O pack 802 includes packet forwarding hardware 810 and a board controller 808. A local node 804 includes packet forwarding hardware 812 and one or more central packet control plane processors 814. The central packet control plane processor 814 sends updates to the periodic message caches 810 on the board controllers 808 of the I/O packs 802. The periodic message cache 810 sends outgoing periodic messages via packet forwarding hardware 810 in the I/O pack 802. In this way, the periodic message caches 810 simulate a control plane protocol when the control plane state machine is unavailable or fails. Application protocols include any protocols that have periodic outgoing messages with constant contents, such as (R)STP, GVRP, RSVP, open shortest path first (OSPF), intermediate system-to-intermediate system (IS-IS or ISIS), Y.1711 FFD, etc. Of course, message caches may be implemented broadly in many other ways for many different system architectures. For
example, message caches may be on several hardware blades, on several computer processing units (CPUs), on several threads within one CPU, in FPGAs, ASICs and the like.
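The distribution of cache updates from the central control plane processor to the per-pack caches can be sketched as below. The class names and the direct method call are assumptions made to keep the example self-contained; in an actual system the update would travel over an IPC or backplane channel to the board controllers.

```python
from typing import Dict, List


class BoardCache:
    """Periodic message cache held on an I/O pack's board controller."""

    def __init__(self) -> None:
        self.outgoing: Dict[int, bytes] = {}  # port -> last periodic PDU to replay

    def apply_update(self, port: int, pdu: bytes) -> None:
        self.outgoing[port] = pdu


class CentralControlPlane:
    """Central packet control plane processor pushing cache updates to packs."""

    def __init__(self, packs: List[BoardCache]) -> None:
        self.packs = packs

    def publish(self, pack_index: int, port: int, pdu: bytes) -> None:
        # Direct call stands in for the backplane message to the board controller.
        self.packs[pack_index].apply_update(port, pdu)
```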
Embodiments of the present invention may be implemented in one or more computers in a network system. Each computer comprises a processor as well as memory for storing various programs and data. The memory may also store an operating system supporting the programs. The processor cooperates with conventional support circuitry such as power supplies, clock circuits, cache memory, and the like as well as circuits that assist in executing the software routines stored in the memory. As such, it is contemplated that some of the steps discussed herein as software methods may be implemented within hardware, for example, as circuitry that cooperates with the processor to perform various method steps. The computer also contains input/output (I/O) circuitry that forms an interface between the various functional elements communicating with the computer. Embodiments of the present invention may also be implemented in hardware or firmware, e.g., in FPGAs or ASICs.
The present invention may be implemented as a computer program product wherein computer instructions, when processed by a computer, adapt the operation of the computer such that the methods and/or techniques of the present invention are invoked or otherwise provided. Instructions for invoking the inventive methods may be stored in fixed or removable media, transmitted via a data stream in a broadcast media or other signal-bearing medium, and/or stored within a working memory within a computing device operating according to the instructions. While the foregoing is directed to various embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. As such, the appropriate scope of the invention is to be determined according to the claims, which follow.

Claims

What is claimed is:
1. A method for providing uninterrupted network control message generation during local node outages, comprising:
receiving a plurality of sent messages from a protocol state machine;
forwarding the sent messages to a plurality of nodes in a network;
receiving a plurality of received messages from the nodes;
storing the sent and received messages in a buffer; and
sending messages to and receiving messages from the nodes, upon failure of the protocol state machine, so long as the buffer remains valid.
2. The method of claim 1, wherein the messages are sent periodically to the nodes.
3. The method of claim 1, further comprising:
determining whether the buffer is valid based on the sent and received messages in the buffer and messages received from the nodes after the failure.
4. The method of claim 1, further comprising:
switching to a standby protocol state machine, upon failure of the protocol state machine, the standby protocol state machine including another buffer including replicas of the sent and received messages.
5. A system for providing uninterrupted network control message generation during local node outages, comprising:
a protocol state machine for generating a plurality of messages;
a message cache for receiving the messages from the protocol state machine and forwarding them to a plurality of nodes in a network, the message cache storing both messages sent to the nodes and messages received from the nodes in at least one buffer;
wherein the message cache sends messages to and receives messages from the nodes, upon failure of the protocol state machine, so long as the message cache remains valid.
6. The system of claim 5, wherein the message cache includes a timer for sending periodic messages to the nodes.
7. The system of claim 5, wherein the message cache includes a status control for determining whether the message cache is valid.
8. The system of claim 7, wherein the protocol state machine is a worker protocol state machine, the message cache is a worker message cache, and a worker node includes the worker protocol state machine and the worker message cache; and further comprising:
a protection node including a protection protocol state machine and a protection message cache;
wherein the protection state machine is able to become active upon failure of the worker protocol state machine.
9. The system of claim 8, wherein the protection message cache replicates the worker message cache, while the worker protocol state machine is active.
10. A computer readable medium storing instructions for performing a method for providing uninterrupted network control message generation during local node outages, the method comprising:
receiving a plurality of sent messages from a protocol state machine;
forwarding the sent messages to a plurality of nodes in a network;
receiving a plurality of received messages from the nodes;
storing the sent and received messages in a buffer; and
sending messages to and receiving messages from the nodes, upon failure of the protocol state machine, so long as the message cache remains valid.
PCT/US2006/020681 2006-05-30 2006-05-30 Uninterrupted network control message generation during local node outages WO2007139542A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
KR1020087029207A KR101017540B1 (en) 2006-05-30 2006-05-30 Uninterrupted network control message generation during local node outages
CNA2006800547591A CN101461196A (en) 2006-05-30 2006-05-30 Uninterrupted network control message generation during local node outages
PCT/US2006/020681 WO2007139542A1 (en) 2006-05-30 2006-05-30 Uninterrupted network control message generation during local node outages
EP06771449A EP2030378A4 (en) 2006-05-30 2006-05-30 Uninterrupted network control message generation during local node outages
JP2009513106A JP2009539305A (en) 2006-05-30 2006-05-30 Uninterrupted network control message generation during local node outage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2006/020681 WO2007139542A1 (en) 2006-05-30 2006-05-30 Uninterrupted network control message generation during local node outages

Publications (1)

Publication Number Publication Date
WO2007139542A1 true WO2007139542A1 (en) 2007-12-06

Family

ID=38778944

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/020681 WO2007139542A1 (en) 2006-05-30 2006-05-30 Uninterrupted network control message generation during local node outages

Country Status (5)

Country Link
EP (1) EP2030378A4 (en)
JP (1) JP2009539305A (en)
KR (1) KR101017540B1 (en)
CN (1) CN101461196A (en)
WO (1) WO2007139542A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120102482A1 (en) * 2009-06-25 2012-04-26 Zte Corporation Method for Communication System Service Upgrade and Upgrade Container Device Thereof
EP2798801A4 (en) * 2011-12-28 2015-05-20 Hangzhou H3C Tech Co Ltd Graceful restart (gr) methods and devices
WO2015106822A1 (en) * 2014-01-17 2015-07-23 Nokia Solutions And Networks Management International Gmbh Controlling of communication network comprising virtualized network functions
US9860336B2 (en) 2015-10-29 2018-01-02 International Business Machines Corporation Mitigating service disruptions using mobile prefetching based on predicted dead spots
CN109889367A (en) * 2019-01-04 2019-06-14 烽火通信科技股份有限公司 The method and system of LACP NSR are realized in the distributed apparatus for not supporting NSR
US10534598B2 (en) 2017-01-04 2020-01-14 International Business Machines Corporation Rolling upgrades in disaggregated systems
US11153164B2 (en) 2017-01-04 2021-10-19 International Business Machines Corporation Live, in-line hardware component upgrades in disaggregated systems

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5728783B2 (en) * 2011-04-25 2015-06-03 株式会社オー・エフ・ネットワークス Transmission apparatus and transmission system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020089990A1 (en) 2001-01-11 2002-07-11 Alcatel Routing system providing continuity of service for the interfaces associated with neighboring networks
US20040087304A1 (en) * 2002-10-21 2004-05-06 Buddhikot Milind M. Integrated web cache
US6757248B1 (en) * 2000-06-14 2004-06-29 Nokia Internet Communications Inc. Performance enhancement of transmission control protocol (TCP) for wireless network applications
US20050201375A1 (en) 2003-01-14 2005-09-15 Yoshihide Komatsu Uninterrupted transfer method in IP network in the event of line failure
US20050243722A1 (en) * 2004-04-30 2005-11-03 Zhen Liu Method and apparatus for group communication with end-to-end reliability
US7050187B1 (en) * 2000-04-28 2006-05-23 Texas Instruments Incorporated Real time fax-over-packet packet loss compensation

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11136258A (en) * 1997-11-04 1999-05-21 Fujitsu Ltd Cell read synchronization control method
JP4021841B2 (en) * 2003-10-29 2007-12-12 富士通株式会社 Control packet processing apparatus and method in spanning tree protocol
JP3932994B2 (en) * 2002-06-25 2007-06-20 株式会社日立製作所 Server handover system and method
JP2005341282A (en) * 2004-05-27 2005-12-08 Nec Corp System changeover system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7050187B1 (en) * 2000-04-28 2006-05-23 Texas Instruments Incorporated Real time fax-over-packet packet loss compensation
US6757248B1 (en) * 2000-06-14 2004-06-29 Nokia Internet Communications Inc. Performance enhancement of transmission control protocol (TCP) for wireless network applications
US20020089990A1 (en) 2001-01-11 2002-07-11 Alcatel Routing system providing continuity of service for the interfaces associated with neighboring networks
US20040087304A1 (en) * 2002-10-21 2004-05-06 Buddhikot Milind M. Integrated web cache
US20050201375A1 (en) 2003-01-14 2005-09-15 Yoshihide Komatsu Uninterrupted transfer method in IP network in the event of line failure
US20050243722A1 (en) * 2004-04-30 2005-11-03 Zhen Liu Method and apparatus for group communication with end-to-end reliability

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP2030378A4

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2448175A1 (en) * 2009-06-25 2012-05-02 ZTE Corporation Method for communication system service upgrade and upgrade container device thereof
EP2448175A4 (en) * 2009-06-25 2012-11-28 Zte Corp Method for communication system service upgrade and upgrade container device thereof
US20120102482A1 (en) * 2009-06-25 2012-04-26 Zte Corporation Method for Communication System Service Upgrade and Upgrade Container Device Thereof
US9225590B2 (en) 2011-12-28 2015-12-29 Hangzhou H3C Technologies Co., Ltd. Graceful restart (GR) methods and devices
EP2798801A4 (en) * 2011-12-28 2015-05-20 Hangzhou H3C Tech Co Ltd Graceful restart (gr) methods and devices
KR101954314B1 (en) 2014-01-17 2019-03-05 노키아 솔루션스 앤드 네트웍스 게엠베하 운트 코. 카게 Controlling of communication network comprising virtualized network functions
WO2015106822A1 (en) * 2014-01-17 2015-07-23 Nokia Solutions And Networks Management International Gmbh Controlling of communication network comprising virtualized network functions
US20160344587A1 (en) 2014-01-17 2016-11-24 Nokia Solutions And Networks Management International Gmbh Controlling of communication network comprising virtualized network functions
US10581677B2 (en) 2014-01-17 2020-03-03 Nokia Solutions And Networks Gmbh & Co. Kg Controlling of communication network comprising virtualized network functions
KR20180023068A (en) * 2014-01-17 2018-03-06 노키아 솔루션스 앤드 네트웍스 게엠베하 운트 코. 카게 Controlling of communication network comprising virtualized network functions
KR20180023069A (en) * 2014-01-17 2018-03-06 노키아 솔루션스 앤드 네트웍스 게엠베하 운트 코. 카게 Controlling of communication network comprising virtualized network functions
KR101868918B1 (en) * 2014-01-17 2018-07-20 노키아 솔루션스 앤드 네트웍스 게엠베하 운트 코. 카게 Controlling of communication network comprising virtualized network functions
US10652088B2 (en) 2014-01-17 2020-05-12 Nokia Solutions And Networks Gmbh & Co. Kg Controlling of communication network comprising virtualized network functions
KR101954310B1 (en) * 2014-01-17 2019-03-05 노키아 솔루션스 앤드 네트웍스 게엠베하 운트 코. 카게 Controlling of communication network comprising virtualized network functions
US10652089B2 (en) 2014-01-17 2020-05-12 Nokia Solutions And Networks Gmbh & Co. Kg Controlling of communication network comprising virtualized network functions
US10432458B2 (en) 2014-01-17 2019-10-01 Nokia Solutions And Networks Gmbh & Co. Kg Controlling of communication network comprising virtualized network functions
KR102061655B1 (en) * 2014-01-17 2020-01-02 노키아 솔루션스 앤드 네트웍스 게엠베하 운트 코. 카게 Controlling of communication network comprising virtualized network functions
KR102061661B1 (en) 2014-01-17 2020-01-02 노키아 솔루션스 앤드 네트웍스 게엠베하 운트 코. 카게 Controlling of communication network comprising virtualized network functions
KR20160110476A (en) * 2014-01-17 2016-09-21 노키아 솔루션스 앤드 네트웍스 매니지먼트 인터내셔널 게엠베하 Controlling of communication network comprising virtualized network functions
US9860336B2 (en) 2015-10-29 2018-01-02 International Business Machines Corporation Mitigating service disruptions using mobile prefetching based on predicted dead spots
US10534598B2 (en) 2017-01-04 2020-01-14 International Business Machines Corporation Rolling upgrades in disaggregated systems
US10970061B2 (en) 2017-01-04 2021-04-06 International Business Machines Corporation Rolling upgrades in disaggregated systems
US11153164B2 (en) 2017-01-04 2021-10-19 International Business Machines Corporation Live, in-line hardware component upgrades in disaggregated systems
CN109889367A (en) * 2019-01-04 2019-06-14 烽火通信科技股份有限公司 The method and system of LACP NSR are realized in the distributed apparatus for not supporting NSR
CN109889367B (en) * 2019-01-04 2021-08-03 烽火通信科技股份有限公司 Method and system for realizing LACP NSR in distributed equipment not supporting NSR

Also Published As

Publication number Publication date
KR101017540B1 (en) 2011-02-28
EP2030378A4 (en) 2010-01-27
EP2030378A1 (en) 2009-03-04
KR20090016676A (en) 2009-02-17
CN101461196A (en) 2009-06-17
JP2009539305A (en) 2009-11-12

Similar Documents

Publication Publication Date Title
KR101099822B1 (en) Redundant routing capabilities for a network node cluster
US7304940B2 (en) Network switch assembly, network switching device, and method
US7453797B2 (en) Method to provide high availability in network elements using distributed architectures
US8873377B2 (en) Method and apparatus for hitless failover in networking systems using single database
US6941487B1 (en) Method, system, and computer program product for providing failure protection in a network node
US7269133B2 (en) IS-IS high availability design
WO2007139542A1 (en) Uninterrupted network control message generation during local node outages
US7417947B1 (en) Routing protocol failover between control units within a network router
JP4021841B2 (en) Control packet processing apparatus and method in spanning tree protocol
US20110134931A1 (en) Virtual router migration
US20050050136A1 (en) Distributed and disjoint forwarding and routing system and method
JPH11154979A (en) Multiplexed router
JP2005503055A (en) Method and system for implementing OSPF redundancy
JP5941404B2 (en) Communication system, path switching method, and communication apparatus
WO2011157151A2 (en) Method, device and system for realizing disaster-tolerant backup
WO2011120423A1 (en) System and method for communications system routing component level high availability
JP2006246152A (en) Packet transfer apparatus, packet transfer network, and method for transferring packet
US7184394B2 (en) Routing system providing continuity of service for the interfaces associated with neighboring networks
CN113992571B (en) Multipath service convergence method, device and storage medium in SDN network
US11979286B1 (en) In-service software upgrade in a virtual switching stack
KR100917603B1 (en) Routing system with distributed structure and control method for non-stop forwarding thereof
JP2015138987A (en) Communication system and service restoration method in communication system

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200680054759.1

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06771449

Country of ref document: EP

Kind code of ref document: A1

DPE2 Request for preliminary examination filed before expiration of 19th month from priority date (pct application filed from 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2006771449

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2009513106

Country of ref document: JP

Ref document number: 1020087029207

Country of ref document: KR

NENP Non-entry into the national phase

Ref country code: DE