WO2018046986A1 - Techniques for efficient forwarding information base reconstruction using point of harvest identifiers - Google Patents

Techniques for efficient forwarding information base reconstruction using point of harvest identifiers

Info

Publication number
WO2018046986A1
Authority
WO
WIPO (PCT)
Prior art keywords
util
nodes
node
trie
hybrid
Application number
PCT/IB2016/055408
Other languages
French (fr)
Inventor
Matias CAVUOTI
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/IB2016/055408 priority Critical patent/WO2018046986A1/en
Publication of WO2018046986A1 publication Critical patent/WO2018046986A1/en

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00 - Routing or path finding of packets in data switching networks
    • H04L 45/02 - Topology update or discovery
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/903 - Querying
    • G06F 16/90335 - Query processing
    • G06F 16/90344 - Query processing by using string matching techniques
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00 - Routing or path finding of packets in data switching networks
    • H04L 45/54 - Organization of routing tables
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 45/00 - Routing or path finding of packets in data switching networks
    • H04L 45/74 - Address processing for routing
    • H04L 45/745 - Address table lookup; Address filtering
    • H04L 45/748 - Address table lookup; Address filtering using longest matching prefix
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 49/00 - Packet switching elements
    • H04L 49/15 - Interconnection of switching modules
    • H04L 49/1515 - Non-blocking multistage, e.g. Clos

Definitions

  • Embodiments relate to the field of computer networking; and more specifically, to techniques for efficient forwarding information base reconstruction using Point of Harvest identifiers.
  • a forwarding information database is also referred to as a Forwarding Information Base, or "FIB".
  • Embodiments disclosed herein can efficiently manage a FIB by identifying certain subsets of entries of the FIB that can be re-utilized without modification when performing operations on the FIB instead of completely rebuilding large subtrees of the FIB.
  • some embodiments utilize techniques where (potentially massively) parallel systems can safely operate in parallel to even further increase performance during FIB updates, without the benefit of special-purpose FIB hardware, and while maintaining deterministic lookup times in the FIB.
  • a method in a packet forwarder implemented by a device for efficiently reconstructing a forwarding information base (FIB) to reflect a new, changed, or deleted route of a communications network includes determining, by the packet forwarder, that an update to the FIB utilized by the packet forwarder to make forwarding decisions is to be performed to reflect the new, changed, or deleted route.
  • the FIB comprises a data structure having a plurality of levels.
  • the data structure includes one or more hybrid nodes - each acting as a root of a sub-tree of the data structure and each including a pointer to a child array of nodes at a next level of the plurality of levels.
  • the method also includes updating a control trie to reflect the new, changed, or deleted route by adding, removing, or modifying at least one external node in the control trie.
  • the control trie includes one or more external nodes each indicating routing information for one or more of a plurality of routes of the network.
  • the method also includes identifying, within a util trie, a util node that corresponds to a hybrid node of the FIB serving as a root of a subtree that is to be updated due to the new, changed, or deleted route.
  • the util trie also has the same plurality of levels, and includes a plurality of util nodes. Each of the plurality of util nodes is located at a boundary of one of the plurality of levels.
  • the method also includes obtaining a point of harvest (POH) identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie.
  • POH point of harvest
  • Each POH identifier identifies, for the corresponding descendant util node, one of the nodes within the control trie that is at a corresponding location within the control trie or is a next node within the control trie beneath the corresponding location within the control trie.
  • the method also includes obtaining, via a harvesting process, data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node.
  • the data further includes a POH identifier for each of the one or more nodes that are hybrid nodes.
  • the method also includes, responsive to determining that the POH identifier of one of the one or more nodes obtained from the harvesting process matches the obtained POH identifier of one of the one or more immediate descendant util nodes, reusing the hybrid node from the child array while reconstructing the child array instead of regenerating the hybrid node and any of its descendant nodes.
  • the method further includes, after the identifying of the util node, inserting a dirty node in a dirty util trie at a corresponding location of the dirty util trie as the location of the identified util node in the util trie and, at a later point in time, traversing the dirty util trie in a top-down breadth-first manner to identify those of the util nodes needing to have their corresponding child arrays reconstructed.
  • the dirty node comprises a pointer to the identified util node of the util trie.
  • obtaining the POH identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie comprises caching, within the identified util node, each of the one or more immediate descendant util nodes.
  • Each of the one or more immediate descendant util nodes stores its corresponding POH identifier.
  • reconstructing the child array includes determining that one or more of the hybrid nodes of the child array can be reused, and generating a second child array, including copying each of the one or more of the hybrid nodes that can be reused to the second child array.
  • reconstructing the child array further includes updating a pointer from the hybrid node corresponding to the identified util node to point to the second child array instead of the child array.
  • at least one of the copied one or more of the hybrid nodes is placed at a different index within the second child array compared to its index within the child array, but in some embodiments, all of the copied one or more of the hybrid nodes are placed at the same indices within the second child array as their indices within the child array.
  • the method further includes updating the POH identifier of one or more of the util nodes of the util trie responsive to the update of the control trie.
  • the control trie stores route information for the plurality of routes and is indexed by a routing prefix of a route
  • the control trie further includes one or more split nodes each identifying one or more bit locations of the routing prefix that can be utilized to determine how to traverse the control trie
  • the FIB further includes one or more leaf nodes that collectively store forwarding information for the plurality of routes.
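  • The description above defines these structures functionally rather than as code. Purely as an illustration, the following C declarations sketch one possible layout for the three cooperating structures (FIB, control trie, util trie) and the POH identifiers; every type and field name here is hypothetical, as the patent does not prescribe a concrete representation.

      /* Hypothetical C sketch of the described data structures. */
      #include <stdint.h>

      enum fib_node_type { FIB_HYBRID, FIB_LEAF, FIB_POP };

      /* FIB node: a hybrid node roots a sub-tree of the FIB and points at a
       * child array holding the nodes of the next level. */
      struct fib_node {
          enum fib_node_type type;
          int pushdflt;               /* hybrid: child array holds a pushdflt
                                         leaf (see the traversal discussion
                                         later in the description)            */
          struct fib_node *children;  /* hybrid: child array, 2^stride entries */
          const void *fwd_info;       /* leaf: forwarding information          */
      };

      /* Control trie node: external nodes carry routing information; split
       * nodes name the bit at which inserted prefixes diverge. */
      struct ctrie_node {
          uint16_t split_bit;
          int is_external;
          const void *route_info;
          struct ctrie_node *left, *right;
      };

      /* Util node: sits on a stride boundary, mirrors one hybrid node of the
       * FIB, and records its point of harvest (POH) in the control trie. */
      struct util_node {
          struct fib_node *hybrid;    /* the mirrored hybrid node             */
          struct ctrie_node *poh;     /* POH: control node at (or next below)
                                         this util node's location            */
          struct util_node *left, *right;
      };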
  • a non-transitory machine readable medium provides instructions which, when executed by a processor of a device, will cause the device to implement a packet forwarder to perform operations for efficiently reconstructing a forwarding information base (FIB) to reflect a new, changed, or deleted route of a communications network.
  • the operations include determining that an update to the FIB utilized by the packet forwarder to make forwarding decisions is to be performed to reflect the new, changed, or deleted route.
  • the FIB comprises a data structure having a plurality of levels.
  • the data structure includes one or more hybrid nodes - each acting as a root of a sub-tree of the data structure and each including a pointer to a child array of nodes at a next level of the plurality of levels.
  • the operations also include updating a control trie to reflect the new, changed, or deleted route by adding, removing, or modifying at least one external node in the control trie.
  • the control trie includes one or more external nodes each indicating routing information for one or more of a plurality of routes of the network.
  • the operations also include identifying, within a util trie, a util node that corresponds to a hybrid node of the FIB serving as a root of a subtree that is to be updated due to the new, changed, or deleted route.
  • the util trie also has the same plurality of levels, and includes a plurality of util nodes. Each of the plurality of util nodes is located at a boundary of one of the plurality of levels. Each of the plurality of util nodes corresponds to one of the one or more hybrid nodes of the FIB.
  • the operations also include obtaining a point of harvest (POH) identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie. Each POH identifier identifies, for the corresponding descendant util node, one of the nodes within the control trie that is at a corresponding location within the control trie or is a next node within the control trie beneath the corresponding location within the control trie.
  • the operations also include obtaining, via a harvesting process, data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node.
  • the data further includes a POH identifier for each of the one or more nodes that are hybrid nodes.
  • the operations also include, responsive to determining that the POH identifier of one of the one or more nodes obtained from the harvesting process matches the obtained POH identifier of one of the one or more immediate descendant util nodes, reusing the hybrid node from the child array while reconstructing the child array instead of regenerating the hybrid node and any of its descendant nodes.
  • a device includes one or more processors and a non-transitory machine-readable storage medium.
  • the non-transitory machine readable medium provides instructions which, when executed by the one or more processors, will cause the device to implement a packet forwarder to perform operations for efficiently reconstructing a forwarding information base (FIB) to reflect a new, changed, or deleted route of a communications network.
  • the operations include determining that an update to the FIB utilized by the packet forwarder to make forwarding decisions is to be performed to reflect the new, changed, or deleted route.
  • the FIB comprises a data structure having a plurality of levels.
  • the data structure includes one or more hybrid nodes - each acting as a root of a sub-tree of the data structure and each including a pointer to a child array of nodes at a next level of the plurality of levels.
  • the operations also include updating a control trie to reflect the new, changed, or deleted route by adding, removing, or modifying at least one external node in the control trie.
  • the control trie includes one or more external nodes each indicating routing information for one or more of a plurality of routes of the network.
  • the operations also include identifying, within a util trie, a util node that corresponds to a hybrid node of the FIB serving as a root of a subtree that is to be updated due to the new, changed, or deleted route.
  • the util trie also has the same plurality of levels, and includes a plurality of util nodes. Each of the plurality of util nodes is located at a boundary of one of the plurality of levels. Each of the plurality of util nodes corresponds to one of the one or more hybrid nodes of the FIB.
  • the operations also include obtaining a point of harvest (POH) identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie.
  • Each POH identifier identifies, for the corresponding descendant util node, one of the nodes within the control trie that is at a corresponding location within the control trie or is a next node within the control trie beneath the corresponding location within the control trie.
  • the operations also include obtaining, via a harvesting process, data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node.
  • the data further includes a POH identifier for each of the one or more nodes that are hybrid nodes.
  • the operations also include, responsive to determining that the POH identifier of one of the one or more nodes obtained from the harvesting process matches the obtained POH identifier of one of the one or more immediate descendant util nodes, reusing the hybrid node from the child array while reconstructing the child array instead of regenerating the hybrid node and any of its descendant nodes.
  • Figure 1 is a block diagram illustrating an exemplary forwarding database according to some embodiments.
  • Figure 2 is a block diagram illustrating an exemplary control trie populated using a number of keys corresponding to exemplary network routes according to some embodiments.
  • Figure 3 is a block diagram illustrating an exemplary util trie and an exemplary control trie with the util trie overlaid upon it according to some embodiments.
  • Figure 4 is a block diagram illustrating a portion of the overlaid util and control tries of Figure 3 and a corresponding portion of the exemplary forwarding database of Figure 1 according to some embodiments.
  • Figure 5 is a block diagram illustrating a portion of an overlaid util and control trie with illustrated Point of Harvest (POH) locations according to some embodiments.
  • Figure 6 is a flow diagram illustrating a pre-reconstruction flow and a reconstruction flow for efficient forwarding information base reconstruction according to some embodiments.
  • Figure 7 is a block diagram illustrating an insertion of a route and some operations performed with various portions of a util trie and/or control trie according to some embodiments.
  • Figure 8 is a block diagram illustrating an exemplary overlaid util and control trie before and after a deletion according to some embodiments.
  • Figure 9 is a flow diagram illustrating another flow for efficient forwarding information base reconstruction according to some embodiments.
  • Figure 10A illustrates connectivity between network devices (NDs) within an exemplary network, as well as three exemplary implementations of the NDs, according to some embodiments.
  • Figure 10B illustrates an exemplary way to implement a special-purpose network device according to some embodiments.
  • Figure 10C illustrates various exemplary ways in which virtual network elements (VNEs) may be coupled according to some embodiments.
  • Figure 10D illustrates a network with a single network element (NE) on each of the NDs, and within this straightforward approach contrasts a traditional distributed approach (commonly used by traditional routers) with a centralized approach for maintaining reachability and forwarding information (also called network control), according to some embodiments.
  • Figure 10E illustrates the simple case of where each of the NDs implements a single NE, but a centralized control plane has abstracted multiple of the NEs in different NDs into (to represent) a single NE in one of the virtual network(s), according to some embodiments.
  • Figure 10F illustrates a case where multiple VNEs are implemented on different NDs and are coupled to each other, and where a centralized control plane has abstracted these multiple VNEs such that they appear as a single VNE within one of the virtual networks, according to some embodiments.
  • references in the specification to "one embodiment," "an embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • Bracketed text and blocks with dashed borders may be used herein to illustrate optional operations that add additional features to embodiments. However, such notation should not be taken to mean that these are the only options or optional operations, and/or that blocks with solid borders are not optional in certain embodiments.
  • "Coupled" is used to indicate that two or more elements, which may or may not be in direct physical or electrical contact with each other, co-operate or interact with each other.
  • "Connected" is used to indicate the establishment of communication between two or more elements that are coupled with each other.
  • The terms "forwarding information base" (FIB), "forwarding database," "forwarding table," and common variants thereof may be used synonymously in this description unless otherwise indicated either explicitly or as made obvious by the context of use. The same is true of "control trie," "control database," "control table," etc., and of "util trie," "utility table," "util trie tree," etc.
  • Packet forwarding network elements have traditionally been implemented using special-purpose network equipment having specialized hardware support for performing certain tasks (e.g., lookups, forwarding table updating, etc.) as fast as possible to enable the most efficient forwarding of data.
  • hardware-based approaches suffer from a variety of problems: high cost, the significant hardware "real estate" required to implement them, increased power consumption, and, perhaps most importantly in recent times, inapplicability to virtualized platforms, which have become a tremendously important area.
  • Hardware-based approaches are also deficient because they lack flexibility - e.g., routing information must be stored according to the specifications of the hardware handling it.
  • these systems deploy an extremely "compacted" forwarding database data structure, so when an update needs to be performed, the software-based packet forwarder typically must reconstruct either large portions of the forwarding database or, in some cases, the entire data structure.
  • some packet forwarders use a tree-based data structure for a forwarding database, and to make a change at a particular location of the tree, the packet forwarders will reconstruct the entire subtree of the node containing the changes, often resulting in unnecessary work as other subtrees within the affected region are actually unchanged.
  • embodiments provide techniques for efficient forwarding information base reconstruction utilizing a software-based approach that can efficiently reconstruct different parts of the forwarding information base with the minimal number of operations possible based on the provisioned changes.
  • Embodiments can operate by tracking the forwarding information base via specifically chosen data structures (or "control structures") which are easy to manipulate and map against the FIB. These control structures can be examined to allow for only minimal changes to be made in the FIB to correctly reflect the newly provisioned changes. Embodiments can then make the rebuilding process significantly faster, effectively reducing the latency seen by the client.
  • embodiments are flexibly applicable across different computing platforms, as embodiments can be software-based and not rely on any specific hardware. Embodiments can also be very versatile and applicable to new trends like the virtualization of networking components.
  • embodiments disclosed herein can provide substantial processing gains compared to other software-based solutions.
  • the processing power used for performing operations needed for a change to the forwarding database will diminish since each update only rebuilds a sub-tree root node (and perhaps its immediate data) as opposed to the whole sub-tree.
  • Embodiments disclosed herein can greatly reduce transient memory usage compared to other software-based solutions. Because the forwarding plane should not be disturbed when updating the FIB, it follows that every sub-tree that will be updated will not be freed until the whole new tree is ready. This means that at some point both sub-trees will exist at the same time (i.e., the old one, and the new one). As some embodiments may only rebuild an extremely limited amount of data, only this limited amount of data need be present in transient memory during the rebuild at any particular moment in time, greatly reducing the transient memory usage compared to other approaches that reconstruct large portions of the FIB, thus requiring significant amounts of transient memory.
  • embodiments disclosed herein are widely applicable, and can be particularly efficient for longest prefix match types of forwarding tables, where data can be inserted at any prefix length. Accordingly, embodiments are very useful for most widely-deployed protocols, like Internet Protocol (IP) version 4 (IPv4) and version 6 (IPv6).
  • Embodiments can also provide high overall performance gains compared to other approaches. Given the processing gains and the parallelism enabled by these embodiments, the provisioning throughput can soar depending on the application. For example, initial tests of an embodiment involving IPv4 provisioning resulted in a performance improvement of approximately 80% compared to another recent software-based provisioning system.
  • embodiments benefit by reducing the number of operations required to perform a change to the forwarding database, effectively increasing the rate at which the changes occur. Embodiments can also ensure that forwarding traffic is never disrupted, that independent operations can be performed in parallel, and that only a minimum number of data structures will get re-processed.
  • Figure 1 is a block diagram illustrating an exemplary forwarding database 100 according to some embodiments.
  • the forwarding database 100 comprises a hierarchical tree-type data structure having one or more hybrid nodes 104 (represented with an "H") and one or more leaf nodes 105 (represented with an "L").
  • a leaf node 105 can store forwarding information for a particular route (as "data"), and in some embodiments, a leaf node 105 can be a "POP" node, which indicates that, during a traversal of the database 100, the traversal has ended and thus, a back trace of the database 100 should be performed (e.g., to reveal a longest prefix match route at a previously visited leaf node 105). Reaching a POP node during a traversal may also cause additional nodes in the forwarding database 100, referred to as "pushdflt" nodes, to be "popped" as described later herein.
  • the hybrid nodes 104, in contrast, can be used to traverse the database 100 while searching for forwarding information by indicating how to access other nodes of the database 100 at the next level.
  • the forwarding database 100 can be traversed in the following manner. During a traversal, each time a hybrid node is landed upon, it is determined whether a "pushdflt" value (e.g., bit) of the hybrid node is set. If the pushdflt value is set, then a pushdflt node (e.g., an actual leaf node) resides in the child array of the hybrid node, and this child array may be "saved" to some temporary memory location. Eventually, at some point the traversal will arrive at a leaf node or a POP node. If a leaf is hit, the traversal process may return that leaf node.
  • Otherwise, if a POP node is hit, the traversal process may return the pushdflt node(s) that were saved in temporary memory (each time, during the traversal, that a hybrid node with its pushdflt value set was hit).
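  • As a concrete illustration of this traversal, the hedged C sketch below (building on the hypothetical struct fib_node above) saves the most recent pushdflt leaf while descending and falls back to it when a POP node ends the walk; the 13-8-8-8-8 stride layout, the 64-bit key, and the reserved slot for the pushdflt leaf are all assumptions made for illustration, not the patent's implementation.

      /* Hedged lookup sketch with assumed stride layout and key width. */
      static unsigned stride_index(uint64_t key, unsigned offset, unsigned width)
      {
          return (unsigned)((key >> (64 - offset - width)) & ((1u << width) - 1));
      }

      const void *fib_lookup(const struct fib_node *root, uint64_t key)
      {
          static const unsigned strides[] = {13, 8, 8, 8, 8}; /* assumed layout */
          const struct fib_node *dflt = NULL;   /* last saved pushdflt leaf */
          const struct fib_node *n = root;
          unsigned offset = 0;

          for (unsigned level = 0;
               n && n->type == FIB_HYBRID && level < 5; level++) {
              if (n->pushdflt)            /* remember fallback for the back trace */
                  dflt = &n->children[0]; /* assumed: pushdflt leaf in slot 0     */
              n = &n->children[stride_index(key, offset, strides[level])];
              offset += strides[level];
          }
          if (n && n->type == FIB_LEAF)
              return n->fwd_info;               /* direct hit on a leaf         */
          return dflt ? dflt->fwd_info : NULL;  /* POP: "pop" the pushdflt leaf */
      }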
  • Figure 1 shows how the forwarding database 100 is arranged in different levels - level 0 102A, level 1 102B, level 2 102C, level 3 102D, level 4 102E, and level 5 102F.
  • to move from one level of the forwarding database 100 to the next, a certain number of configurable bits from a packet's key can be used in combination with data from the corresponding sub-tree root node (i.e., a hybrid node at that level).
  • the lookup may conclude when a data node is hit (i.e., a leaf node 105), which indicates how to forward a particular packet.
  • Each hybrid node 104 serves as a root 106 of a sub-tree.
  • the second hybrid node at level 2 serves as a root of a sub-tree 108 including a portion of levels 3-5 (102D-102F), which will be discussed in additional detail later herein with regard to Figure 4.
  • the forwarding database 100 can be constructed in a particular manner to ensure that a traversal of the forwarding database 100 requires at most a particular number of memory accesses.
  • forwarding database 100 can be logically and/or physically arranged in an extremely efficient manner to constrain the number of memory accesses to such a maximum value, e.g., by arranging the tree with a maximum number of levels, keeping the individual data structures (e.g., arrays) involved tightly arranged/packed, etc.
  • Figure 2 is a block diagram illustrating an exemplary control trie 200 populated using a number of keys 206 corresponding to exemplary network routes according to some embodiments.
  • the control trie 200 can be a binary-type tree (e.g., a PATRICIA trie, where PATRICIA is an acronym for "Practical Algorithm To Retrieve Information Coded In Alphanumeric").
  • External nodes 204 are those nodes containing data (e.g., route/forwarding information) and are illustrated with circles having solid white backgrounds, while split nodes 202 can be "internal" nodes that are used to split routing prefixes at a prefix depth where their first bit differs, and are illustrated with circles having striped backgrounds. As a result, external nodes 204 can be split nodes 202, but the reverse is not true.
  • control trie 200 data structure can be utilized as a first stop for any operation needing to be performed involving routing/forwarding information.
  • insertions, deletions, and updates of such data may be stored in the control trie 200.
  • inserting the following eight keys would result in the control trie 200 of Figure 2: {0.16.0.0.0.0/13, 0.16.32.32.32.0/37, 0.16.36.36.36.0/39, 0.16.48.48.48.0/37, 0.16.52.56.56.0/37, 0.16.52.56.56.0/45, 0.16.52.56.56.8/45, and 0.16.52.56.63.248/45}.
  • these "keys" are exemplary and thus are not IPv4 or IPv6 routes themselves; it is to be understood that the keys can be of a number of useful values known to those of skill in the art.
  • the resulting control trie 200 includes a first split node 202 with a "/0" indicator, and because all of the keys begin with an initial zero bit, only one path down the trie 200 to the external node 204 exists. If a search of the tree is performed with a key that falls within the "0.16.0.0.0.0/13" subnet, the traversal will stop at this point with the external node 204, which stores routing/forwarding information for those routes.
  • the traversal will continue at the split node of "/19", where the corresponding bit of the key will be analyzed - if it is a "0", the traversal will continue down the left subtree; otherwise, if the bit value is a "1", the traversal will continue down the right subtree.
  • the traversal will continue through the trie 200 until the key is matched with one of the external nodes 204, which may or may not be at a "leaf" location of the tree.
  • a split node 202 can be created each time a set of inserted keys/prefixes diverge at a particular bit location.
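  • A split node's bit position is simply the first bit at which two inserted keys differ. A minimal hypothetical helper (64-bit keys assumed) that computes that position:

      /* First differing bit between two keys, counted from the most significant
       * bit: the depth at which a split node would be created. */
      static unsigned first_divergent_bit(uint64_t a, uint64_t b)
      {
          uint64_t diff = a ^ b;
          unsigned pos = 0;
          if (!diff)
              return 64;                 /* identical keys: no divergence */
          while (!(diff & (1ULL << 63))) {
              diff <<= 1;                /* scan toward the first set bit */
              pos++;
          }
          return pos;                    /* e.g., divergence at bit 19 -> "/19" */
      }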
  • the control trie 200 data structure stores all the information the "client" has inserted. Accordingly, the actual forwarding database can be generated by parsing the control trie 200.
  • some embodiments further utilize a util trie.
  • Figure 3 is a block diagram illustrating an exemplary util trie 300 and an exemplary control trie with the util trie overlaid 350 upon it according to some embodiments.
  • the util trie 300 can be a binary tree type structure that is utilized to hold information regarding the forwarding table sub-tree root nodes, and can include util nodes 302 and sometimes even split nodes (as in the control trie 200). Each solid black dot on Figure 3 represents a util node 302.
  • a util node 302 can have all of the data needed to reproduce the corresponding hybrid node 104 in the forwarding database 100. Embodiments disclosed herein utilize such a util trie 300 because it is much more efficient to manage a binary tree than the flattened-out, latency-efficient forwarding table when performing provisioning operations.
  • the util nodes 302 of the util trie 300 are inserted only on "stride" boundaries.
  • the number of strides (e.g., stride/levels 301A-301E) in the util trie 300 corresponds to the number of levels of the forwarding database 100, and the "length" of each stride can be the number of bits used from the key to move from one level to the next.
  • the strides may be represented as 13-8-8-8-8 (which also means we have 5 levels after the root, and that the first stride 301A represents 13 bits, a second stride 301B (not illustrated) represents 8 bits, etc.).
  • Horizontal lines at each stride's depth are illustrated herein as extending over the control trie (which has been overlaid 350 with the util trie 300) to show where the util nodes will be generated - see, for example, stride boundary 304A, stride boundary 304B, etc., stride boundary 304F.
  • Util nodes will be generated at these particular positions of the util trie 300, as each time a stride ends, the next set of bits from the lookup prefix of the key are to be indexed in order to jump to the next level and process the following node.
  • the util trie 300 can in some embodiments be generated from the control trie 200 for the purpose of creating (or managing/updating) the forwarding information base 100.
  • The root util node is then "mapped" to the hybrid node at the root of the forwarding database 100, such as by storing a pointer to the hybrid node within the util node, or by storing enough information to allow for a pointer to the hybrid node to be ascertained, a pointer to the child array of the hybrid node to be ascertained, additional information to allow for the other elements of the hybrid node to be recreated, etc.
  • This hybrid node 104 is shown in Figure 1 at Level 0 102A.
  • This created hybrid node is to include a pointer to its child array in the forwarding database 100, which includes hybrid node(s) and/or leaf node(s).
  • the next phase of "harvesting" includes looking at everything that exists between the root node and the next stride boundary (304B) at "/13" - all of this will be "flattened" into the child array of the root hybrid node.
  • the external node and the util node share a same location - i.e., right at the stride boundary 304B - and thus, in some embodiments this raises a special case. Because of this shared location, each node would have the same index in the resultant child array, and thus, under the special case, only one of them can exist in the child array - in this case, a hybrid node for the util node. Thus, the data for the external node is not lost; it can instead be placed one level lower than the child array into a special location (within the array) called a "pushdflt" node as introduced earlier.
  • Such pushdflt nodes are the ones "popped" when a "POP node" is reached while traversing the forwarding database 100.
  • the hybrid node on level 1 will have a child array on level 2 that includes a special pushdflt node with the forwarding information of external node "0.16.0.0.0.0/13."
  • the child array (within the forwarding database 100) of the first hybrid node 104 would have included one hybrid node (corresponding to the discovered util node) and one leaf node (corresponding to the external node), and one (or more) POP node(s).
  • this child array 122 may include a total of 2^13 nodes, where the exponent 13 is derived from the size of the stride - i.e., 13 bits. Accordingly, in such an embodiment, the child array 122 may include one hybrid node, one leaf node, and (2^13 - 2) POP nodes.
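  • Building on the earlier struct sketch, a hedged helper showing the sizing logic just described: a child array for a 13-bit stride holds 2^13 slots, all defaulting to POP nodes until harvested hybrid and leaf entries overwrite their computed indices. The function name is hypothetical.

      #include <stdlib.h>

      struct fib_node *make_child_array(unsigned stride_bits)
      {
          size_t n = (size_t)1 << stride_bits;   /* 2^13 = 8192 for stride 13 */
          struct fib_node *arr = calloc(n, sizeof(*arr));
          if (!arr)
              return NULL;
          for (size_t i = 0; i < n; i++)
              arr[i].type = FIB_POP;             /* default: "pop" on arrival */
          return arr;    /* caller fills in harvested hybrid/leaf slots */
      }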
  • the process thus continues with performing another harvesting of another level.
  • the split nodes are not considered during the flattening.
  • two hybrid nodes will be constructed in a new child array for the two util nodes, and one or more POP nodes will similarly be constructed to fill the child array.
  • This child array is shown in Figure 1 as the child array at level 2 102C, in which one of the hybrid nodes is shown as being a root 106 of a sub-tree (though of course, all hybrid nodes act as roots of sub-trees, including the other hybrid node of that same child array).
  • the leftmost hybrid node is mapped to the corresponding left util node (e.g., by storing a pointer to the leftmost hybrid node in that util node), and construction will continue by again harvesting for that node by identifying all nodes within the next stride.
  • there are only two util nodes (at "/29") and no external nodes, so another child array is created (e.g., with 2^8 = 256 nodes) to include two hybrid nodes corresponding to the two harvested util nodes, and pointers to these hybrid nodes are stored within the util nodes of the util trie.
  • the second util node at "/29" will also be harvested, which includes just one util node in the stride beneath it, leading to a child array with one hybrid node (corresponding to the util node) and other leaf nodes as POP nodes.
  • the process may continue with harvesting for this util node (at "/37"), where the flattening out process will "flatten" out everything between the /37 boundary and the /45 boundary (304F).
  • each shown route is the result of being "incremented" by 8 units (e.g., .0 to .8, .8 to .16, etc.) as we only use up to bit 45 (and thus, there will be three trailing bits left out). Due to this flattening, all of these routes will include the same routing information as that of the /39 route. Accordingly, this single /39 route may simply expand to instead be represented as 64 routes (/45 routes) in the child array.
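  • The expansion arithmetic is mechanical: flattening a /39 route out to the /45 boundary yields 2^(45-39) = 64 consecutive child-array entries carrying the same forwarding information. A hedged sketch with illustrative names:

      /* Expand one route to the stride boundary, as described above. */
      void expand_route(struct fib_node *child_array, unsigned base_index,
                        unsigned route_len, unsigned boundary_len,
                        const void *fwd_info)
      {
          unsigned copies = 1u << (boundary_len - route_len); /* 64 for /39->/45 */
          for (unsigned i = 0; i < copies; i++) {
              child_array[base_index + i].type = FIB_LEAF;
              child_array[base_index + i].fwd_info = fwd_info;
          }
      }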
  • a previous algorithm may simply identify a closest "sub-tree root" where the change is in the control trie, and then simply reconstruct the entire control trie 350 and/or util trie 300, and also reconstruct the corresponding subtree in the forwarding database 100. This requires, in most cases, a large number of "re-harvests" of the control-side tries and thus a large number of new child arrays of the forwarding database 100 to be constructed.
  • embodiments disclosed herein can intelligently update the forwarding database 100 without these large-scale reconstructions and significant re-harvests of the control-side structures.
  • Figure 4 is a block diagram illustrating a portion 400 of the overlaid 350 util and control tries of Figure 3 and a corresponding portion of the exemplary forwarding database of Figure 1 according to some embodiments.
  • some embodiments can include identifying the util node 402 above the location needing to be updated, as the insertion is in the "harvesting zone" of the corresponding hybrid node (and thus, the child array 452 of that hybrid node).
  • the child array 452 needs to be modified, and, upon careful inspection, it can be determined that the sub-trees 454 of the child array 452 do not need to be modified/reconstructed because they, if regenerated via re-harvesting, would return exactly the same thing.
  • although the hybrid nodes of the child array 452 may end up at different indices within the child array 452, the contents will remain the same.
  • some embodiments utilize a new data structure referred to as a "dirty util trie" (not illustrated), which can be used to track util nodes (of the util trie 300) needing to be updated (i.e., have their child arrays be reconstructed via a harvesting of these util nodes).
  • the dirty util trie can be a trie tree that keeps track of all util nodes that need to be updated after a set of operations in the control trie 350.
  • the dirty util trie can thus be a lightweight structure that may simply include pointers to those of the util nodes in the util trie 300 that are dirty.
  • when an update is performed that will perturb a hybrid node (e.g., create a need to rebuild its child array), the dirty util trie will be modified to include a node that references that corresponding "dirty" util node.
  • although the util trie itself could be used for this purpose (e.g., by using a bit value to mark a particular util node as "dirty"), embodiments can utilize a dirty util trie instead to gain additional benefits. For example, in many cases there could be millions of routes stored in a very large util trie data structure, and thus, using a dirty util trie eliminates the need to walk the entire util trie to check every single util node to see if it is dirty.
  • the dirty util trie is tremendously lightweight and can reduce the amount of storage overhead, as its entries need not carry data aside from a pointer to a corresponding util node - in contrast, adding an additional field into the util trie for every util node could be tremendously wasteful when only a few (or none at all) of the nodes are dirty.
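  • A hedged sketch of how lightweight such a dirty node can be - little more than a pointer to the affected util node, arranged in the same binary-trie shape; mark_dirty and the linking convention are hypothetical names.

      struct dirty_node {
          struct util_node *target;          /* the "dirty" util node       */
          struct dirty_node *left, *right;   /* mirrors the util trie shape */
      };

      struct dirty_node *mark_dirty(struct util_node *u)
      {
          struct dirty_node *d = calloc(1, sizeof(*d));
          if (d)
              d->target = u;  /* caller links d into the dirty util trie at the
                                 position mirroring u's position in the util trie */
          return d;
      }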
  • when it is determined that the util node 402 is "dirty," its child nodes 408 may be identified and then cached (e.g., within the util node 402 itself). These cached nodes may be used later as described herein as part of the forwarding database 100 reconstruction process, though in other embodiments it is not necessary to perform such a caching (at the expense of additional processing/time required to later re-acquire these nodes).
  • Figure 5 is a block diagram illustrating a portion 500 of an overlaid util and control trie with illustrated Point of Harvest (POH) locations according to some embodiments.
  • every util node in the util trie may have an associated POH, which may simply be one of the nodes in the control trie.
  • Each sub-tree root node in the forwarding table gets all its information by "harvesting" or flattening all the external nodes between the util node that represents it and the beginning of the next level. Because the util node itself doesn't belong to the control trie, embodiments select the next closest node in line with the same prefix as the starting point, or point of harvest, for this process.
  • the POH can be a split node or external node.
  • the topmost util node 'A' 302A has a POH that is the split node 'A' 502 (of the control trie) at the same level. This is one common POH location for a util node - the node of the control trie at a corresponding location.
  • Another common POH location for a util node is the next node in the control trie beneath the corresponding location.
  • for util node 'B' 302B, the POH 504 is the next control node beneath it, which is the external node 'B' 525.
  • the util node 'B' 302B and its POH 504 are at different levels 550.
  • some nodes can be a POH for multiple util nodes and thus be "shared" by nodes at different levels 570, though as indicated above, the node cannot be a POH for multiple util nodes that are at a same level.
  • for util node 'C' 302C, its POH 506 is at a level beneath it, and this POH 506 is also shared with the util node 'D' 302D.
  • the POH for a util node can be determined by selecting the control node in the control trie that is at a same location within the control trie (e.g., POH 502 for util node 'A' 302A, POH 506 for util node 'D' 302D), and, if one does not exist, selecting the next control node beneath that corresponding location in the control trie (e.g., POH 504 for util node 'B' 302B, POH 506 for util node 'C' 302C).
  • the POH can be either at the same prefix length as the util node or lower in the control trie but never higher.
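  • That selection rule is straightforward to express; a hedged sketch in which the two control-trie lookup helpers are assumed rather than defined:

      /* Assumed helpers (not defined here): exact-location lookup and
       * next-node-below search in the control trie. */
      struct ctrie_node *ctrie_node_at_location(struct ctrie_node *root,
                                                uint64_t prefix, unsigned len);
      struct ctrie_node *ctrie_next_below(struct ctrie_node *root,
                                          uint64_t prefix, unsigned len);

      struct ctrie_node *find_poh(struct ctrie_node *ctrie,
                                  uint64_t prefix, unsigned prefix_len)
      {
          struct ctrie_node *same = ctrie_node_at_location(ctrie, prefix, prefix_len);
          if (same)
              return same;    /* e.g., POH 502 for util node 'A' 302A */
          /* Never higher: otherwise take the next node beneath this location. */
          return ctrie_next_below(ctrie, prefix, prefix_len);
      }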
  • an identifier of the POH for each util node may be identified and cached within the particular util node.
  • these cached copies can include identifiers of the POHs for these util nodes.
  • because these util nodes 408 are located on a same level, we know that their POHs must necessarily be different. This property leads to a useful benefit in that these POHs can uniquely identify the cached util nodes at this level despite the absence of other unique information.
  • Figure 6 is a flow diagram illustrating a pre-reconstruction phase flow 600 and a reconstruction phase flow 630 for efficient forwarding information base 100 reconstruction according to some embodiments.
  • the operations in the flow diagrams will be described with reference to the exemplary embodiments of the other figures. However, it should be understood that the operations of the flow diagrams can be performed by embodiments other than those discussed with reference to the other figures, and the embodiments discussed with reference to the other figures can perform operations different than those discussed with reference to the flow diagrams.
  • the operations of the flow 600 and/or operations of the flow 630 can be performed by a packet forwarder network element as described herein, a control plane entity (implemented by a same or different device or at a same or different physical/geographic location as a corresponding forwarding plane), etc.
  • a "pre -reconstruction" phase flow 600 and a “reconstruction” phase flow 630 are illustrated. These phases may be performed adjacent in time to each other, though in other embodiments, the pre-reconstruction phase may be performed one or more times (for one or more changes) and then the reconstruction phase may be performed at a later point in time (for one or more changes).
  • the pre-reconstruction flow 600 includes, at block 602, identifying a util trie node that is to be considered "dirty" based upon a change. In some embodiments, block 602 includes identifying the util node just "above" the location of the change (e.g., an insertion, deletion, or modification) that occurs within the control trie.
  • the flow 600 optionally includes caching the immediate "child" util nodes (including their POH pointers) of the identified util trie node. These child util nodes may be cached within that identified util trie node.
  • the flow 600 includes inserting a node in a "dirty util trie" that corresponds to the identified util trie node.
  • the node inserted includes a reference to the identified util trie node, such as a pointer.
  • the flow 600 includes updating the POH value for all affected util nodes due to the change of the control trie. For example, a newly-inserted node in the control trie might have just become the new POH for a util node.
  • This flow 600 may be performed one or multiple times, and thus, there may be one or multiple different dirty util nodes in the dirty util trie.
  • the reconstruction process (for perhaps multiple ones of the dirty util nodes, such as those at a same level, etc.) can be performed concurrently/simultaneously by different processing elements (e.g., threads, processes, processors, etc.).
  • the dirty util trie can be traversed using a breadth first, top-down process to easily identify dirty util nodes and launch the efficient "reconstruction" of the child arrays (corresponding to the util nodes corresponding to the identified dirty util nodes) using, for example, parallelism that could be provided by multithreading techniques.
  • the reconstruction phase flow 630 may begin with block 632 to identify a dirty util trie node, which identifies a util node in the util trie that needs to have its child array reconstructed. This may occur according to a breadth first, top down traversal of the dirty util trie, which can thus enable safe parallel reconstruction of the forwarding database 100.
  • the flow 630 may then include, at block 634, marking the util node for deletion in the util trie.
  • the util node will ultimately be deleted and replaced with a newly-constructed util node.
  • the flow 630 includes harvesting the util node, which can include block 638 and executing a harvesting algorithm (e.g., by a separate harvester process) to identify one or more nodes (including one or more hybrid nodes) to be placed into the child array of the hybrid node corresponding to the util node within the forwarding database.
  • This optionally can include, at block 640, walking at least some of the control trie starting with the node identified by the POH of the identified util trie node to thus identify all nodes within the level/stride beneath the util trie node.
  • This can optionally include block 642, where the harvesting algorithm returns a set of entries to construct the nodes for the child array.
  • Each of the entries that is for a hybrid node includes an index of where that hybrid node is to be placed in the child array, and a point of harvest (POH) pointer of the hybrid node that identifies a node within the control trie that serves as its POH.
  • the flow 630 includes creating a new util node (corresponding to the identified util node that was marked as dirty) and creating/updating the corresponding new child array in the forwarding database. This can include, at block 646, for every util node (and corresponding hybrid node) to be created, checking to determine whether an existing node (of the forwarding database) can be re-used by comparing the returned POH (e.g., from block 642) with the cached POH (e.g., from block 604). If, at block 648, it is determined that the POHs are the same, then the flow can reuse the util (and hybrid) node instead of reconstructing them.
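  • A hedged sketch of that re-use check: for each hybrid entry the harvester returns, its POH is compared against the POHs cached from the identified util node's immediate descendants, and a match means the old hybrid node (and, through its child pointer, its entire sub-tree) is copied rather than rebuilt. The entry layout and names are illustrative only.

      struct harvest_entry {
          enum fib_node_type type;
          unsigned index;             /* slot in the new child array  */
          struct ctrie_node *poh;     /* POH pointer (hybrid entries) */
          const void *fwd_info;       /* leaf data (leaf entries)     */
      };

      /* Returns non-zero when an existing hybrid node was reused. */
      int try_reuse(struct fib_node *new_array, const struct harvest_entry *e,
                    struct util_node *const *cached, unsigned ncached)
      {
          for (unsigned i = 0; i < ncached; i++) {
              if (cached[i]->poh == e->poh) {             /* POHs unique per level */
                  new_array[e->index] = *cached[i]->hybrid;  /* shallow copy: the  */
                  return 1;        /* child pointer carries the whole sub-tree     */
              }
          }
          return 0;                /* no match: build a new hybrid node */
      }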
  • this entire reconstruction phase flow 630 can be performed multiple times, such as when the dirty util trie includes multiple dirty util nodes. Additionally, some or all of these operations can be performed in parallel.
  • Figure 7 is a block diagram 700 illustrating an insertion of a route and some operations performed with various portions of a util trie and/or control trie according to some embodiments.
  • This figure shows exemplary operations in response to a new route insertion, where an insertion point 404 is illustrated and its nearest parent util node is thus deemed "dirty" 705 (e.g., via identification block 602).
  • this util node can be identified, and thus, the harvesting process will be performed.
  • the harvesting algorithm can be performed by a "harvester" process, which examines the control trie starting at the POH and ending at the stride boundary, to create a data structure containing all the nodes to be placed in the child array including any hybrid nodes (and their POHs), leaf nodes, and their corresponding indices within the child array.
  • the harvesting process reveals the new data point (e.g., which will become a new leaf node in the child array of the forwarding database) and two util nodes (e.g., which will be two new hybrid nodes in the child array).
  • the harvesting process does not know that the hybrid nodes corresponding to the harvested util nodes are the same as what is currently deployed within the forwarding database.
  • the harvester process returns a set of entries corresponding to nodes that are to be constructed for the child array.
  • Each entry can include a variety of types of information, such as a type of node (e.g., hybrid type, leaf type, POP type), an index in which the node is to be placed in the child array, a POH pointer (for hybrid nodes), a leaf data pointer (for leaf nodes), etc.
  • a builder process can be used to translate this data into the format required for a child array in the forwarding database.
  • the builder process can utilize the "re-use" logic disclosed herein to determine whether the hybrid node requests (corresponding to the returned harvest data) are for the "same" hybrid nodes as the ones that are currently used in the forwarding database. Notably, it is non-trivial to determine whether these are the same - especially because they stand at the same prefix. However, because we have the POH values, which are unique for each level, these POH values can be used to identify the "same" nodes. Thus, the builder process at this point examines the data returned by the harvester, and for each entry, populates the child array.
  • This re-use logic can be called each time one of the entries returned by the harvester is a hybrid node - if the POH matches one of the cached ones then the hybrid node is reused; otherwise, a new task needs to be created for a new hybrid node to be built.
  • when the harvester returns a request to build a hybrid node (at a particular index), it returns a hybrid node type, the index in which the node needs to be placed within the child array, and an identifier of the POH for the node indicating where one must go (e.g., a pointer to a node in the control trie) to start harvesting for the next level.
  • the harvester process can determine the POH when it walks the nodes in the stride. For example, upon going down to the control trie node (720) and determining that it has crossed its stride, the harvester process can determine that, since it did cross the stride, there must be a util node above it, and that this control trie node that it landed upon is the POH for that util node.
  • the re-use logic can first compare (see 720, see also 646/648) the returned POH value (returned with the hybrid node data by the harvester) with the cached POH value of the cached util nodes.
  • the builder can thus be instructed to re-use the existing hybrid node - e.g., the builder can be instructed to make a copy of the hybrid node in a new child array that is being constructed.
  • because the hybrid node includes a pointer to any of its descendant nodes in the forwarding database, none of these descendant nodes need to be reconstructed.
  • the index of the hybrid node may or may not change within the child array, even though the rest of the hybrid node remains the same. However, the relative order of the hybrid nodes will not change within the child array even though their particular indices may now be different; this property can be utilized to reduce the number of POH values of the cached child util nodes that need to be compared to a returned POH.
  • for example, if a child array includes hybrid nodes A[0], B[1], and C[2], then after one or more updates the same hybrid nodes may be located at different indices: A[0], B[4], and C[6].
  • although these hybrid nodes are at different indices of the child array, they remain in the same relative order - among the three, A is still first, B is still second, and C is still third.
  • accordingly, once a search passes the location where a match could have occurred, the process may stop searching for the match because it cannot exist.
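  • One way to exploit that order-preservation property, sketched as a variant of the try_reuse sketch above: a cursor advances monotonically through the cached descendants, so entries already passed are never compared again.

      int try_reuse_ordered(struct fib_node *new_array,
                            const struct harvest_entry *e,
                            struct util_node *const *cached, unsigned ncached,
                            unsigned *cursor)
      {
          for (unsigned i = *cursor; i < ncached; i++) {
              if (cached[i]->poh == e->poh) {
                  new_array[e->index] = *cached[i]->hybrid;
                  *cursor = i + 1;   /* later entries can only match later slots */
                  return 1;
              }
          }
          return 0;   /* past every remaining candidate: no match can exist */
      }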
  • a new child array can be constructed, which due to this insertion may include a new leaf node for a new route, and other leaf nodes may remain the same and the existing hybrid nodes may also remain the same (except for possibly their indices) and be reused due to this logic.
  • Some embodiments thus can utilize a "make before break" philosophy, and thus create a new child array and, upon its completion, switch over the pointer from its parent hybrid node from the "old" child array to the "new" child array, which can be extremely fast so that no traffic being processed is interrupted due to temporarily missing forwarding information.
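  • The switch-over itself can be a single pointer store, sketched below; the use of C11 atomics and the drain-before-free rule are assumptions, since the patent only requires that forwarding traffic is never disrupted.

      #include <stdatomic.h>

      /* Publish a fully built child array, "make before break" style. */
      void swap_child_array(_Atomic(struct fib_node *) *slot,
                            struct fib_node *new_array,
                            struct fib_node **old_out)
      {
          *old_out = atomic_load_explicit(slot, memory_order_relaxed);
          atomic_store_explicit(slot, new_array, memory_order_release);
          /* *old_out is freed only after in-flight lookups have drained. */
      }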
  • Figure 8 is a block diagram illustrating an exemplary overlaid util and control trie before 800 and after 850 a deletion according to some embodiments.
  • the change may involve a deletion of an external node 802 from the control trie.
  • the split node 804 in the control trie will disappear after the deletion.
  • the util node 852 will now have a new POH - the external node 854 on the bottom left.
  • the util nodes 856 on the right side may still exist, but essentially become useless, and thus may have a null POH; they can be considered dirty and added to the dirty util trie (as they may need to be deleted, along with their forwarding database 100 counterparts, when the dirty util trie is processed).
  • these util nodes 852/856 were affected by the change, thus causing their POHs to change or become NULL.
  • there are cases where a single change (e.g., deleting an entry in the control trie) affects the POHs of multiple util nodes.
  • Two of these util nodes 856 now have NULL POHs, and the util node 852 now has the lower left entry of the control trie 854 as its POH.
  • the POH values may be updated after every change to avoid a situation where, during a later harvesting and building process, the POH comparisons won't match even though they should (i.e., the hybrid nodes are the same).
  • Figure 9 is a flow diagram illustrating another flow for efficient forwarding information base reconstruction according to some embodiments.
  • the flow 900 can be performed by, for example, a packet forwarder network element as described herein, a control plane entity (implemented by a same or different device or at a same or different physical/geographic location as a corresponding forwarding plane), etc.
  • the flow 900 includes determining that an update to a forwarding information base (FIB) utilized to make forwarding decisions is to be performed to reflect a new, changed, or deleted route.
  • the FIB comprises a data structure having a plurality of levels, and includes one or more hybrid nodes each acting as a root of a sub-tree of the data structure and each including a pointer to a child array of nodes at a next level of the plurality of levels.
  • the flow 900 includes updating a control trie to reflect the new, changed, or deleted route by adding, removing, or modifying at least one external node in the control trie.
  • the control trie includes one or more external nodes, each indicating routing information for one or more of a plurality of routes of the network.
  • the flow 900 also includes, at block 915, identifying, within a util trie data structure, a util node that corresponds to a hybrid node of the FIB serving as a root of a subtree that is to be updated due to the new, changed, or deleted route.
  • the util trie also has the plurality of levels and includes a plurality of util nodes. Each of the plurality of util nodes is located at a boundary of one of the plurality of levels, and each of the plurality of util nodes corresponds to one of the one or more hybrid nodes of the FIB.
  • the flow 900 also includes, at block 920, obtaining a point of harvest (POH) identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie.
  • Each POH identifier identifies, for the corresponding descendant util node, one of the nodes within the control trie that is at a corresponding location within the control trie or is a next node within the control trie beneath the corresponding location within the control trie.
  • the flow 900 includes obtaining, via a harvesting process, data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node.
  • the data further includes a POH identifier for each of the one or more nodes that are hybrid nodes.
  • when the POH identifiers match, the flow 900 includes reusing the hybrid node from the existing child array while reconstructing the child array, instead of regenerating the hybrid node and any of its descendant nodes (see the sketch below).
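As a concrete illustration of this reuse decision, the following Python sketch compares cached child POH identifiers against those returned by the harvesting process to decide, per index of the child array, whether an existing hybrid node (and its whole sub-tree) can be reused. The HybridNode class, the dictionary shapes, and the rebuild_child_array name are hypothetical stand-ins, not structures defined by the patent.

    class HybridNode:
        def __init__(self, poh):
            self.poh = poh      # Point of Harvest identifier
            self.children = []  # child array at the next level

    def rebuild_child_array(cached_children, harvested):
        # cached_children: {index: existing HybridNode} - the immediate
        #   descendant util nodes' hybrid nodes, cached with their POHs.
        # harvested: [(index, poh)] - hybrid nodes that the harvesting
        #   process says must appear in the reconstructed child array.
        new_children = {}
        for index, poh in harvested:
            cached = cached_children.get(index)
            if cached is not None and cached.poh == poh:
                # POH match: the sub-tree rooted here is unchanged, so the
                # existing hybrid node is reused as-is.
                new_children[index] = cached
            else:
                # No match: this hybrid node (and its descendants) must be
                # regenerated.
                new_children[index] = HybridNode(poh)
        return new_children

    old = {0: HybridNode("poh-a"), 1: HybridNode("poh-b")}
    rebuilt = rebuild_child_array(old, [(0, "poh-a"), (1, "poh-x")])
    print(rebuilt[0] is old[0])  # True  - reused without regeneration
    print(rebuilt[1] is old[1])  # False - regenerated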
  • Embodiments disclosed herein may involve the use of one or more electronic devices.
  • An electronic device stores and transmits (internally and/or with other electronic devices over a network) code (which is composed of software instructions and which is sometimes referred to as computer program code or a computer program) and/or data using machine-readable media (also called computer-readable media), such as machine-readable storage media (e.g., magnetic disks, optical disks, read only memory (ROM), flash memory devices, phase change memory) and machine-readable transmission media (also called a carrier) (e.g., electrical, optical, radio, acoustical or other form of propagated signals - such as carrier waves, infrared signals).
  • an electronic device (e.g., a computer) includes hardware and software, such as a set of one or more processors coupled to one or more machine-readable storage media to store code for execution on the set of processors and/or to store data.
  • an electronic device may include non-volatile memory containing the code since the non-volatile memory can persist code/data even when the electronic device is turned off (when power is removed), and while the electronic device is turned on that part of the code that is to be executed by the processor(s) of that electronic device is typically copied from the slower non-volatile memory into volatile memory (e.g., dynamic random access memory (DRAM), static random access memory (SRAM)) of that electronic device.
  • Typical electronic devices also include a set of one or more physical network interfaces (NIs) to establish network connections (to transmit and/or receive code and/or data using propagating signals) with other electronic devices.
  • One or more parts of an embodiment may be implemented using different combinations of software, firmware, and/or hardware.
  • a network device is an electronic device that communicatively interconnects other electronic devices on the network (e.g., other network devices, end-user devices).
  • Some network devices are "multiple services network devices” that provide support for multiple networking functions (e.g., routing, bridging, switching, Layer 2 aggregation, session border control, Quality of Service, and/or subscriber management), and/or provide support for multiple application services (e.g., data, voice, and video).
  • Figure 10A illustrates connectivity between network devices (NDs) within an exemplary network, as well as three exemplary implementations of the NDs, according to some embodiments.
  • Figure 10A shows NDs 1000A-1000H, and their connectivity by way of lines between 1000A-1000B, 1000B-1000C, 1000C-1000D, 1000D-1000E, 1000E-1000F, 1000F-1000G, and 1000A-1000G, as well as between 1000H and each of 1000A, 1000C, 1000D, and 1000G.
  • These NDs are physical devices, and the connectivity between these NDs can be wireless or wired (often referred to as a link).
  • An additional line extending from NDs 1000A, 1000E, and 1000F illustrates that these NDs act as ingress and egress points for the network (and thus, these NDs are sometimes referred to as edge NDs, while the other NDs may be called core NDs).
  • Two of the exemplary ND implementations in Figure 10A are: 1) a special-purpose network device 1002 that uses custom application-specific integrated-circuits (ASICs) and a special-purpose operating system (OS); and 2) a general purpose network device 1004 that uses common off-the-shelf (COTS) processors and a standard OS.
  • the special-purpose network device 1002 includes networking hardware 1010 comprising compute resource(s) 1012 (which typically include a set of one or more processors), forwarding resource(s) 1014 (which typically include one or more ASICs and/or network processors), and physical network interfaces (NIs) 1016 (sometimes called physical ports), as well as non-transitory machine readable storage media 1018 having stored therein networking software 1020 comprising packet forwarder code 1090A (which, for example, can implement a packet forwarder described herein when executed).
  • a physical NI is hardware in a ND through which a network connection (e.g., wirelessly through a wireless network interface controller (WNIC) or through plugging in a cable to a physical port connected to a network interface controller (NIC)) is made, such as those shown by the connectivity between NDs 1000A-1000H.
  • the networking software 1020 may be executed by the networking hardware 1010 to instantiate a set of one or more networking software instance(s) 1022.
  • Each of the networking software instance(s) 1022, and that part of the networking hardware 1010 that executes that network software instance form a separate virtual network element 1030A-1030R.
  • Each of the virtual network element(s) (VNEs) 1030A-1030R includes a control communication and configuration module 1032A-1032R (sometimes referred to as a local control module or control communication module) and forwarding table(s) 1034A-1034R, such that a given virtual network element (e.g., 1030A) includes the control communication and configuration module (e.g., 1032A), a set of one or more forwarding table(s) (e.g., 1034A), and that portion of the networking hardware 1010 that executes the virtual network element (e.g., 1030A).
  • the special-purpose network device 1002 is often physically and/or logically considered to include: 1) a ND control plane 1024 (sometimes referred to as a control plane) comprising the compute resource(s) 1012 that execute the control communication and configuration module(s) 1032A-1032R; and 2) a ND forwarding plane 1026 (sometimes referred to as a forwarding plane, a data plane, or a media plane) comprising the forwarding resource(s) 1014 that utilize the forwarding table(s) 1034A-1034R and the physical NIs 1016.
  • the ND control plane 1024 (the compute resource(s) 1012 executing the control communication and configuration module(s) 1032A-1032R) is typically responsible for participating in controlling how data (e.g., packets) is to be routed (e.g., the next hop for the data and the outgoing physical NI for that data) and storing that routing information in the forwarding table(s) 1034A-1034R, and the ND forwarding plane 1026 is responsible for receiving that data on the physical NIs 1016 and forwarding that data out the appropriate ones of the physical NIs 1016 based on the forwarding table(s) 1034A-1034R.
  • Figure 10B illustrates an exemplary way to implement the special-purpose network device 1002 according to some embodiments.
  • Figure 10B shows a special-purpose network device including cards 1038 (typically hot pluggable). While in some embodiments the cards 1038 are of two types (one or more that operate as the ND forwarding plane 1026 (sometimes called line cards), and one or more that operate to implement the ND control plane 1024 (sometimes called control cards)), alternative embodiments may combine functionality onto a single card and/or include additional card types (e.g., one additional type of card is called a service card, resource card, or multi-application card).
  • a service card can provide specialized processing (e.g., Layer 4 to Layer 7 services (e.g., firewall, Internet Protocol Security (IPsec), Secure Sockets Layer (SSL) / Transport Layer Security (TLS), Intrusion Detection System (IDS), peer-to-peer (P2P), Voice over IP (VoIP) Session Border Controller, Mobile Wireless Gateways (Gateway General Packet Radio Service (GPRS) Support Node (GGSN), Evolved Packet Core (EPC) Gateway)).
  • the general purpose network device 1004 includes hardware 1040 comprising a set of one or more processor(s) 1042 (which are often COTS processors) and network interface controller(s) 1044 (NICs; also known as network interface cards) (which include physical NIs 1046), as well as non-transitory machine readable storage media 1048 having stored therein software 1050 comprising packet forwarder code 1090B.
  • processor(s) 1042 execute the software 1050 to instantiate one or more sets of one or more applications 1064A-1064R. While one embodiment does not implement virtualization, alternative embodiments may use different forms of virtualization.
  • the virtualization layer 1054 represents the kernel of an operating system (or a shim executing on a base operating system) that allows for the creation of multiple instances 1062A-1062R called software containers that may each be used to execute one (or more) of the sets of applications 1064A-1064R; where the multiple software containers (also called virtualization engines, virtual private servers, or jails) are user spaces (typically a virtual memory space) that are separate from each other and separate from the kernel space in which the operating system is run; and where the set of applications running in a given user space, unless explicitly allowed, cannot access the memory of the other processes.
  • the virtualization layer 1054 represents a hypervisor (sometimes referred to as a virtual machine monitor (VMM)) or a hypervisor executing on top of a host operating system, and each of the sets of applications 1064A-1064R is run on top of a guest operating system within an instance 1062A-1062R called a virtual machine (which may in some cases be considered a tightly isolated form of software container that is run by the hypervisor).
  • one, some or all of the applications are implemented as unikernel(s), which can be generated by compiling directly with an application only a limited set of libraries (e.g., from a library operating system (LibOS) including drivers/libraries of OS services) that provide the particular OS services needed by the application.
  • a unikernel can be implemented to run directly on hardware 1040, directly on a hypervisor (in which case the unikernel is sometimes described as running within a LibOS virtual machine), or in a software container.
  • embodiments can be implemented fully with unikernels running directly on a hypervisor represented by virtualization layer 1054, unikernels running within software containers represented by instances 1062A-1062R, or as a combination of unikernels and the above-described techniques (e.g., unikernels and virtual machines both run directly on a hypervisor, unikernels and sets of applications that are run in different software containers).
  • the instantiation of the one or more sets of one or more applications 1064A-1064R, as well as virtualization if implemented, are collectively referred to as software instance(s) 1052.
  • the virtual network element(s) 1060A-1060R perform similar functionality to the virtual network element(s) 1030A-1030R - e.g., similar to the control communication and configuration module(s) 1032A and forwarding table(s) 1034A (this virtualization of the hardware 1040 is sometimes referred to as network function virtualization (NFV)).
  • while embodiments are described with each instance 1062A-1062R corresponding to one VNE 1060A-1060R, alternative embodiments may implement this correspondence at a finer level of granularity (e.g., line card virtual machines virtualize line cards, control card virtual machines virtualize control cards, etc.); it should be understood that the techniques described herein with reference to a correspondence of instances 1062A-1062R to VNEs also apply to embodiments where such a finer level of granularity and/or unikernels are used.
  • the virtualization layer 1054 includes a virtual switch that provides similar forwarding services as a physical Ethernet switch. Specifically, this virtual switch forwards traffic between instances 1062A-1062R and the NIC(s) 1044, as well as optionally between the instances 1062A-1062R; in addition, this virtual switch may enforce network isolation between the VNEs 1060A-1060R that by policy are not permitted to communicate with each other (e.g., by honoring virtual local area networks (VLANs)).
  • the third exemplary ND implementation in Figure 10A is a hybrid network device 1006, which includes both custom ASICs/special-purpose OS and COTS processors/standard OS in a single ND or a single card within an ND.
  • a platform VM (i.e., a VM that implements the functionality of the special-purpose network device 1002) could provide for para-virtualization to the networking hardware present in the hybrid network device 1006.
  • each of the VNEs receives data on the physical NIs (e.g., 1016, 1046) and forwards that data out the appropriate ones of the physical NIs (e.g., 1016, 1046).
  • a VNE implementing IP router functionality forwards IP packets on the basis of some of the IP header information in the IP packet; where IP header information includes source IP address, destination IP address, source port, destination port (where "source port" and "destination port" refer herein to protocol ports, as opposed to physical ports of a ND), transport protocol (e.g., user datagram protocol (UDP), Transmission Control Protocol (TCP)), and differentiated services code point (DSCP) values.
  • Figure 10C illustrates various exemplary ways in which VNEs may be coupled according to some embodiments.
  • Figure 10C shows VNEs 1070A.1-1070A.P (and optionally VNEs 1070A.Q-1070A.R) implemented in ND 1000A and VNE 1070H.1 in ND 1000H.
  • VNEs 1070A.1-1070A.P are separate from each other in the sense that they can receive packets from outside ND 1000A and forward packets outside of ND 1000A; VNE 1070A.1 is coupled with VNE 1070H.1, and thus they communicate packets between their respective NDs; VNE 1070A.2-1070A.3 may optionally forward packets between themselves without forwarding them outside of the ND 1000A; and VNE 1070A.P may optionally be the first in a chain of VNEs that includes VNE 1070A.Q followed by VNE 1070A.R (this is sometimes referred to as dynamic service chaining, where each of the VNEs in the series of VNEs provides a different service - e.g., one or more layer 4-7 network services). While Figure 10C illustrates various exemplary relationships between the VNEs, alternative embodiments may support other relationships (e.g., more/fewer VNEs, more/fewer dynamic service chains, multiple different dynamic service chains with some common VNEs and some different VNEs).
  • the NDs of Figure 10A may form part of the Internet or a private network; and other electronic devices (not shown; such as end user devices including workstations, laptops, netbooks, tablets, mobile phones, smartphones, multimedia phones, Voice Over Internet Protocol (VOIP) phones, terminals, portable media players, Global Positioning Satellite (GPS) units, wearable devices, gaming systems, set-top boxes, and Internet enabled household appliances) may be coupled to the network (directly or through other networks such as access networks) to communicate over the network (e.g., the Internet or virtual private networks (VPNs) overlaid on (e.g., tunneled through) the Internet) with each other (directly or through servers) and/or access content and/or services.
  • Such content and/or services are typically provided by one or more servers (not shown) belonging to a service/content provider or one or more end user devices (not shown) participating in a peer-to-peer (P2P) service, and may include, for example, public webpages (e.g., free content, store fronts, search services), private webpages (e.g., username/password accessed webpages providing email services), and/or corporate networks over VPNs.
  • end user devices may be coupled (e.g., through customer premise equipment coupled to an access network (wired or wirelessly)) to edge NDs, which are coupled (e.g., through one or more core NDs) to other edge NDs, which are coupled to electronic devices acting as servers.
  • one or more of the electronic devices operating as the NDs in Figure 10A may also host one or more such servers (e.g., in the case of the general purpose network device 1004, one or more of the software instances 1062A-1062R may operate as servers; the same would be true for the hybrid network device 1006; in the case of the special-purpose network device 1002, one or more such servers could also be run on a virtualization layer executed by the compute resource(s) 1012); in which case the servers are said to be co-located with the VNEs of that ND.
  • a virtual network is a logical abstraction of a physical network (such as that in Figure 10A) that provides network services (e.g., L2 and/or L3 services).
  • a virtual network can be implemented as an overlay network (sometimes referred to as a network virtualization overlay) that provides network services (e.g., layer 2 (L2, data link layer) and/or layer 3 (L3, network layer) services) over an underlay network (e.g., an L3 network, such as an Internet Protocol (IP) network that uses tunnels (e.g., generic routing encapsulation (GRE), layer 2 tunneling protocol (L2TP), IPSec) to create the overlay network).
  • a network virtualization edge (NVE) sits at the edge of the underlay network and participates in implementing the network virtualization; the network-facing side of the NVE uses the underlay network to tunnel frames to and from other NVEs; the outward-facing side of the NVE sends and receives data to and from systems outside the network.
  • a virtual network instance (VNI) is a specific instance of a virtual network on a NVE (e.g., a NE/VNE on an ND, a part of a NE/VNE on a ND where that NE/VNE is divided into multiple VNEs through emulation); one or more VNIs can be instantiated on an NVE (e.g., as different VNEs on an ND).
  • a virtual access point (VAP) is a logical connection point on the NVE for connecting external systems to a virtual network; a VAP can be a physical or virtual port identified through a logical interface identifier (e.g., a VLAN ID).
  • Examples of network services include: 1) an Ethernet Local Area Network (LAN) emulation service (an Ethernet-based multipoint service similar to an Internet Engineering Task Force (IETF) Multiprotocol Label Switching (MPLS) or Ethernet VPN (EVPN) service) in which external systems are interconnected across the network by a LAN environment over the underlay network (e.g., an NVE provides separate L2 VNIs (virtual switching instances) for different such virtual networks, and L3 (e.g., IP/MPLS) tunneling encapsulation across the underlay network); and 2) a virtualized IP forwarding service (similar to IETF IP VPN (e.g., Border Gateway Protocol (BGP)/MPLS IPVPN) from a service definition perspective) in which external systems are interconnected across the network by an L3 environment over the underlay network (e.g., an NVE provides separate L3 VNIs (forwarding and routing instances) for different such virtual networks, and L3 (e.g., IP/MPLS) tunneling encapsulation across the underlay network).
  • Network services may also include quality of service capabilities (e.g., traffic classification marking, traffic conditioning and scheduling), security capabilities (e.g., filters to protect customer premises from network - originated attacks, to avoid malformed route announcements), and management capabilities (e.g., full detection and processing).
  • FIG. 10D illustrates a network with a single network element on each of the NDs of Figure 10A, and within this straightforward approach contrasts a traditional distributed approach (commonly used by traditional routers) with a centralized approach for maintaining reachability and forwarding information (also called network control), according to some embodiments.
  • Figure 10D illustrates network elements (NEs) 1070A-1070H with the same connectivity as the NDs 1000A-1000H of Figure 10A.
  • Figure 10D illustrates that the distributed approach 1072 distributes responsibility for generating the reachability and forwarding information across the NEs 1070A-1070H; in other words, the process of neighbor discovery and topology discovery is distributed.
  • the control communication and configuration module(s) 1032A-1032R of the ND control plane 1024 typically include a reachability and forwarding information module to implement one or more routing protocols (e.g., an exterior gateway protocol such as Border Gateway Protocol (BGP), Interior Gateway Protocol(s) (IGP) (e.g., Open Shortest Path First (OSPF), Intermediate System to Intermediate System (IS-IS), Routing Information Protocol (RIP), Label Distribution Protocol (LDP), Resource Reservation Protocol (RSVP) (including RSVP-Traffic Engineering (TE): Extensions to RSVP for Label Switched Path (LSP) Tunnels and Generalized Multi-Protocol Label Switching (GMPLS) Signaling RSVP-TE)) that communicate with other NEs to exchange routes, and then selects those routes based on one or more routing metrics.
  • the NEs 1070A-1070H (e.g., the compute resource(s) 1012 executing the control communication and configuration module(s) 1032A-1032R) perform their responsibility for participating in controlling how data (e.g., packets) is to be routed (e.g., the next hop for the data and the outgoing physical NI for that data) by distributively determining the reachability within the network and calculating their respective forwarding information. Routes and adjacencies are stored in one or more routing structures (e.g., Routing Information Base (RIB), Label Information Base (LIB), and one or more adjacency structures) on the ND control plane 1024.
  • the ND control plane 1024 programs the ND forwarding plane 1026 with information (e.g., adjacency and route information) based on the routing structure(s). For example, the ND control plane 1024 programs the adjacency and route information into one or more forwarding table(s) 1034A-1034R (e.g., Forwarding Information Base (FIB), Label Forwarding Information Base (LFIB), and one or more adjacency structures) on the ND forwarding plane 1026.
  • for layer 2 forwarding, the ND can store one or more bridging tables that are used to forward data based on the layer 2 information in that data. While the above example uses the special-purpose network device 1002, the same distributed approach 1072 can be implemented on the general purpose network device 1004 and the hybrid network device 1006.
  • FIG. 10D illustrates a centralized approach 1074 (also known as software defined networking (SDN)) that decouples the system that makes decisions about where traffic is sent from the underlying systems that forward traffic to the selected destination.
  • the illustrated centralized approach 1074 has the responsibility for the generation of reachability and forwarding information in a centralized control plane 1076 (sometimes referred to as a SDN control module, controller, network controller, OpenFlow controller, SDN controller, control plane node, network virtualization authority, or management control entity), and thus the process of neighbor discovery and topology discovery is centralized.
  • the centralized control plane 1076 has a south bound interface 1082 with a data plane 1080 (sometimes referred to as the infrastructure layer, network forwarding plane, or forwarding plane (which should not be confused with a ND forwarding plane)) that includes the NEs 1070A-1070H (sometimes referred to as switches, forwarding elements, data plane elements, or nodes).
  • the centralized control plane 1076 includes a network controller 1078, which includes a centralized reachability and forwarding information module 1079 that determines the reachability within the network and distributes the forwarding information to the NEs 1070A-1070H of the data plane 1080 over the south bound interface 1082 (which may use the OpenFlow protocol).
  • the network intelligence is centralized in the centralized control plane 1076 executing on electronic devices that are typically separate from the NDs.
  • each of the control communication and configuration module(s) 1032A-1032R of the ND control plane 1024 typically includes a control agent that provides the VNE side of the south bound interface 1082.
  • the ND control plane 1024 (the compute resource(s) 1012 executing the control communication and configuration module(s) 1032A-1032R) performs its responsibility for participating in controlling how data (e.g., packets) is to be routed (e.g., the next hop for the data and the outgoing physical NI for that data) through the control agent communicating with the centralized control plane 1076 to receive the forwarding information (and in some cases, the reachability information) from the centralized reachability and forwarding information module 1079 (it should be understood that in some embodiments, the control communication and configuration module(s) 1032A-1032R, in addition to communicating with the centralized control plane 1076, may also play some role in determining reachability and/or calculating forwarding information - albeit less so than in the case of a distributed approach; such embodiments are generally considered to fall under the centralized approach 1074, but may also be considered a hybrid approach).
  • each of the VNE 1060A-1060R performs its responsibility for controlling how data (e.g., packets) is to be routed (e.g., the next hop for the data and the outgoing physical NI for that data) by communicating with the centralized control plane 1076 to receive the forwarding information (and in some cases, the reachability information) from the centralized reachability and forwarding information module 1079 (it should be understood that in some embodiments, the VNEs 1060A-1060R, in addition to communicating with the centralized control plane 1076, may also play some role in determining reachability and/or calculating forwarding information - albeit less so than in the case of a distributed approach); the same centralized approach can also be implemented with the hybrid network device 1006.
  • NFV is able to support SDN by providing an infrastructure upon which the SDN software can be run, and NFV and SDN both aim to make use of commodity server hardware and physical switches.
  • FIG. 10D also shows that the centralized control plane 1076 has a north bound interface 1084 to an application layer 1086, in which resides application(s) 1088.
  • the centralized control plane 1076 has the ability to form virtual networks 1092 (sometimes referred to as a logical forwarding plane, network services, or overlay networks (with the NEs 1070A- 1070H of the data plane 1080 being the underlay network)) for the application(s) 1088.
  • the centralized control plane 1076 maintains a global view of all NDs and configured NEs/VNEs, and it maps the virtual networks to the underlying NDs efficiently (including maintaining these mappings as the physical network changes either through hardware (ND, link, or ND component) failure, addition, or removal).
  • while Figure 10D shows the distributed approach 1072 separate from the centralized approach 1074, it should be understood that the effort of network control may be distributed differently or the two combined in certain embodiments.
  • for example: 1) embodiments may generally use the centralized approach (e.g., SDN) 1074, but have certain functions delegated to the NEs (e.g., the distributed approach may be used to implement one or more of fault monitoring, performance monitoring, protection switching, and primitives for neighbor and/or topology discovery); or 2) embodiments may perform neighbor discovery and topology discovery via both the centralized control plane and the distributed protocols, and compare the results to raise exceptions where they do not agree. Such embodiments are generally considered to fall under the centralized approach 1074, but may also be considered a hybrid approach.
  • while Figure 10D illustrates the simple case where each of the NDs 1000A-1000H implements a single NE 1070A-1070H, it should be understood that the network control approaches described with reference to Figure 10D also work for networks where one or more of the NDs 1000A-1000H implement multiple VNEs (e.g., VNEs 1030A-1030R, VNEs 1060A-1060R, those in the hybrid network device 1006).
  • the network controller 1078 may also emulate the implementation of multiple VNEs in a single ND.
  • the network controller 1078 may present the implementation of a VNE/NE in a single ND as multiple VNEs in the virtual networks 1092 (all in the same one of the virtual network(s) 1092, each in different ones of the virtual network(s) 1092, or some combination).
  • the network controller 1078 may cause an ND to implement a single VNE (a NE) in the underlay network, and then logically divide up the resources of that NE within the centralized control plane 1076 to present different VNEs in the virtual network(s) 1092 (where these different VNEs in the overlay networks are sharing the resources of the single VNE/NE implementation on the ND in the underlay network).
  • Figures 10E and 10F respectively illustrate exemplary abstractions of NEs and VNEs that the network controller 1078 may present as part of different ones of the virtual networks 1092.
  • Figure 10E illustrates the simple case of where each of the NDs 1000A-1000H implements a single NE 1070A-1070H (see Figure 10D), but the centralized control plane 1076 has abstracted multiple of the NEs in different NDs (the NEs 1070A-1070C and 1070G-1070H) into (to represent) a single NE 1070I in one of the virtual network(s) 1092 of Figure 10D, according to some embodiments.
  • Figure 10E shows that in this virtual network, the NE 1070I is coupled to NE 1070D and 1070F, which are both still coupled to NE 1070E.
  • Figure 10F illustrates a case where multiple VNEs (VNE 1070A.1 and VNE 1070H.1) are implemented on different NDs (ND 1000 A and ND 1000H) and are coupled to each other, and where the centralized control plane 1076 has abstracted these multiple VNEs such that they appear as a single VNE 1070T within one of the virtual networks 1092 of Figure 10D, according to some embodiments.
  • the abstraction of a NE or VNE can span multiple NDs.
  • a network interface may be physical or virtual; and in the context of IP, an interface address is an IP address assigned to a NI, be it a physical NI or virtual NI.
  • a virtual NI may be associated with a physical NI, with another virtual interface, or stand on its own (e.g., a loopback interface, a point-to-point protocol interface).
  • a NI (physical or virtual) may be numbered (a NI with an IP address) or unnumbered (a NI without an IP address).
  • a loopback interface (and its loopback address) is a specific type of virtual NI (and IP address) of a NE/VNE (physical or virtual) often used for management purposes, where such an IP address is referred to as the nodal loopback address; the IP address(es) assigned to the NI(s) of a ND are referred to as IP addresses of that ND; at a more granular level, the IP address(es) assigned to NI(s) assigned to a NE/VNE implemented on a ND can be referred to as IP addresses of that NE/VNE.
  • Next hop selection by the routing system for a given destination may resolve to one path (that is, a routing protocol may generate one next hop on a shortest path); but if the routing system determines there are multiple viable next hops (that is, the routing protocol generated forwarding solution offers more than one next hop on a shortest path - multiple equal cost next hops), some additional criteria are used - for instance, in a connectionless network, Equal Cost Multi Path (ECMP) (also known as Equal Cost Multi Pathing, multipath forwarding and IP multipath) may be used (e.g., typical implementations use as the criteria particular header fields to ensure that the packets of a particular packet flow are always forwarded on the same next hop to preserve packet flow ordering).
  • a packet flow is defined as a set of packets that share an ordering constraint.
  • the set of packets in a particular TCP transfer sequence need to arrive in order, else the TCP logic will interpret the out-of-order delivery as congestion and slow the TCP transfer rate down; a sketch of the flow-hash criterion follows.
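A minimal Python sketch of this flow-pinning criterion is shown below; the choice of a 5-tuple flow key, the CRC-32 hash, and the function name are illustrative assumptions, since any stable hash over consistently chosen header fields achieves the ordering property described above.

    import zlib

    def pick_next_hop(next_hops, src_ip, dst_ip, proto, src_port, dst_port):
        # Hash the fields that identify the flow; packets of one flow thus
        # always map to the same equal-cost next hop, preserving ordering,
        # while distinct flows spread across all next hops.
        flow_key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
        return next_hops[zlib.crc32(flow_key) % len(next_hops)]

    hops = ["next-hop-A", "next-hop-B", "next-hop-C"]
    # Repeated calls with the same 5-tuple return the same next hop.
    print(pick_next_hop(hops, "10.0.0.1", "192.0.2.9", 6, 49152, 443))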
  • Each VNE (e.g., a virtual router or a virtual bridge (which may act as a virtual switch instance in a Virtual Private Local Area Network Service (VPLS))) is typically independently administrable.
  • each of the virtual routers may share system resources but is separate from the other virtual routers regarding its management domain, AAA (authentication, authorization, and accounting) name space, IP address, and routing database(s).
  • Multiple VNEs may be employed in an edge ND to provide direct network access and/or different classes of services for subscribers of service and/or content providers.
  • Some NDs provide support for implementing VPNs (Virtual Private Networks) (e.g., Layer 2 VPNs and/or Layer 3 VPNs).
  • the NDs where a provider's network and a customer's network are coupled are respectively referred to as PEs (Provider Edge) and CEs (Customer Edge).
  • in a Layer 2 VPN, forwarding typically is performed on the CE(s) on either end of the VPN and traffic is sent across the network (e.g., through one or more PEs coupled by other NDs).
  • Layer 2 circuits are configured between the CEs and PEs (e.g., an Ethernet port, an ATM permanent virtual circuit (PVC), a Frame Relay PVC).
  • an edge ND that supports multiple VNEs may be deployed as a PE; and a VNE may be configured with a VPN protocol, and thus that VNE is referred to as a VPN VNE.

Abstract

Exemplary techniques for efficiently reconstructing a forwarding information base (FIB) are described. A util node of a util trie is identified as being "dirty" due to a new, changed, or deleted route. Immediate child util nodes of the dirty util node can be cached. A harvesting process, using a control trie, identifies hybrid nodes to be placed within a child array of a hybrid node of the FIB that corresponds to the util node. The identified hybrid nodes include Point of Harvest (POH) identifiers, which can be compared to cached POH identifiers within the cached child util nodes. When a POH identifier of a cached child util node matches a POH identifier identified by the harvesting process, a hybrid node corresponding to that cached child util node can be reused with its existing sub-tree instead of regenerating the hybrid node and its sub-tree.

Description

TECHNIQUES FOR EFFICIENT FORWARDING INFORMATION BASE RECONSTRUCTION USING POINT OF HARVEST IDENTIFIERS
TECHNICAL FIELD
[0001] Embodiments relate to the field of computer networking; and more specifically, to techniques for efficient forwarding information base reconstruction using Point of Harvest identifiers.
BACKGROUND
[0002] In recent years, the Internet has evolved to become a platform for providing a variety of services including voice, media, and data for both fixed and mobile end users. Additionally, with upcoming technologies such as Fifth Generation (5G) wireless networking, the Internet of Things (IoT), virtualization, etc., a new era of services over the Internet will be created. These new services will also bring even more stringent quality requirements for end users' experience, such as the end users' desire to receive a service as soon as it has been requested, with the least amount of delay and interruptions.
[0003] Accordingly, given the focus upon new networking technologies where provisioning rate demands are higher and flexible software-based solutions are a priority, improved techniques enabling networks to meet these increased requirements are needed.
[0004] One crucial area for meeting these increased requirements is on the data path between a service provider and its end users, and specifically, within the forwarding elements on that path. For example, upon an end user accessing a service, the forwarding elements on the path between the user and the service must be able to configure themselves to accommodate the flow of traffic between the two as quickly as possible. This requirement is especially important with today's reliance on general purpose computing devices (or hybrid devices) implementing various network functions (e.g., Network Functions Virtualization (NFV)), where these devices may not have dedicated, special-purpose hardware for performing such tasks as is the case with hardware-based solutions. Accordingly, techniques are desired for improved forwarding path performance, such as the management/updating of a forwarding information database (also referred to as a Forwarding Information Base, or "FIB"), without the benefit of dedicated hardware components while maintaining deterministic latency constraints for lookups on the FIB.
SUMMARY
[0005] Systems, methods, apparatuses, computer program products, and machine -readable media are provided for efficient forwarding information base (FIB) reconstruction using Point of Harvest identifiers. Embodiments disclosed herein can efficiently manage a FIB by identifying certain subsets of entries of the FIB that can be re-utilized without modification when performing operations on the FIB instead of completely rebuilding large subtrees of the FIB . Moreover, some embodiments utilize techniques where (potentially massively) parallel systems can safely operate in parallel to even further increase performance during FIB updates, without benefitting from special-purpose FIB hardware, and while maintaining deterministic lookup times in the FIB.
[0006] According to some embodiments, a method is provided in a packet forwarder implemented by a device for efficiently reconstructing a forwarding information base (FIB) to reflect a new, changed, or deleted route of a communications network. The method includes determining, by the packet forwarder, that an update to the FIB utilized by the packet forwarder to make forwarding decisions is to be performed to reflect the new, changed, or deleted route. The FIB comprises a data structure having a plurality of levels. The data structure includes one or more hybrid nodes - each acting as a root of a sub-tree of the data structure and each including a pointer to a child array of nodes at a next level of the plurality of levels. The method also includes updating a control trie to reflect the new, changed, or deleted route by adding, removing, or modifying at least one external node in the control trie. The control trie includes one or more external nodes each indicating routing information for one or more of a plurality of routes of the network. The method also includes identifying, within a util trie, a util node that corresponds to a hybrid node of the FIB serving as a root of a subtree that is to be updated due to the new, changed, or deleted route. The util trie also has the same plurality of levels, and includes a plurality of util nodes. Each of the plurality of util nodes is located at a boundary of one of the plurality of levels. Each of the plurality of util nodes corresponds to one of the one or more hybrid nodes of the FIB. The method also includes obtaining a point of harvest (POH) identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie. Each POH identifier identifies, for the corresponding descendant util node, one of the nodes within the control trie that is at a corresponding location within the control trie or is a next node within the control trie beneath the corresponding location within the control trie. The method also includes obtaining, via a harvesting process, data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node. The data further includes a POH identifier for each of the one or more nodes that are hybrid nodes. The method also includes, responsive to determining that the POH identifier of one of the one or more nodes obtained from the harvesting process matches the obtained POH identifier of one of the one or more immediate descendant util nodes, reusing the hybrid node from the child array while reconstructing the child array instead of regenerating the hybrid node and any of its descendant nodes.
[0007] In some embodiments, the method further includes, after the identifying of the util node, inserting a dirty node in a dirty util trie at a location of the dirty util trie corresponding to the location of the identified util node in the util trie and, at a later point in time, traversing the dirty util trie in a top-down breadth-first manner to identify those of the util nodes needing to have their corresponding child arrays reconstructed. The dirty node comprises a pointer to the identified util node of the util trie.
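A short Python sketch of this top-down breadth-first traversal follows. The DirtyNode class and the reconstruction callback are hypothetical stand-ins; the point is only the visiting order, which ensures a parent's child array is reconstructed before those of its descendants.

    from collections import deque

    class DirtyNode:
        def __init__(self, util_node):
            self.util_node = util_node  # pointer to the util node in the util trie
            self.children = []          # dirty nodes at deeper levels

    def process_dirty_trie(dirty_root, reconstruct_child_array):
        # Breadth-first: dirty nodes are visited level by level, top-down.
        queue = deque([dirty_root])
        while queue:
            dirty = queue.popleft()
            reconstruct_child_array(dirty.util_node)
            queue.extend(dirty.children)

    # Example: two levels of dirty nodes; the root's util node is handled
    # before either child's.
    root = DirtyNode("util-root")
    root.children = [DirtyNode("util-child-1"), DirtyNode("util-child-2")]
    process_dirty_trie(root, lambda util_node: print("rebuilding", util_node))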
[0008] In some embodiments, obtaining the POH identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie comprises caching, within the identified util node, each of the one or more immediate descendant util nodes. Each of the one or more immediate descendant util nodes stores its corresponding POH identifier.
[0009] In some embodiments, reconstructing the child array includes determining that one or more of the hybrid nodes of the child array can be reused, and generating a second child array, including copying each of the one or more of the hybrid nodes that can be reused to the second child array. In some embodiments, reconstructing the child array further includes updating a pointer from the hybrid node corresponding to the identified util node to point to the second child array instead of the child array. In some embodiments, at least one of the copied one or more of the hybrid nodes is placed at a different index within the second child array compared to its index within the child array, but in some embodiments, each of the copied one or more of the hybrid nodes is placed at a same index within the second child array as its index within the child array.
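The second-child-array technique in the preceding paragraph can be sketched as follows (again with hypothetical names and structures): reusable hybrid nodes are copied, possibly to different indices, into a freshly allocated array, and only then is the parent's single child-array pointer switched, so a concurrent lookup always observes either the complete old array or the complete new one.

    class Hybrid:
        def __init__(self):
            self.children = []  # the current child array

    def swap_in_second_child_array(parent, reusable, regenerated):
        # reusable:    {new_index: existing hybrid node copied over as-is}
        # regenerated: {new_index: newly generated hybrid node}
        size = max(list(reusable) + list(regenerated)) + 1
        second = [None] * size
        for index, node in reusable.items():     # reuse: copy, do not rebuild
            second[index] = node
        for index, node in regenerated.items():  # everything else is rebuilt
            second[index] = node
        # Single pointer update; the old array can be reclaimed once no
        # reader still references it.
        parent.children = second

    parent = Hybrid()
    survivor = Hybrid()
    swap_in_second_child_array(parent, {2: survivor}, {0: Hybrid()})
    print(parent.children[2] is survivor)  # True - reused at a new index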
[0010] In some embodiments, the method further includes updating the POH identifier of one or more of the util nodes of the util trie responsive to the update of the control trie.
[0011] In some embodiments, the control trie stores route information for the plurality of routes and is indexed by a routing prefix of a route, the control trie further includes one or more split nodes each identifying one or more bit locations of the routing prefix that can be utilized to determine how to traverse the control trie, and the FIB further includes one or more leaf nodes that collectively store forwarding information for the plurality of routes.
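To make the split-node traversal concrete, here is a small runnable Python sketch in the style of a PATRICIA trie; the Split/External classes, the string-of-bits keys, and the final prefix check are illustrative assumptions rather than the patent's exact encoding.

    class Split:
        def __init__(self, bit, left, right):
            self.bit = bit      # bit location of the routing prefix to test
            self.left = left    # child taken when that bit is 0
            self.right = right  # child taken when that bit is 1

    class External:
        def __init__(self, prefix, route_info):
            self.prefix = prefix          # routing prefix, as a bit string
            self.route_info = route_info  # routing information for the route

    def lookup(node, key_bits):
        # Descend using only the bit locations named by split nodes, then
        # verify that the reached external node's prefix actually matches.
        while isinstance(node, Split):
            node = node.right if key_bits[node.bit] == "1" else node.left
        return node if key_bits.startswith(node.prefix) else None

    # Two routes diverging at bit 1: 00* and 01*.
    control_trie = Split(1, External("00", "route via A"), External("01", "route via B"))
    match = lookup(control_trie, "0110")
    print(match.route_info if match else "no match")  # route via B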
[0012] According to some embodiments, a non-transitory machine readable medium provides instructions which, when executed by a processor of a device, will cause the device to implement a packet forwarder to perform operations for efficiently reconstructing a forwarding information base (FIB) to reflect a new, changed, or deleted route of a communications network. The operations include determining that an update to the FIB utilized by the packet forwarder to make forwarding decisions is to be performed to reflect the new, changed, or deleted route. The FIB comprises a data structure having a plurality of levels. The data structure includes one or more hybrid nodes - each acting as a root of a sub-tree of the data structure and each including a pointer to a child array of nodes at a next level of the plurality of levels. The operations also include updating a control trie to reflect the new, changed, or deleted route by adding, removing, or modifying at least one external node in the control trie. The control trie includes one or more external nodes each indicating routing information for one or more of a plurality of routes of the network. The operations also include identifying, within a util trie, a util node that corresponds to a hybrid node of the FIB serving as a root of a subtree that is to be updated due to the new, changed, or deleted route. The util trie also has the same plurality of levels, and includes a plurality of util nodes. Each of the plurality of util nodes is located at a boundary of one of the plurality of levels. Each of the plurality of util nodes corresponds to one of the one or more hybrid nodes of the FIB. The operations also include obtaining a point of harvest (POH) identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie. Each POH identifier identifies, for the corresponding descendant util node, one of the nodes within the control trie that is at a corresponding location within the control trie or is a next node within the control trie beneath the corresponding location within the control trie. The operations also include obtaining, via a harvesting process, data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node. The data further includes a POH identifier for each of the one or more nodes that are hybrid nodes. The operations also include, responsive to determining that the POH identifier of one of the one or more nodes obtained from the harvesting process matches the obtained POH identifier of one of the one or more immediate descendant util nodes, reusing the hybrid node from the child array while reconstructing the child array instead of regenerating the hybrid node and any of its descendant nodes.
[0013] According to some embodiments, a device includes one or more processors and a non-transitory machine-readable storage medium. The non-transitory machine readable medium provides instructions which, when executed by the one or more processors, will cause the device to implement a packet forwarder to perform operations for efficiently reconstructing a forwarding information base (FIB) to reflect a new, changed, or deleted route of a
communications network. The operations include determining that an update to the FIB utilized by the packet forwarder to make forwarding decisions is to be performed to reflect the new, changed, or deleted route. The FIB comprises a data structure having a plurality of levels. The data structure includes one or more hybrid nodes - each acting as a root of a sub-tree of the data structure and each including a pointer to a child array of nodes at a next level of the plurality of levels. The operations also include updating a control trie to reflect the new, changed, or deleted route by adding, removing, or modifying at least one external node in the control trie. The control trie includes one or more external nodes each indicating routing information for one or more of a plurality of routes of the network. The operations also include identifying, within a util trie, a util node that corresponds to a hybrid node of the FIB serving as a root of a subtree that is to be updated due to the new, changed, or deleted route. The util trie also has the same plurality of levels, and includes a plurality of util nodes. Each of the plurality of util nodes is located at a boundary of one of the plurality of levels. Each of the plurality of util nodes corresponds to one of the one or more hybrid nodes of the FIB. The operations also include obtaining a point of harvest (POH) identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie. Each POH identifier identifies, for the corresponding descendant util node, one of the nodes within the control trie that is at a corresponding location within the control trie or is a next node within the control trie beneath the corresponding location within the control trie. The operations also include obtaining, via a harvesting process, data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node. The data further includes a POH identifier for each of the one or more nodes that are hybrid nodes. The operations also include, responsive to determining that the POH identifier of one of the one or more nodes obtained from the harvesting process matches the obtained POH identifier of one of the one or more immediate descendant util nodes, reusing the hybrid node from the child array while reconstructing the child array instead of regenerating the hybrid node and any of its descendant nodes.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments. In the drawings:
[0015] Figure 1 is a block diagram illustrating an exemplary forwarding database according to some embodiments.
[0016] Figure 2 is a block diagram illustrating an exemplary control trie populated using a number of keys corresponding to exemplary network routes according to some embodiments.
[0017] Figure 3 is a block diagram illustrating an exemplary util trie and an exemplary control trie with the util trie overlaid upon it according to some embodiments.
[0018] Figure 4 is a block diagram illustrating a portion of the overlaid util and control tries of Figure 3 and a corresponding portion of the exemplary forwarding database of Figure 1 according to some embodiments.
[0019] Figure 5 is a block diagram illustrating a portion of an overlaid util and control trie with illustrated Point of Harvest (POH) locations according to some embodiments.
[0020] Figure 6 is a flow diagram illustrating a pre-reconstruction flow and a reconstruction flow for efficient forwarding information base reconstruction according to some embodiments.
[0021] Figure 7 is a block diagram illustrating an insertion of a route and some operations performed with various portions of a util trie and/or control trie according to some embodiments.
[0022] Figure 8 is a block diagram illustrating an exemplary overlaid util and control trie before and after a deletion according to some embodiments.
[0023] Figure 9 is a flow diagram illustrating another flow for efficient forwarding information base reconstruction according to some embodiments.
[0024] Figure 10A illustrates connectivity between network devices (NDs) within an exemplary network, as well as three exemplary implementations of the NDs, according to some embodiments.
[0025] Figure 10B illustrates an exemplary way to implement a special-purpose network device according to some embodiments.
[0026] Figure 10C illustrates various exemplary ways in which virtual network elements (VNEs) may be coupled according to some embodiments.
[0027] Figure 10D illustrates a network with a single network element (NE) on each of the NDs, and within this straightforward approach contrasts a traditional distributed approach (commonly used by traditional routers) with a centralized approach for maintaining reachability and forwarding information (also called network control), according to some embodiments.
[0028] Figure 10E illustrates the simple case of where each of the NDs implements a single NE, but a centralized control plane has abstracted multiple of the NEs in different NDs into (to represent) a single NE in one of the virtual network(s), according to some embodiments.
[0029] Figure 10F illustrates a case where multiple VNEs are implemented on different NDs and are coupled to each other, and where a centralized control plane has abstracted these multiple VNEs such that they appear as a single VNE within one of the virtual networks, according to some embodiments.
DETAILED DESCRIPTION
[0030] The following description describes techniques for efficient forwarding information base reconstruction using Point of Harvest identifiers. In the following description, numerous specific details such as logic implementations, opcodes, means to specify operands, resource partitioning/sharing/duplication implementations, types and interrelationships of system components, and logic partitioning/integration choices are set forth in order to provide a more thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the invention may be practiced without such specific details. In other instances, control structures, gate level circuits and full software instruction sequences have not been shown in detail in order not to obscure the invention. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation.
[0031] References in the specification to "one embodiment," "an embodiment," "an example embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
[0032] Bracketed text and blocks with dashed borders (e.g., large dashes, small dashes, dot-dash, and dots) may be used herein to illustrate optional operations that add additional features to embodiments. However, such notation should not be taken to mean that these are the only options or optional operations, and/or that blocks with solid borders are not optional in certain embodiments.
[0033] In the following description and claims, the terms "coupled" and "connected," along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. "Coupled" is used to indicate that two or more elements, which may or may not be in direct physical or electrical contact with each other, co-operate or interact with each other. "Connected" is used to indicate the establishment of communication between two or more elements that are coupled with each other.
[0034] Additionally, the terms "forwarding information base," "FIB," "forwarding database," "forwarding data base," "forwarding table," and common variants thereof may be used synonymously in this description unless otherwise indicated either explicitly or as made obvious by the context of use. Similarly, the terms "control trie," "control database," "control table," etc., may be used synonymously unless otherwise indicated either explicitly or as made obvious by the context of use. Further, the terms "util trie," "util table," "util trie tree," etc., may also be used synonymously unless otherwise indicated either explicitly or as made obvious by the context of use.
[0035] Packet forwarding network elements (or "packet forwarders") have traditionally been implemented using special-purpose network equipment having specialized hardware support for performing certain tasks (e.g., lookups, forwarding table updating, etc.) as fast as possible to enable the most efficient forwarding of data. However, such hardware-based approaches suffer from a variety of problems, ranging from their high cost, the significant hardware "real estate" required to implement them, and their increased power consumption to, perhaps most importantly in recent times, their inapplicability to virtualized platforms, which have become a tremendously important area. Hardware-based approaches are also deficient because they lack flexibility - e.g., routing information must necessarily be stored according to the specifications of the hardware handling it.
[0036] However, software-based solutions also tend to have significant limitations. For example, one often-observed flaw with software-based packet forwarders involves poor performance when updating forwarding tables based upon new, updated, or deleted route information. Such software-based systems utilize a specially-constructed forwarding table requiring as few memory accesses as possible to look up forwarding information. Thus, these systems deploy an extremely "compacted" forwarding database data structure, so when an update needs to be performed, the software-based packet forwarder typically must reconstruct either large portions of the forwarding database or, in some cases, the entire data structure. For example, some packet forwarders use a tree-based data structure for a forwarding database, and to make a change at a particular location of the tree, the packet forwarders will re-construct the entire subtree of the node containing the changes, oftentimes resulting in unnecessary work as other subtrees within the affected region are actually unchanged.
[0037] This produces a significant delay in providing an up-to-date forwarding database, delaying the smooth and rapid setup of new network flows for a new service and thus adding latency to a client's experience. Moreover, as these types of update operations typically affect many portions of the forwarding database, these operations are often necessarily limited to just one processing thread to ensure thread-safety. As a result the updates can be quite slow, despite the operating environment potentially offering significant amounts of unused processing resources.
[0038] Thus, most existing solutions that can efficiently rebuild forwarding information bases use specialized hardware to do so, and existing software-based solutions instead rely on complete subtree rebuilding of newly provisioned areas.
[0039] Accordingly, embodiments provide techniques for efficient forwarding information base reconstruction utilizing a software-based approach that can efficiently reconstruct different parts of the forwarding information base with the minimal number of operations possible based on the provisioned changes. Embodiments can operate by tracking the forwarding information base via specifically chosen data structures (or "control structures") which are easy to manipulate and map against the FIB. These control structures can be examined to allow for only minimal changes to be made in the FIB to correctly reflect the newly provisioned changes. Embodiments can thus make the rebuilding process significantly faster, effectively reducing the latency seen by the client. Furthermore, embodiments are flexibly applicable across different computing platforms, as embodiments can be software-based and need not rely on any specific hardware. Embodiments can also be very versatile and applicable to new trends like the virtualization of networking components.
[0040] Accordingly, embodiments disclosed herein can provide substantial processing gains compared to other software-based solutions. In some disclosed embodiments, on average, the processing power used for performing operations needed for a change to the forwarding database will diminish since each update only rebuilds a sub-tree root node (and perhaps its immediate data) as opposed to the whole sub-tree.
[0041] Embodiments disclosed herein can greatly reduce transient memory usage compared to other software-based solutions. Because the forwarding plane should not be disturbed when updating the FIB, it follows that every sub-tree that will be updated will not be freed until the whole new tree is ready. This means that at some point both sub-trees will exist at the same time (i.e., the old one and the new one). As some embodiments may only rebuild an extremely limited amount of data, only this limited amount of data need be present in transient memory during the rebuild at any particular moment in time, greatly reducing the transient memory usage compared to other approaches that reconstruct large portions of the FIB and thus require significant amounts of transient memory.
[0042] Additionally, embodiments disclosed herein are widely applicable, and can be particularly efficient for longest prefix match types of forwarding tables, where data can be inserted at any prefix length. Accordingly, embodiments are very useful for most widely-deployed protocols, like Internet Protocol (IP) version 4 (IPv4) and version 6 (IPv6).
[0043] Embodiments can also provide high overall performance gains compared to other approaches. Given the processing gains and the parallelism enabled by these embodiments, the provisioning throughput can soar depending on the application. For example, initial tests of an embodiment involving IPv4 provisioning resulted in a performance improvement of approximately 80% compared to another recent software-based provisioning system.
[0044] At their core, embodiments benefit by reducing the number of operations required to perform a change to the forwarding database, effectively increasing the rate at which the changes occur. Embodiments can also ensure that forwarding traffic is never disrupted, that independent operations can be performed in parallel, and that only a minimum number of data structures will get re-processed.
[0045] For the purpose of understanding, we will first explore how a forwarding table can be constructed in some embodiments to have a latency that is capped at a configurable maximum amount (i.e., that the number of memory accesses required to perform a lookup is bounded). Such a forwarding table may look like the one depicted in Figure 1, which is a block diagram illustrating an exemplary forwarding database 100 according to some embodiments. The forwarding database 100 comprises a hierarchical tree-type data structure having one or more hybrid nodes 104 (represented with an "H") and one or more leaf nodes 105 (represented with an "L"). A leaf node 105 can store forwarding information for a particular route (as "data"), and in some embodiments, a leaf node 105 can be a "POP" node, which indicates that, during a traversal of the database 100, the traversal has ended and thus, a back trace of the database 100 should be performed (e.g., to reveal a longest prefix match route at a previously visited leaf node 105). Reaching a POP node during a traversal may also cause additional nodes in the forwarding database 100, referred to as "pushdflt" nodes, to be "popped" as described later herein. The hybrid nodes 104, in contrast, can be used to traverse the database 100 while searching for forwarding information by indicating how to access other nodes of the database 100 at the next level.
[0046] For example, in some embodiments the forwarding database 100 can be traversed in the following manner. During a traversal, each time a hybrid node is landed upon, it is determined whether a "pushdflt" value (e.g., bit) of the hybrid node is set. If the pushdflt value is set, then a pushdflt node (e.g., an actual leaf node) resides in the child array of the hybrid node, and this pushdflt node may be "saved" to some temporary memory location. Eventually, at some point the traversal will arrive at a leaf node or a POP node. If a leaf is hit, the traversal process may return that leaf node. In contrast, if a POP is hit, then the traversal process may return the pushdflt node(s) that were saved in temporary memory (i.e., saved each time the traversal hit a hybrid node with its pushdflt value set).
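By way of illustration only, the following is a minimal Python sketch of the traversal just described; the Node layout, the field names, and the assumption that per-level child-array indices have already been extracted from the key are simplifications for explanation, not the compacted array format of the forwarding database 100.

```python
# Minimal sketch of the lookup traversal described above; the node layout and
# the pre-extracted per-level indices are assumptions for illustration only.
HYBRID, LEAF, POP = "hybrid", "leaf", "pop"

class Node:
    def __init__(self, kind, data=None, children=None, pushdflt=None):
        self.kind = kind          # HYBRID, LEAF, or POP
        self.data = data          # forwarding information (leaf nodes)
        self.children = children  # child array (hybrid nodes)
        self.pushdflt = pushdflt  # pushdflt leaf, if the pushdflt bit is set

def lookup(root, level_indices):
    """level_indices: one child-array index per level, taken from the key."""
    saved = []                            # pushdflt nodes seen on the way down
    node = root
    for idx in level_indices:
        if node.pushdflt is not None:     # "pushdflt bit" is set: save it
            saved.append(node.pushdflt)
        node = node.children[idx]
        if node.kind == LEAF:
            return node.data              # lookup concluded at a data node
        if node.kind == POP:              # back-trace: pop the saved pushdflt
            return saved[-1].data if saved else None
    return None
```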
[0047] As illustrated, Figure 1 shows how the forwarding database 100 is arranged in different levels - level 0 102A, level 1 102B, level 2 102C, level 3 102D, level 4 102E, and level 5 102F. To move from one level to a next level, a certain number of configurable bits from a packet's key can be used in combination with data from the corresponding sub-tree root node (i.e., a hybrid node at that level). The lookup may conclude when a data node is hit (i.e., a leaf node 105), which indicates how to forward a particular packet.

[0048] Each hybrid node 104 serves as a root 106 of a sub-tree. Accordingly, as illustrated, the second hybrid node at level 2 serves as a root of a sub-tree 108 including a portion of levels 3-5 (102D-102F), which will be discussed in additional detail later herein with regard to Figure 4.
[0049] As described above, the forwarding database 100 can be constructed in a particular manner to ensure that a traversal of the forwarding database 100 requires at most a particular number of memory accesses. Thus, forwarding database 100 can be logically and/or physically arranged in an extremely efficient manner to constrain the number of memory accesses to such a maximum value, e.g., by arranging the tree with a maximum number of levels, keeping the individual data structures (e.g., arrays) involved tightly arranged/packed, etc.
[0050] To maintain this efficient layout, other software-based approaches may, upon an update needing to be performed in the forwarding database 100, either completely rebuild the entire forwarding database 100 or perhaps insert/remove/update the involved node (e.g., the root of the sub-tree at 106) and then rebuild its entire sub-tree (e.g., all of its direct and indirect child nodes/arrays; see subtree 108). However, embodiments disclosed herein can avoid such time and processing intensive reconstructive tasks.
[0051] Accordingly, to better create and manage/maintain the forwarding database 100, two other structures are introduced: a control trie table (also referred to as a "control trie," "control database," etc.) and a util trie table (also referred to as a "util trie," "util trie data structure," etc.).
[0052] Figure 2 is a block diagram illustrating an exemplary control trie 200 populated using a number of keys 206 corresponding to exemplary network routes according to some
embodiments. In some embodiments, the control trie 200 can be a binary-type tree (e.g., a PATRICIA trie, where PATRICIA is an acronym for "Practical Algorithm to Retrieve
Information Coded in Alphanumeric") with two types of nodes: split nodes 202 and external nodes 204. External nodes 204 are those nodes containing data (e.g., route/forwarding information) and are illustrated with circles having solid white backgrounds, while split nodes 202 can be "internal" nodes that are used to split routing prefixes at a prefix depth where their first bit differs, and are illustrated with circles having striped backgrounds. As a result, external nodes 204 can be split nodes 202, but the same is not true the other way around.
[0053] In some embodiments, the control trie 200 data structure can be utilized as a first stop for any operation needing to be performed involving routing/forwarding information.
Accordingly, insertions, deletions, and updates of such data (e.g., key/data pairs) may be stored in the control trie 200.
[0054] For example, if the following set of keys is inserted, the resulting tree would look like the control trie 200 of Figure 2: { 0.16.0.0.0.0/13, 0.16.32.32.32.0/37, 0.16.36.36.36.0/39, 0.16.48.48.48.0/37, 0.16.52.56.56.0/37, 0.16.52.56.56.0/45, 0.16.52.56.56.8/45, and 0.16.52.56.63.248/45 }. Of course, these "keys" are exemplary and thus are not IPv4 or IPv6 routes themselves; it is to be understood that the keys can be of a number of useful values known to those of skill in the art.
[0055] The resulting control trie 200 includes a first split node 202 with a "/0" indicator, and because all of the keys begin with an initial zero bit, only one path down the trie 200 to the external node 204 exists. If a search of the tree is performed with a key that falls within the "0.16.0.0.0.0/13" subnet, the traversal will stop at this point with the external node 204, which stores routing/forwarding information for those routes. However, if the key does not fall within that subnet, the traversal will continue at the split node of "/19", where the corresponding bit of the key will be analyzed - if it is a "0", the traversal will continue down the left subtree, otherwise if the bit value is a "1", the traversal will continue down the right subtree. The traversal will continue through the trie 200 until the key is matched with one of the external nodes 204, which may or may not be at a "leaf" location of the tree.
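For explanation purposes, the following is a hedged Python sketch of this bit-test walk; the CtrlNode fields, the fixed 48-bit key width (matching the six-octet example keys), and the omission of the full prefix comparison that a complete PATRICIA search performs are all simplifying assumptions.

```python
# Hedged sketch of the control-trie walk; field names are illustrative.
class CtrlNode:
    def __init__(self, bit_pos, data=None, left=None, right=None):
        self.bit_pos = bit_pos   # the "/N" depth at which this node splits
        self.data = data         # non-None for external (data-bearing) nodes
        self.left = left         # subtree taken for bit value 0
        self.right = right       # subtree taken for bit value 1

def key_bit(key, pos, width=48):
    """Return bit `pos` of an integer key, bit 0 being the most significant."""
    return (key >> (width - 1 - pos)) & 1

def control_search(root, key):
    """Descend by testing the bit at each split depth, remembering the last
    external node passed, which is the longest-prefix-match candidate."""
    best, node = None, root
    while node is not None:
        if node.data is not None:
            best = node
        node = node.right if key_bit(key, node.bit_pos) else node.left
    return best
```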
[0056] Notably, in updating the control trie 200, a split node 202 can be created each time a set of inserted keys/prefixes diverge at a particular bit location.
[0057] In some embodiments, the control trie 200 data structure stores all the information the "client" has inserted. Accordingly, the actual forwarding database can be generated by parsing the control trie 200. However, some embodiments further utilize a util trie. Figure 3 is a block diagram illustrating an exemplary util trie 300 and an exemplary control trie with the util trie overlaid 350 upon it according to some embodiments.
[0058] The util trie 300 can be a binary tree type structure that is utilized to hold information regarding the forwarding table sub-tree root nodes, and can include util nodes 302 and sometimes even split nodes (as in the control trie 200). Each solid black dot on Figure 3 represents a util node 302.
[0059] A util node 302 can have all of the data needed to reproduce the corresponding hybrid node 104 in the forwarding database 100. Embodiments disclosed herein utilize such a util trie 300 because it is much more efficient to manage a binary tree than the flattened-out and latency-efficient forwarding table when performing provisioning operations.
[0060] In some embodiments, the util nodes 302 of the util trie 300 are inserted only on "stride" boundaries. The number of strides (e.g., stride/levels 301A-301E) in the util trie 300 corresponds to the number of levels of the forwarding database 100, and the "length" of each stride can be the number of bits used from the key to move from one level to the next.
[0061] In the util trie 300 example, the strides may be represented as 13-8-8-8-8 (which also means we have 5 levels after the root, and that the first stride 301A represents 13 bits, a second stride 301B (not illustrated) represents 8 bits, etc.). Horizontal lines at each stride's depth are illustrated herein as extending over the control trie (which has been overlaid 350 with the util trie 300) to show where the util nodes will be generated - see, for example, stride boundary 304A, stride boundary 304B, etc., stride boundary 304F.
[0062] Util nodes will be generated at these particular positions of the util trie 300, as each time a stride ends, the next set of bits from the lookup prefix of the key are to be indexed in order to jump to the next level and process the following node.
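As a non-limiting illustration, the following Python sketch shows how a key could be split into per-level child-array indices under the 13-8-8-8-8 stride layout of this example; the helper name and the integer-key representation are assumptions.

```python
# Sketch of deriving per-level child-array indices from a key under the
# 13-8-8-8-8 stride layout of this example; names are illustrative.
STRIDES = [13, 8, 8, 8, 8]            # 45 bits total, 5 levels below the root

def stride_indices(key, key_bits=sum(STRIDES)):
    indices, consumed = [], 0
    for stride in STRIDES:
        shift = key_bits - consumed - stride
        indices.append((key >> shift) & ((1 << stride) - 1))
        consumed += stride
    return indices
```

Each resulting index selects one slot of a child array holding 2^stride entries, which is how a lookup jumps from one level to the next.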
[0063] The util trie 300 can in some embodiments be generated from control trie 200 for the purpose of creating (or managing/updating) the forwarding information base 100.
[0064] When viewing the util trie 300 and control trie 200 together, a clearer picture is provided on how the forwarding database 100 is represented in the control plane and how the global mapping can be done. Accordingly, exemplary operations for constructing the forwarding database 100 using the control trie 200 and util trie 300 are provided.
[0065] Starting at the root node (at the first stride boundary 304A), there is no data (i.e., external node) present in the control trie and at the same stride boundary 304A there is a util node. Accordingly, as the util node corresponds to a hybrid node in the forwarding database 100, such a hybrid node is constructed to be the root of the tree. That util node is then "mapped" to the hybrid node at the root of the forwarding database 100, such as by storing a pointer to the hybrid node within the util node, or by storing enough information to allow for a pointer to the hybrid node to be ascertained, a pointer to the child array of the hybrid node to be ascertained, additional information to allow for the other elements of the hybrid node to be recreated, etc. This hybrid node 104 is shown in Figure 1 at Level 0 102A.
[0066] This created hybrid node, as a hybrid node, is to include a pointer to its child array in the forwarding database 100, which includes hybrid node(s) and/or leaf node(s). Returning to the control trie 200, the next phase of "harvesting" includes looking at everything that exists between the root node and the next stride boundary (304B) at "/13" - all of this will be
"flattened out." In this case, there exists one external node (or "client data" node) for the
"0.16.0.0.0.0/13" route within the stride within the control trie 200, and one util node (in the util trie 300) within the stride.
[0067] Notably, the external node and the util node share a same location - i.e., right at the stride boundary 304B - and thus, in some embodiments this raises a special case. Because of this shared location, each node would have the same index in the resultant child array, and thus, under the special case, only one of them can exist in the child array - in this case, a hybrid node for the util node. However, the data for the external node is not lost; it can instead be placed one level lower, into a special location within the child array, called a "pushdflt" node as introduced earlier. Such pushdflt nodes (not illustrated) are the ones "popped" when a "POP node" is reached while traversing the forwarding database 100. Thus, in this special case, the hybrid node on level 1 will have a child array on level 2 that includes a special pushdflt node with the forwarding information of external node "0.16.0.0.0.0/13."
[0068] If the external node and the util node did not share a same location (which is not the case here), the child array (within the forwarding database 100) of the first hybrid node 104 would have included one hybrid node (corresponding to the discovered util node) and one leaf node (corresponding to the external node), and one (or more) POP node(s).
[0069] In some embodiments, this child array 122 may include a total of 2^13 nodes, where the exponent 13 is derived from the size of the stride - i.e., 13 bits. Accordingly, in such an embodiment, the child array 122 may include one hybrid node, one leaf node, and (2^13 - 2) POP nodes.
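The following Python sketch illustrates, under assumed simplifications (a tuple-based node encoding standing in for the real array format), how one child array could be filled from a harvested stride, including the special shared-location case in which external-node data is demoted to the pushdflt slot.

```python
# Hedged sketch of building one child array from a harvested stride; the
# tuple encoding and parameter names are illustrative, not the real layout.
def fill_child_array(stride_bits, hybrid_indices, leaf_entries):
    """hybrid_indices: indices that receive hybrid nodes;
    leaf_entries: {index: forwarding_data} harvested from the stride."""
    array = [("POP", None)] * (1 << stride_bits)   # e.g. 2^13 = 8192 slots
    pushdflt = None
    for idx in hybrid_indices:
        array[idx] = ("HYBRID", None)              # child pointer filled later
    for idx, data in leaf_entries.items():
        if array[idx][0] == "HYBRID":
            pushdflt = ("LEAF", data)              # shared location: demote the
        else:                                      # data to the pushdflt slot
            array[idx] = ("LEAF", data)
    return array, pushdflt
```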
[0070] With the one hybrid node (which is, as discussed, a sub-tree root), the process thus continues with performing another harvesting of another level. Thus, within the next stride, there exist three split nodes (one at "/19" that is between stride boundaries, and two at "/21" that are at the stride boundary) along with two util nodes (that are at the "/21" stride boundary). In some embodiments, the split nodes are not considered during the flattening. Thus, in some such embodiments, two hybrid nodes will be constructed in a new child array for the two util nodes, and one or more POP nodes will similarly be constructed to fill the child array. In some embodiments, the child array can include 256 (= 2^8) nodes, as the stride length is 8 bits. This child array is shown in Figure 1 as the child array at level 2 102C, in which one of the hybrid nodes is shown as being a root 106 of a sub-tree (though of course, all hybrid nodes act as roots of sub-trees, including the other hybrid node of that same child array).
[0071] Accordingly, the leftmost hybrid node is mapped to the corresponding left util node (e.g., by storing a pointer to the leftmost hybrid node in that util node), and construction will continue by again harvesting for that node by identifying all nodes within the next stride. In this case, there are only two util nodes (at "/29") and no external nodes, so another child array is created (e.g., with 2^8 = 256 nodes) to include two hybrid nodes corresponding to the two harvested util nodes, and pointers to these hybrid nodes are stored within the util nodes of the util trie.
[0072] Similarly, the harvesting process can continue, such as at the leftmost util node at "/29", which has just one external node storing route information for "0.16.32.32.32.0/37." Accordingly, another child array is created (e.g., with 2^8 = 256 nodes), where one of these nodes is a leaf node storing data for the "0.16.32.32.32.0/37" route, and the others of the nodes are POP leaf nodes. The second util node at "/29" will also be harvested, which includes just one util node in the stride beneath it, leading to a child array with one hybrid node (corresponding to the util node) and other leaf nodes as POP nodes.
[0073] Next, the process may continue with harvesting for this util node (at "/37"), where the flattening out process will "flatten" out everything between the /37 boundary and the /45 boundary (304F). This process will find only one external node (for "0.16.36.36.36.0/39") in the stride beneath it. Because this external node is not directly on a stride boundary, some embodiments can "expand" this "/39" leaf as the following 2^(45-39) = 64 routes:
0.16.36.36.36.0/45
0.16.36.36.36.8/45
0.16.36.36.36.16/45
0.16.36.36.36.24/45
...
0.16.36.36.37.248/45
[0074] As shown above, each listed route is the result of being "incremented" by 8 units (e.g., .0 to .8, .8 to .16, etc.), as we only use bits up to bit 45 (and thus, three trailing bits are left out). Due to this flattening, all of these routes will include the same routing information as that of the /39 route. Accordingly, this single /39 route may simply be expanded to instead be represented as 64 routes (/45 routes) in the child array.
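A hedged sketch of this expansion, assuming the six-octet example keys are held as 48-bit integers (the function and parameter names are illustrative):

```python
# Sketch of the expansion just shown: a /39 leaf becomes 2^(45-39) = 64
# aliases at /45, stepping by 2^(48-45) = 8 because three trailing bits of
# the 48-bit example keys are left out of the lookup.
def expand_leaf(prefix_value, prefix_len, stride_end, key_bits=48):
    step = 1 << (key_bits - stride_end)       # 8 in this example
    count = 1 << (stride_end - prefix_len)    # 64 in this example
    return [prefix_value + i * step for i in range(count)]
```

For the /39 example, expand_leaf(key, 39, 45) yields 64 values whose last entry corresponds to the "0.16.36.36.37.248/45" route listed above.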
[0075] At this point, the leftmost branch (from the standpoint of "/19") has been constructed, and the process similarly continues with the right branch by performing the harvesting process at each stride boundary in a similar manner. Accordingly, the remainder of this "flattening out" process for constructing a forwarding database 100 will not be reproduced at length here, as it is a trivial exercise using the aforementioned techniques.
[0076] We now concentrate upon the process of updating the forwarding database 100. As described above, a previously-preferred procedure for an update of (e.g., inserting, deleting, or modifying) a route included discarding and rebuilding a subtree that is directly affected by the update. This resulted from the difficulty of converting the comparatively "friendly" (or easy to use) control structures (e.g., control trie 350, etc.) into the highly compacted and constrained array structures composing the forwarding database 100.
[0077] For example, a previous algorithm may simply identify a closest "sub-tree root" where the change is in the control trie, and then simply reconstruct the entire control trie 350 and/or util trie 300, and also reconstruct the corresponding subtree in the forwarding database 100. This requires, in most cases, a large number of "re-harvests" of the control-side tries and thus a large number of new child arrays of the forwarding database 100 to be constructed.
[0078] In contrast, embodiments disclosed herein can intelligently update the forwarding database 100 without these large-scale reconstructions and significant re-harvests of the control- side structures.
[0079] To further describe such techniques, we turn to Figure 4, which is a block diagram illustrating a portion 400 of the overlaid 350 util and control tries of Figure 3 and a
corresponding portion 450 of the exemplary forwarding database 100 of Figure 1 according to some embodiments. In this example, we suppose that a new node is to be inserted at an insertion point 404 for new routing information. Accordingly, some embodiments can include identifying the util node 402 above the location needing to be updated, as the insertion is in the "harvesting zone" of the corresponding hybrid node (and thus, the child array 452 of that hybrid node). Due to the update (i.e., insertion), it is known that the child array 452 needs to be modified, and, upon a careful inspection, it can be determined that the sub-trees 454 of the child array 452 do not need to be modified/reconstructed because they, if regenerated via re-harvesting, would return exactly the same thing. Thus, although the hybrid nodes of the child array 452 may end up at different indices within the child array 452, the contents will remain the same.
[0080] To implement such a technique, some embodiments utilize a new data structure referred to as a "dirty util trie" (not illustrated), which can be used to track util nodes (of the util trie 300) needing to be updated (i.e., have their child arrays be reconstructed via a harvesting of these util nodes). The dirty util trie can be a trie tree that keeps track of all util nodes that need to be updated after a set of operations in the control trie 350. The dirty util trie can thus be a lightweight structure that may simply include pointers to those of the util nodes in the util trie 300 that are dirty.
[0081] Accordingly, in some embodiments when an update is performed that will perturb a hybrid node (e.g., create a need to rebuild its child array), the dirty util trie will be modified to include a node that references that corresponding "dirty" util node.
[0082] Although in some embodiments the util trie itself could be used for this purpose (e.g., by using a bit value to mark a particular util node as "dirty"), embodiments can utilize a dirty util trie instead to gain additional benefits. For example, in many cases there could be millions of routes stored in a very large util trie data structure, and thus, using a dirty util trie eliminates the need to walk the entire util trie to check every single util node to see if it is dirty. Moreover, the dirty util trie is tremendously lightweight and can reduce the amount of storage overhead, as its entries don't need to have data aside from a pointer to a corresponding util node - in contrast, adding an additional field into the util trie for every util node could be tremendously wasteful when only a few (or none at all) of the nodes are dirty.
[0083] Accordingly, as illustrated in Figure 4, when an update is to occur (e.g., an insertion at insertion point 404) that creates a "dirty" util node 402, an entry corresponding to the util node 402 can be inserted into the dirty util trie, which may comprise a pointer to that util node 402.
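For illustration, a minimal Python sketch of such a dirty util trie follows; the dict keyed by (prefix, prefix length) is an assumed simplification of an actual trie structure, and the prefix/prefix_len fields on util nodes are hypothetical.

```python
# Minimal sketch of the dirty util trie: a lightweight index of references
# to util nodes whose child arrays must be rebuilt.
class DirtyUtilTrie:
    def __init__(self):
        self._dirty = {}                  # (prefix, prefix_len) -> util node

    def mark(self, util_node):
        """Record a pointer to a dirty util node."""
        self._dirty[(util_node.prefix, util_node.prefix_len)] = util_node

    def top_down(self):
        """Yield dirty util nodes shallowest-first, mirroring the
        breadth-first, top-down traversal described later herein."""
        for key in sorted(self._dirty, key=lambda k: k[1]):
            yield self._dirty[key]
```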
[0084] Further, in some embodiments, when it is determined that the util node 402 is "dirty," its child nodes 408 may be identified and then cached (e.g., within the util node 402 itself). These cached nodes may be used later as described herein as part of the forwarding database 100 reconstruction process, though in other embodiments it is not necessary to perform such a caching (at the expense of additional processing/time required to later re-acquire these nodes).
[0085] One additional concept and data value, referred to as a "point of harvest" (or POH), is used in some embodiments. For the purpose of explanation, we turn to Figure 5, which is a block diagram illustrating a portion 500 of an overlaid util and control trie with illustrated Point of Harvest (POH) locations according to some embodiments.
[0086] In some embodiments, every util node in the util trie may have an associated POH, which may simply be one of the nodes in the control trie.
[0087] Each sub-tree root node in the forwarding table gets all its information by "harvesting" or flattening all the external nodes between the util node that represents it and the beginning of the next level. Because the util node itself doesn't belong to the control trie, embodiments select the next closest node in line with the same prefix as the starting point, or point of harvest, for this process. The POH can be a split node or external node.
[0088] Additionally, a useful point to observe is that for all util nodes on a same level, their POHs are necessarily unique, meaning that no two util nodes in a given level will share the same POH. This is very useful and can serve as the key to determine whether a "new" hybrid node is actually new or if it already exists, and thus can be reused when processing one of the dirty util nodes.
[0089] For example, as illustrated in Figure 5, the topmost util node 'A' 302A has a POH that is the split node 'A' 502 (of the control trie) at the same level. This is one common POH location for a util node - the node of the control trie at a corresponding location.
[0090] Another common POH location for a util node is the next node in the control trie beneath the corresponding location. For example, util node 'B' 302B does not have a control node at the corresponding level/location, and thus, the POH 504 is the next control node beneath it, which is the external node 'B' 525. In this case, the util node 'B' 302B and its POH 504 (external node 525) are at different levels 550.

[0091] Additionally, some nodes can be a POH for multiple util nodes and thus be "shared" by nodes at different levels 570, though as indicated above, a node cannot be a POH for multiple util nodes that are at a same level. For example, for util node 'C' 302C, its POH 506 is at a level beneath it, and this POH 506 is also shared with the util node 'D' 302D.
[0092] Accordingly, the POH for a util node can be determined by selecting the control node in the control trie that is at a same location within the control trie (e.g., POH 502 for util node 'A' 302A, POH 506 for util node 'D' 302D), and, if one does not exist, selecting the next control node beneath that corresponding location in the control trie (e.g., POH 504 for util node 'B' 302B, POH 506 for util node 'C' 302C). Put another way, the POH can be either at the same prefix length as the util node or lower in the control trie, but never higher.
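A hedged Python sketch of this rule follows, reusing the control-trie node shape assumed in the earlier sketch: the control trie is walked along the util node's prefix path, and the first node at the util node's depth or below is its POH.

```python
# Sketch of determining a util node's POH; field names mirror the earlier
# CtrlNode sketch and the 48-bit key width is an assumption.
def _bit(key, pos, width=48):
    return (key >> (width - 1 - pos)) & 1

def find_poh(control_root, util_prefix, util_prefix_len):
    node = control_root
    while node is not None and node.bit_pos < util_prefix_len:
        # still above the util node's depth: keep descending on its prefix
        node = node.right if _bit(util_prefix, node.bit_pos) else node.left
    return node        # at the same prefix length or lower, never higher
```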
[0093] Turning back to Figure 4, we note that in some embodiments, an identifier of the POH for each util node may be identified and cached within the particular util node. Thus, in those embodiments (described above) that cache a copy of the immediate descendant util nodes 408 in the dirty util node 402, these cached copies can include identifiers of the POHs for these util nodes. Thus, as these util nodes 408 are located on a same level, we know that these POHs must necessarily be different. This property leads to a useful benefit in that these POHs can uniquely identify the cached util nodes at this level despite the absence of other unique information.
[0094] Thus, embodiments can use this POH uniqueness property to provide the re-use benefits described herein. For example, Figure 6 is a flow diagram illustrating a pre- reconstruction phase flow 600 and a reconstruction phase flow 630 for efficient forwarding information base 100 reconstruction according to some embodiments. The operations in the flow diagrams will be described with reference to the exemplary embodiments of the other figures. However, it should be understood that the operations of the flow diagrams can be performed by embodiments other than those discussed with reference to the other figures, and the
embodiments discussed with reference to these other figures can perform operations different than those discussed with reference to the flow diagrams. In some embodiments, the operations of the flow 600 and/or operations of the flow 630 can be performed by a packet forwarder network element as described herein, a control plane entity (implemented by a same or different device or at a same or different physical/geographic location as a corresponding forwarding plane), etc.
[0095] A "pre -reconstruction" phase flow 600 and a "reconstruction" phase flow 630 are illustrated. These phases may be performed adjacent in time to each other, though in other embodiments, the pre-reconstruction phase may be performed one or more times (for one or more changes) and then the reconstruction phase may be performed at a later point in time (for one or more changes). [0096] The pre-reconstruction flow 600 includes, at block 602, identifying a util trie node that is to be considered "dirty" based upon a change. In some embodiments, block 602 includes identifying a util node just "above" the location of the change (e.g., an insert, deletion, modification) that occurs within the control trie.
[0097] At block 604, the flow 600 optionally includes caching the immediate "child" util nodes (including their POH pointers) of the identified util trie node. These child util nodes may be cached within that identified util trie node.
[0098] At block 606, the flow 600 includes inserting a node in a "dirty util trie" that corresponds to the identified util trie node. In some embodiments, the node inserted includes a reference to the identified util trie node, such as a pointer.
[0099] At block 608, the flow 600 includes updating the POH value for all affected util nodes due to the change of the control trie. For example, a newly-inserted node in the control trie might have just become the new POH for a util node.
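Tying blocks 602-608 together, the following is a hedged Python sketch of the pre-reconstruction phase; every accessor on util_trie here (parent_of, children_of, affected_by, recompute_poh) is a hypothetical helper introduced for illustration, not an interface defined herein.

```python
# Hedged sketch of the pre-reconstruction phase (blocks 602-608).
def pre_reconstruction(change, util_trie, dirty_trie):
    dirty = util_trie.parent_of(change.prefix, change.prefix_len)  # block 602
    dirty.cached_children = [(c, c.poh)                            # block 604
                             for c in util_trie.children_of(dirty)]
    dirty_trie.mark(dirty)                                         # block 606
    for util in util_trie.affected_by(change):                     # block 608
        util.poh = util_trie.recompute_poh(util)
```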
[00100] This flow 600 may be performed one or multiple times, and thus, there may be one or multiple different dirty util nodes in the dirty util trie. Thus, in some embodiments, the reconstruction process (for perhaps multiple ones of the dirty util nodes, such as those at a same level, etc.) can be performed concurrently/simultaneously by different processing elements (e.g., threads, processes, processors, etc.). For example, in some embodiments, the dirty util trie can be traversed using a breadth-first, top-down process to easily identify dirty util nodes and launch the efficient "reconstruction" of the child arrays (corresponding to the util nodes corresponding to the identified dirty util nodes) using, for example, parallelism that could be provided by multithreading techniques.
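As a sketch of one such parallelization, assuming dirty util nodes at the same depth affect disjoint subtrees so that each depth can be dispatched to a thread pool with a barrier before descending (the per-node `rebuild` callable is an assumed stand-in for the reconstruction work):

```python
# Hedged sketch of level-by-level parallel processing of dirty util nodes.
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

def process_dirty_in_parallel(dirty_nodes, rebuild, workers=8):
    ordered = sorted(dirty_nodes, key=lambda n: n.prefix_len)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _, level in groupby(ordered, key=lambda n: n.prefix_len):
            list(pool.map(rebuild, list(level)))   # barrier between levels
```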
[00101] Accordingly, the reconstruction phase flow 630 may begin with block 632 to identify a dirty util trie node, which identifies a util node in the util trie that needs to have its child array reconstructed. This may occur according to a breadth-first, top-down traversal of the dirty util trie, which can thus enable safe parallel reconstruction of the forwarding database 100.
[00102] The flow 630 may then include, at block 634, marking the util node for deletion in the util trie. In some embodiments, the util node will ultimately be deleted and replaced with a newly-constructed util node.
[00103] At block 636, the flow 630 includes harvesting the util node, which can include block 638 and executing a harvesting algorithm (e.g., by a separate harvester process) to identify one or more nodes (including one or more hybrid nodes) to be placed into the child array of the hybrid node corresponding to the util node within the forwarding database. This optionally can include, at block 640, walking at least some of the control trie starting with the node identified by the POH of the identified util trie node to thus identify all nodes within the level/stride beneath the util trie node. This can optionally include block 642, where the harvesting algorithm returns a set of entries to construct the nodes for the child array. Each of the entries that is for a hybrid node includes an index of where that hybrid node is to be placed in the child array, and a point of harvest (POH) pointer of the hybrid node that identifies a node within the control trie that serves as its POH.
[00104] At block 644, the flow 630 includes creating a new util node (corresponding to the identified util node that was marked as dirty) and creating/updating the corresponding new child array in the forwarding database. This can include, at block 646, for every util node (and corresponding hybrid node) to be created, checking to determine whether an existing node of the forwarding database can be re-used by comparing the returned POH (e.g., from block 642) with the cached POH (e.g., from block 604). If, at block 648, it is determined that the POHs are the same, then the flow can reuse the util (and hybrid) node instead of reconstructing them.
[00105] Notably, this entire reconstruction phase flow 630 can be performed multiple times, such as when the dirty util trie includes multiple dirty util nodes. Additionally, some or all of these operations can be performed in parallel.
[00106] For ease of understanding, we will present a visual explanation with regard to Figure 7, which is a block diagram 700 illustrating an insertion of a route and some operations performed with various portions of a util trie and/or control trie according to some embodiments.
[00107] This figure shows exemplary operations in response to a new route insertion, where an insertion point 404 is illustrated and whose nearest parent util node is thus deemed "dirty" 705 (e.g., via identification block 602). As detailed in blocks 632/634, during the reconstruction phase this util node can be identified, and thus, the harvesting process will be performed. The harvesting algorithm can be performed by a "harvester" process, which examines the control trie starting at the POH and ending at the stride boundary, to create a data structure containing all the nodes to be placed in the child array including any hybrid nodes (and their POHs), leaf nodes, and their corresponding indices within the child array. In this example, the harvesting process reveals the new data point (e.g., which will become a new leaf node in the child array of the forwarding database) and two util nodes (e.g., which will be two new hybrid nodes in the child array). Notably, the harvesting process does not know that the hybrid nodes corresponding to the harvested util nodes are the same as what is currently deployed within the forwarding database.
[00108] As one example, in some embodiments the harvester process returns a set of entries corresponding to nodes that are to be constructed for the child array. The set of entries may include 2^(number of stride bits) entries, and thus, for example, for a stride of 8 bits the harvester process may return a set of 2^8 = 256 entries. Each entry can include a variety of types of information, such as a type of node (e.g., hybrid type, leaf type, POP type), an index in which the node is to be placed in the child array, a POH pointer (for hybrid nodes), a leaf data pointer (for leaf nodes), etc.
[00109] Accordingly, with this harvested data, a builder process can be used to translate this data into the format required for a child array in the forwarding database. The builder process can utilize the "re-use" logic disclosed herein to determine whether the hybrid node requests (corresponding to the returned harvest data) are for the "same" hybrid nodes as the ones that are currently used in the forwarding database. Notably, it is non-trivial to determine whether these are the same - especially because they are standing on the same prefix. However, because we have the POH values, which are unique for each level, these POH values can be used to identify the "same" nodes. Thus, the builder process at this point examines the data returned by the harvester and, for each entry, populates the child array. This re-use logic can be called each time one of the entries returned by the harvester is a hybrid node - if the POH matches one of the cached ones then the hybrid node is reused; otherwise, a new task needs to be created for a new hybrid node to be built.
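The following Python sketch pairs an assumed harvester-entry shape with the builder's re-use check; the key point is that a matching POH (unique within a level, per paragraph [0088]) proves the deployed hybrid node is unchanged and can be copied as-is. All names here are illustrative.

```python
# Hedged sketch of the builder's re-use logic; the HarvestEntry shape, the
# POP sentinel, and the build_new_hybrid callable are assumptions.
from dataclasses import dataclass
from typing import Any, Optional

POP_SLOT = object()                       # sentinel marking POP entries

@dataclass
class HarvestEntry:
    kind: str                             # "hybrid", "leaf", or "pop"
    index: int                            # slot within the child array
    poh: Optional[object] = None          # control-trie node (hybrids only)
    data: Any = None                      # forwarding info (leaves only)

def build_child_array(entries, cached_children, size, build_new_hybrid):
    """cached_children: (hybrid_node, poh) pairs cached before the change."""
    reuse = {id(poh): hyb for hyb, poh in cached_children}
    array = [POP_SLOT] * size
    for e in entries:
        if e.kind == "hybrid":
            # same POH at the same level -> reuse the deployed hybrid node
            array[e.index] = reuse.get(id(e.poh)) or build_new_hybrid(e)
        elif e.kind == "leaf":
            array[e.index] = e.data
    return array
```

The identity match via id() assumes the POH is cached as a pointer to the control-trie node, as described above.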
[00110] Accordingly, in some embodiments, when the harvester returns a request to build a hybrid node (at a particular index), the request includes the hybrid node type, the index at which the node needs to be placed within the child array, and an identifier of the POH for the node indicating where one must go (e.g., a pointer to a node in the control trie) to start harvesting for the next level.
[00111] The harvester process can determine the POH when it walks the nodes in the stride. For example, upon going down to the control trie node (720) and determining that it has crossed its stride, the harvester process can determine that, since it did cross the stride, there must be a util node above it, and that this control trie node that it landed upon is the POH for that util node.
[00112] Accordingly, instead of instructing the builder process to construct a new hybrid node, the re-use logic can first compare (see 720, see also 646/648) the returned POH value (returned with the hybrid node data by the harvester) with the cached POH value of the cached util nodes.
[00113] When the returned POH value is the same as the POH value cached in one of the child util nodes, the builder can thus be instructed to re-use the existing hybrid node - e.g., the builder can be instructed to make a copy of the hybrid node in a new child array that is being
constructed. As that hybrid node includes a pointer to any of its descendant nodes in the forwarding database, none of these descendant nodes need to be reconstructed.
[00114] Due to the modifications that caused this update (here, an insertion), it is useful to note that the index of the hybrid node may or may not change within the child array, even though the rest of the hybrid node remains the same. Notably, however, the order of the hybrid nodes will not change within the child array even though their particular indices may now be different; this property can be utilized to reduce the number of POH values of the cached child util nodes that need to be compared to a returned POH.
[00115] For example, in a child array having the following hybrid nodes A[0], B[1], and C[2], after one or more updates the same hybrid nodes may be located at different indices: A[0], B[4], and C[6]. Thus, although these hybrid nodes are at different indices of the child array, they remain in the same relative order - among the three, A is still first, B is still second, and C is still third. Thus, if the very first returned POH value is not the same as the POH value of the first cached util node (corresponding to "A"), the process may stop searching for a match because it cannot exist.
[00116] Accordingly, a new child array can be constructed, which due to this insertion may include a new leaf node for a new route, and other leaf nodes may remain the same and the existing hybrid nodes may also remain the same (except for possibly their indices) and be reused due to this logic.
[00117] Some embodiments can thus utilize a "make before break" philosophy, creating a new child array and, upon its completion, switching over the pointer from its parent hybrid node from the "old" child array to the "new" child array, which can be extremely fast so that no traffic being processed is interrupted due to temporarily missing forwarding information.
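A minimal sketch of that switch-over, assuming the parent hybrid node holds a single child-array pointer and that reclamation of the old array is deferred (e.g., via an RCU-style grace period, an assumption here) until no in-flight lookup can reference it:

```python
# Sketch of the make-before-break commit: the replacement child array is
# fully built off to the side, then published with one pointer store, so
# lookups never observe a half-built array.
def commit_child_array(parent_hybrid, new_array):
    old = parent_hybrid.children
    parent_hybrid.children = new_array   # the only forwarding-visible step
    return old                           # free once no reader can hold it
```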
[00118] As another example, Figure 8 is a block diagram illustrating an exemplary overlaid util and control trie before 800 and after 850 a deletion according to some embodiments. In this case, the change may involve a deletion of an external node 802 from the control trie. In this case, due to the deletion, the split node 804 in the control trie will disappear after the deletion. Accordingly, the util node 852 will now have a new POH - the external node 854 on the bottom left. Additionally, the util nodes 856 on the right side may still exist, but essentially become useless, and thus may have a null POH, and can be considered dirty and added to the dirty util trie (as they may need to be deleted), along with their forwarding table 100 counterparts, when the dirty util trie is processed.
[00119] Accordingly, these util nodes 852/856 were affected by the change, causing their POHs to become NULL or to be reassigned. As indicated in the figure, there are cases where a single change (e.g., deleting an entry in the control trie) can affect multiple util nodes - in this case, three util nodes 852/856. Two of these util nodes 856 (lower right) now have NULL POHs, and the util node 852 now has the lower left entry of the control trie 854 as its POH.
[00120] In some embodiments that perform batch updates (e.g., 10 changes) before building, it is important to ensure that the control structures are up-to-date. For example, as shown in Figure 8, this one deletion results in several POH changes, and there may continue to be even more changes made on the control side. Thus, in some embodiments, the POH values may be updated after every change to avoid a situation where, during a later harvesting and building process, the POH comparisons would not match even though they should (i.e., the hybrid nodes are the same).
[00121] Another flow 900 is presented in Figure 9, which is a flow diagram illustrating another flow for efficient forwarding information base reconstruction according to some embodiments. The flow 900 can be performed by, for example, a packet forwarder network element as described herein, a control plane entity (implemented by a same or different device or at a same or different physical/geographic location as a corresponding forwarding plane), etc.
[00122] At block 905, the flow 900 includes determining that an update to a forwarding information base (FIB) utilized to make forwarding decisions is to be performed to reflect a new, changed, or deleted route. The FIB comprises a data structure having a plurality of levels, and includes one or more hybrid nodes each acting as a root of a sub-tree of the data structure and each including a pointer to a child array of nodes at a next level of the plurality of levels.
[00123] At block 910, the flow 900 includes updating a control trie to reflect the new, changed, or deleted route by adding, removing, or modifying at least one external node in the control trie. The control trie includes one or more external nodes, each indicating routing information for one or more of a plurality of routes of the network.
[00124] The flow 900 also includes, at block 915, identifying, within a util trie data structure, a util node that corresponds to a hybrid node of the FIB serving as a root of a subtree that is to be updated due to the new, changed, or deleted route. The util trie also has the plurality of levels and includes a plurality of util nodes. Each of the plurality of util nodes is located at a boundary of one of the plurality of levels, and each of the plurality of util nodes corresponds to one of the one or more hybrid nodes of the FIB.
[00125] The flow 900 also includes, at block 920, obtaining a point of harvest (POH) identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie. Each POH identifier identifies, for the corresponding descendant util node, one of the nodes within the control trie that is at a corresponding location within the control trie or is a next node within the control trie beneath the corresponding location within the control trie.
[00126] At block 925, the flow 900 includes obtaining, via a harvesting process, data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node. The data further includes a POH identifier for each of the one or more nodes that are hybrid nodes.
[00127] At block 930, responsive to determining that the POH identifier of one of the one or more nodes obtained from the harvesting process matches the obtained POH identifier of one of the one or more immediate descendant util nodes, the flow 900 includes reusing the hybrid node within the existing child array while reconstructing the child array instead of regenerating the hybrid node and any of its descendant nodes.
[00128] Embodiments disclosed herein may involve the use of one or more electronic devices. An electronic device stores and transmits (internally and/or with other electronic devices over a network) code (which is composed of software instructions and which is sometimes referred to as computer program code or a computer program) and/or data using machine-readable media (also called computer-readable media), such as machine-readable storage media (e.g., magnetic disks, optical disks, read only memory (ROM), flash memory devices, phase change memory) and machine-readable transmission media (also called a carrier) (e.g., electrical, optical, radio, acoustical or other form of propagated signals - such as carrier waves, infrared signals). Thus, an electronic device (e.g., a computer) includes hardware and software, such as a set of one or more processors coupled to one or more machine-readable storage media to store code for execution on the set of processors and/or to store data. For instance, an electronic device may include non-volatile memory containing the code, since the non-volatile memory can persist code/data even when the electronic device is turned off (when power is removed), and while the electronic device is turned on that part of the code that is to be executed by the processor(s) of that electronic device is typically copied from the slower non-volatile memory into volatile memory (e.g., dynamic random access memory (DRAM), static random access memory (SRAM)) of that electronic device. Typical electronic devices also include a set of one or more physical network interface(s) (NI) to establish network connections (to transmit and/or receive code and/or data using propagating signals) with other electronic devices. One or more parts of an embodiment may be implemented using different combinations of software, firmware, and/or hardware.
[00129] A network device (ND) is an electronic device that communicatively interconnects other electronic devices on the network (e.g., other network devices, end-user devices). Some network devices are "multiple services network devices" that provide support for multiple networking functions (e.g., routing, bridging, switching, Layer 2 aggregation, session border control, Quality of Service, and/or subscriber management), and/or provide support for multiple application services (e.g., data, voice, and video).
[00130] Figure 10A illustrates connectivity between network devices (NDs) within an exemplary network, as well as three exemplary implementations of the NDs, according to some embodiments. Figure 10A shows NDs 1000A-1000H, and their connectivity by way of lines between 1000A-1000B, 1000B-1000C, 1000C-1000D, 1000D-1000E, 1000E-1000F, 1000F-1000G, and 1000A-1000G, as well as between 1000H and each of 1000A, 1000C, 1000D, and 1000G. These NDs are physical devices, and the connectivity between these NDs can be wireless or wired (often referred to as a link). An additional line extending from NDs 1000A, 1000E, and 1000F illustrates that these NDs act as ingress and egress points for the network (and thus, these NDs are sometimes referred to as edge NDs; while the other NDs may be called core NDs).
[00131] Two of the exemplary ND implementations in Figure 10A are: 1) a special-purpose network device 1002 that uses custom application-specific integrated-circuits (ASICs) and a special-purpose operating system (OS); and 2) a general purpose network device 1004 that uses common off-the-shelf (COTS) processors and a standard OS.
[00132] The special-purpose network device 1002 includes networking hardware 1010 comprising compute resource(s) 1012 (which typically include a set of one or more processors), forwarding resource(s) 1014 (which typically include one or more ASICs and/or network processors), and physical network interfaces (NIs) 1016 (sometimes called physical ports), as well as non-transitory machine readable storage media 1018 having stored therein networking software 1020 comprising packet forwarder code 1090A (which, for example, can implement a packet forwarder described herein when executed). A physical NI is hardware in a ND through which a network connection (e.g., wirelessly through a wireless network interface controller (WNIC) or through plugging in a cable to a physical port connected to a network interface controller (NIC)) is made, such as those shown by the connectivity between NDs 1000A-1000H. During operation, the networking software 1020 may be executed by the networking hardware 1010 to instantiate a set of one or more networking software instance(s) 1022. Each of the networking software instance(s) 1022, and that part of the networking hardware 1010 that executes that network software instance (be it hardware dedicated to that networking software instance and/or time slices of hardware temporally shared by that networking software instance with others of the networking software instance(s) 1022), form a separate virtual network element 1030A-1030R. Each of the virtual network element(s) (VNEs) 1030A-1030R includes a control communication and configuration module 1032A-1032R (sometimes referred to as a local control module or control communication module) and forwarding table(s) 1034A-1034R, such that a given virtual network element (e.g., 1030A) includes the control communication and configuration module (e.g., 1032A), a set of one or more forwarding table(s) (e.g., 1034A), and that portion of the networking hardware 1010 that executes the virtual network element (e.g., 1030A).
[00133] The special-purpose network device 1002 is often physically and/or logically considered to include: 1) a ND control plane 1024 (sometimes referred to as a control plane) comprising the compute resource(s) 1012 that execute the control communication and configuration module(s) 1032A-1032R; and 2) a ND forwarding plane 1026 (sometimes referred to as a forwarding plane, a data plane, or a media plane) comprising the forwarding resource(s) 1014 that utilize the forwarding table(s) 1034A-1034R and the physical NIs 1016. By way of example, where the ND is a router (or is implementing routing functionality), the ND control plane 1024 (the compute resource(s) 1012 executing the control communication and
configuration module(s) 1032A-1032R) is typically responsible for participating in controlling how data (e.g., packets) is to be routed (e.g., the next hop for the data and the outgoing physical NI for that data) and storing that routing information in the forwarding table(s) 1034A-1034R, and the ND forwarding plane 1026 is responsible for receiving that data on the physical NIs 1016 and forwarding that data out the appropriate ones of the physical NIs 1016 based on the forwarding table(s) 1034A-1034R.
[00134] Figure 10B illustrates an exemplary way to implement the special-purpose network device 1002 according to some embodiments. Figure 10B shows a special-purpose network device including cards 1038 (typically hot pluggable). While in some embodiments the cards 1038 are of two types (one or more that operate as the ND forwarding plane 1026 (sometimes called line cards), and one or more that operate to implement the ND control plane 1024 (sometimes called control cards)), alternative embodiments may combine functionality onto a single card and/or include additional card types (e.g., one additional type of card is called a service card, resource card, or multi-application card). A service card can provide specialized processing (e.g., Layer 4 to Layer 7 services (e.g., firewall, Internet Protocol Security (IPsec), Secure Sockets Layer (SSL) / Transport Layer Security (TLS), Intrusion Detection System (IDS), peer-to-peer (P2P), Voice over IP (VoIP) Session Border Controller, Mobile Wireless Gateways (Gateway General Packet Radio Service (GPRS) Support Node (GGSN), Evolved Packet Core (EPC) Gateway)). By way of example, a service card may be used to terminate IPsec tunnels and execute the attendant authentication and encryption algorithms. These cards are coupled together through one or more interconnect mechanisms illustrated as backplane 1036 (e.g., a first full mesh coupling the line cards and a second full mesh coupling all of the cards).
[00135] Returning to Figure 10A, the general purpose network device 1004 includes hardware 1040 comprising a set of one or more processor(s) 1042 (which are often COTS processors) and network interface controller(s) 1044 (NICs; also known as network interface cards) (which include physical NIs 1046), as well as non-transitory machine readable storage media 1048 having stored therein software 1050 comprising packet forwarder code 1090B. During operation, the processor(s) 1042 execute the software 1050 to instantiate one or more sets of one or more applications 1064A-1064R. While one embodiment does not implement virtualization, alternative embodiments may use different forms of virtualization. For example, in one such alternative embodiment the virtualization layer 1054 represents the kernel of an operating system (or a shim executing on a base operating system) that allows for the creation of multiple instances 1062A-1062R called software containers that may each be used to execute one (or more) of the sets of applications 1064A-1064R; where the multiple software containers (also called virtualization engines, virtual private servers, or jails) are user spaces (typically a virtual memory space) that are separate from each other and separate from the kernel space in which the operating system is run; and where the set of applications running in a given user space, unless explicitly allowed, cannot access the memory of the other processes. In another such alternative embodiment the virtualization layer 1054 represents a hypervisor (sometimes referred to as a virtual machine monitor (VMM)) or a hypervisor executing on top of a host operating system, and each of the sets of applications 1064A-1064R is run on top of a guest operating system within an instance 1062A-1062R called a virtual machine (which may in some cases be considered a tightly isolated form of software container) that is run on top of the hypervisor - the guest operating system and application may not know they are running on a virtual machine as opposed to running on a "bare metal" host electronic device, or through para-virtualization the operating system and/or application may be aware of the presence of virtualization for optimization purposes. In yet other alternative embodiments, one, some or all of the applications are implemented as unikernel(s), which can be generated by compiling directly with an application only a limited set of libraries (e.g., from a library operating system (LibOS) including drivers/libraries of OS services) that provide the particular OS services needed by the application. As a unikernel can be implemented to run directly on hardware 1040, directly on a hypervisor (in which case the unikernel is sometimes described as running within a LibOS virtual machine), or in a software container, embodiments can be implemented fully with unikernels running directly on a hypervisor represented by virtualization layer 1054, unikernels running within software containers represented by instances 1062A-1062R, or as a combination of unikernels and the above-described techniques (e.g., unikernels and virtual machines both run directly on a hypervisor, unikernels and sets of applications that are run in different software containers).
[00136] The instantiation of the one or more sets of one or more applications 1064A-1064R, as well as virtualization if implemented, are collectively referred to as software instance(s) 1052. Each set of applications 1064A-1064R, corresponding virtualization construct (e.g., instance 1062A-1062R) if implemented, and that part of the hardware 1040 that executes them (be it hardware dedicated to that execution and/or time slices of hardware temporally shared), forms a separate virtual network element 1060A-1060R. [00137] The virtual network element(s) 1060A-1060R perform similar functionality to the virtual network element(s) 1030A-1030R - e.g., similar to the control communication and configuration module(s) 1032A and forwarding table(s) 1034A (this virtualization of the hardware 1040 is sometimes referred to as network function virtualization (NFV)). Thus, NFV may be used to consolidate many network equipment types onto industry standard high volume server hardware, physical switches, and physical storage, which could be located in data centers, NDs, and customer premise equipment (CPE). While embodiments are illustrated with each instance 1062A-1062R corresponding to one VNE 1060A-1060R, alternative embodiments may implement this correspondence at a finer level of granularity (e.g., line card virtual machines virtualize line cards, control card virtual machines virtualize control cards, etc.); it should be understood that the techniques described herein with reference to a correspondence of instances 1062A-1062R to VNEs also apply to embodiments where such a finer level of granularity and/or unikernels are used.
[00138] In certain embodiments, the virtualization layer 1054 includes a virtual switch that provides forwarding services similar to those of a physical Ethernet switch. Specifically, this virtual switch forwards traffic between instances 1062A-1062R and the NIC(s) 1044, as well as optionally between the instances 1062A-1062R; in addition, this virtual switch may enforce network isolation between the VNEs 1060A-1060R that by policy are not permitted to communicate with each other (e.g., by honoring virtual local area networks (VLANs)).
[00139] The third exemplary ND implementation in Figure 10A is a hybrid network device 1006, which includes both custom ASICs/special-purpose OS and COTS processors/standard OS in a single ND or a single card within an ND. In certain embodiments of such a hybrid network device, a platform VM (i.e., a VM that implements the functionality of the special-purpose network device 1002) could provide for para-virtualization to the networking hardware present in the hybrid network device 1006.
[00140] Regardless of the above exemplary implementations of an ND, when a single one of multiple VNEs implemented by an ND is being considered (e.g., only one of the VNEs is part of a given virtual network) or where only a single VNE is currently being implemented by an ND, the shortened term network element (NE) is sometimes used to refer to that VNE. Also in all of the above exemplary implementations, each of the VNEs (e.g., VNE(s) 1030A-1030R, VNEs 1060A-1060R, and those in the hybrid network device 1006) receives data on the physical NIs (e.g., 1016, 1046) and forwards that data out the appropriate ones of the physical NIs (e.g., 1016, 1046). For example, a VNE implementing IP router functionality forwards IP packets on the basis of some of the IP header information in the IP packet, where IP header information includes source IP address, destination IP address, source port, destination port (where "source port" and "destination port" refer herein to protocol ports, as opposed to physical ports of a ND), transport protocol (e.g., user datagram protocol (UDP) or Transmission Control Protocol (TCP)), and differentiated services code point (DSCP) values.
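By way of illustration only, the following simplified C sketch shows how such a VNE might select an outgoing physical NI by walking a stride-based FIB trie keyed on the destination IP address; the stride width, structure layout, and names (fib_node, fib_lookup) are assumptions made for brevity, not the FIB organization of the embodiments described earlier.

#include <stdint.h>

#define STRIDE_BITS 8                       /* assumed per-level stride */

struct fib_node {
    struct fib_node *children;              /* child array; NULL at a leaf */
    int out_port;                           /* physical NI index; -1 if none */
};

/* Walk the trie one stride at a time, remembering the last node that
 * carried forwarding information (longest-prefix-match semantics). */
static int fib_lookup(const struct fib_node *n, uint32_t dst_ip)
{
    int best = -1;
    for (int shift = 32 - STRIDE_BITS; n != NULL; shift -= STRIDE_BITS) {
        if (n->out_port >= 0)
            best = n->out_port;
        if (n->children == NULL || shift < 0)
            break;
        n = &n->children[(dst_ip >> shift) & ((1u << STRIDE_BITS) - 1)];
    }
    return best;                            /* NI to forward on, or -1 to drop */
}

Because the walk visits at most 32 / STRIDE_BITS levels, the lookup cost in such a layout stays bounded regardless of table size.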
[00141] Figure 10C illustrates various exemplary ways in which VNEs may be coupled according to some embodiments. Figure 10C shows VNEs 1070A.1-1070A.P (and optionally VNEs 1070A.Q-1070A.R) implemented in ND 1000A and VNE 1070H.1 in ND 1000H. In Figure 10C, VNEs 1070A.1-1070A.P are separate from each other in the sense that they can receive packets from outside ND 1000A and forward packets outside of ND 1000A; VNE 1070A.1 is coupled with VNE 1070H.1, and thus they communicate packets between their respective NDs; VNEs 1070A.2-1070A.3 may optionally forward packets between themselves without forwarding them outside of the ND 1000A; and VNE 1070A.P may optionally be the first in a chain of VNEs that includes VNE 1070A.Q followed by VNE 1070A.R (this is sometimes referred to as dynamic service chaining, where each of the VNEs in the series of VNEs provides a different service - e.g., one or more layer 4-7 network services). While Figure 10C illustrates various exemplary relationships between the VNEs, alternative embodiments may support other relationships (e.g., more/fewer VNEs, more/fewer dynamic service chains, multiple different dynamic service chains with some common VNEs and some different VNEs).
[00142] The NDs of Figure 10A, for example, may form part of the Internet or a private network; and other electronic devices (not shown; such as end user devices including
workstations, laptops, netbooks, tablets, palm tops, mobile phones, smartphones, phablets, multimedia phones, Voice Over Internet Protocol (VOIP) phones, terminals, portable media players, Global Positioning Satellite (GPS) units, wearable devices, gaming systems, set-top boxes, Internet enabled household appliances) may be coupled to the network (directly or through other networks such as access networks) to communicate over the network (e.g., the Internet or virtual private networks (VPNs) overlaid on (e.g., tunneled through) the Internet) with each other (directly or through servers) and/or access content and/or services. Such content and/or services are typically provided by one or more servers (not shown) belonging to a service/content provider or one or more end user devices (not shown) participating in a peer-to-peer (P2P) service, and may include, for example, public webpages (e.g., free content, store fronts, search services), private webpages (e.g., username/password accessed webpages providing email services), and/or corporate networks over VPNs. For instance, end user devices may be coupled (e.g., through customer premise equipment coupled to an access network (wired or wirelessly)) to edge NDs, which are coupled (e.g., through one or more core NDs) to other edge NDs, which are coupled to electronic devices acting as servers. However, through compute and storage virtualization, one or more of the electronic devices operating as the NDs in Figure 10A may also host one or more such servers (e.g., in the case of the general purpose network device 1004, one or more of the software instances 1062A-1062R may operate as servers; the same would be true for the hybrid network device 1006; in the case of the special-purpose network device 1002, one or more such servers could also be run on a virtualization layer executed by the compute resource(s) 1012); in which case the servers are said to be co-located with the VNEs of that ND.
[00143] A virtual network is a logical abstraction of a physical network (such as that in Figure 10A) that provides network services (e.g., L2 and/or L3 services). A virtual network can be implemented as an overlay network (sometimes referred to as a network virtualization overlay) that provides network services (e.g., layer 2 (L2, data link layer) and/or layer 3 (L3, network layer) services) over an underlay network (e.g., an L3 network, such as an Internet Protocol (IP) network that uses tunnels (e.g., generic routing encapsulation (GRE), layer 2 tunneling protocol (L2TP), IPSec) to create the overlay network).
[00144] A network virtualization edge (NVE) sits at the edge of the underlay network and participates in implementing the network virtualization; the network-facing side of the NVE uses the underlay network to tunnel frames to and from other NVEs; the outward-facing side of the NVE sends and receives data to and from systems outside the network. A virtual network instance (VNI) is a specific instance of a virtual network on an NVE (e.g., a NE/VNE on an ND, a part of a NE/VNE on an ND where that NE/VNE is divided into multiple VNEs through emulation); one or more VNIs can be instantiated on an NVE (e.g., as different VNEs on an ND). A virtual access point (VAP) is a logical connection point on the NVE for connecting external systems to a virtual network; a VAP can be a physical or virtual port identified through a logical interface identifier (e.g., a VLAN ID).
[00145] Examples of network services include: 1) an Ethernet Local Area Network (LAN) emulation service (an Ethernet-based multipoint service similar to an Internet Engineering Task Force (IETF) Multiprotocol Label Switching (MPLS) or Ethernet VPN (EVPN) service) in which external systems are interconnected across the network by a LAN environment over the underlay network (e.g., an NVE provides separate L2 VNIs (virtual switching instances) for different such virtual networks, and L3 (e.g., IP/MPLS) tunneling encapsulation across the underlay network); and 2) a virtualized IP forwarding service (similar to IETF IP VPN (e.g., Border Gateway Protocol (BGP)/MPLS IPVPN) from a service definition perspective) in which external systems are interconnected across the network by an L3 environment over the underlay network (e.g., an NVE provides separate L3 VNIs (forwarding and routing instances) for different such virtual networks, and L3 (e.g., IP/MPLS) tunneling encapsulation across the underlay network). Network services may also include quality of service capabilities (e.g., traffic classification marking, traffic conditioning and scheduling), security capabilities (e.g., filters to protect customer premises from network-originated attacks, to avoid malformed route announcements), and management capabilities (e.g., fault detection and processing).
[00146] Figure 10D illustrates a network with a single network element on each of the NDs of Figure 10A, and within this straightforward approach contrasts a traditional distributed approach (commonly used by traditional routers) with a centralized approach for maintaining reachability and forwarding information (also called network control), according to some embodiments. Specifically, Figure 10D illustrates network elements (NEs) 1070A-1070H with the same connectivity as the NDs 1000A-1000H of Figure 10A.
[00147] Figure 10D illustrates that the distributed approach 1072 distributes responsibility for generating the reachability and forwarding information across the NEs 1070A-1070H; in other words, the process of neighbor discovery and topology discovery is distributed.
[00148] For example, where the special-purpose network device 1002 is used, the control communication and configuration module(s) 1032A-1032R of the ND control plane 1024 typically include a reachability and forwarding information module to implement one or more routing protocols (e.g., an exterior gateway protocol such as Border Gateway Protocol (BGP), Interior Gateway Protocol(s) (IGP) (e.g., Open Shortest Path First (OSPF), Intermediate System to Intermediate System (IS-IS), Routing Information Protocol (RIP), Label Distribution Protocol (LDP), Resource Reservation Protocol (RSVP) (including RSVP-Traffic Engineering (TE): Extensions to RSVP for Label Switched Path (LSP) Tunnels and Generalized Multi-Protocol Label Switching (GMPLS) Signaling RSVP-TE))) that communicate with other NEs to exchange routes, and then select those routes based on one or more routing metrics. Thus, the NEs 1070A-1070H (e.g., the compute resource(s) 1012 executing the control communication and configuration module(s) 1032A-1032R) perform their responsibility for participating in controlling how data (e.g., packets) is to be routed (e.g., the next hop for the data and the outgoing physical NI for that data) by distributively determining the reachability within the network and calculating their respective forwarding information. Routes and adjacencies are stored in one or more routing structures (e.g., Routing Information Base (RIB), Label
Information Base (LIB), one or more adjacency structures) on the ND control plane 1024. The ND control plane 1024 programs the ND forwarding plane 1026 with information (e.g., adjacency and route information) based on the routing structure(s). For example, the ND control plane 1024 programs the adjacency and route information into one or more forwarding table(s) 1034A-1034R (e.g., Forwarding Information Base (FIB), Label Forwarding Information Base (LFIB), and one or more adjacency structures) on the ND forwarding plane 1026. For layer 2 forwarding, the ND can store one or more bridging tables that are used to forward data based on the layer 2 information in that data. While the above example uses the special-purpose network device 1002, the same distributed approach 1072 can be implemented on the general purpose network device 1004 and the hybrid network device 1006.
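Purely as an illustrative aid, the following simplified C sketch shows the kind of download step described above, in which the control plane programs selected routes from a RIB into a forwarding table consulted by the data plane; the flat table, types, and names (route, fib_entry, program_fib) are hypothetical and omit adjacency resolution entirely.

#include <stddef.h>
#include <stdint.h>

/* Illustrative shapes only; a real RIB/FIB is far richer. */
struct route     { uint32_t prefix; uint8_t len; int next_hop_ni; };
struct fib_entry { uint32_t prefix; uint8_t len; int next_hop_ni; };

/* Download the route and next-hop information of each selected route
 * into the table consulted by the forwarding plane. */
static size_t program_fib(const struct route *rib, size_t n_routes,
                          struct fib_entry *fib, size_t fib_cap)
{
    size_t installed = 0;
    for (size_t i = 0; i < n_routes && installed < fib_cap; i++) {
        fib[installed].prefix      = rib[i].prefix;
        fib[installed].len         = rib[i].len;
        fib[installed].next_hop_ni = rib[i].next_hop_ni;
        installed++;
    }
    return installed;                       /* number of routes programmed */
}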
[00149] Figure 10D illustrates a centralized approach 1074 (also known as software defined networking (SDN)) that decouples the system that makes decisions about where traffic is sent from the underlying systems that forward traffic to the selected destination. The illustrated centralized approach 1074 has the responsibility for the generation of reachability and forwarding information in a centralized control plane 1076 (sometimes referred to as a SDN control module, controller, network controller, OpenFlow controller, SDN controller, control plane node, network virtualization authority, or management control entity), and thus the process of neighbor discovery and topology discovery is centralized. The centralized control plane 1076 has a south bound interface 1082 with a data plane 1080 (sometimes referred to as the infrastructure layer, network forwarding plane, or forwarding plane (which should not be confused with a ND forwarding plane)) that includes the NEs 1070A-1070H (sometimes referred to as switches, forwarding elements, data plane elements, or nodes). The centralized control plane 1076 includes a network controller 1078, which includes a centralized reachability and forwarding information module 1079 that determines the reachability within the network and distributes the forwarding information to the NEs 1070A-1070H of the data plane 1080 over the south bound interface 1082 (which may use the OpenFlow protocol). Thus, the network intelligence is centralized in the centralized control plane 1076 executing on electronic devices that are typically separate from the NDs.
[00150] For example, where the special-purpose network device 1002 is used in the data plane 1080, each of the control communication and configuration module(s) 1032A-1032R of the ND control plane 1024 typically includes a control agent that provides the VNE side of the south bound interface 1082. In this case, the ND control plane 1024 (the compute resource(s) 1012 executing the control communication and configuration module(s) 1032A-1032R) performs its responsibility for participating in controlling how data (e.g., packets) is to be routed (e.g., the next hop for the data and the outgoing physical NI for that data) through the control agent communicating with the centralized control plane 1076 to receive the forwarding information (and in some cases, the reachability information) from the centralized reachability and forwarding information module 1079 (it should be understood that in some embodiments, the control communication and configuration module(s) 1032A-1032R, in addition to
communicating with the centralized control plane 1076, may also play some role in determining reachability and/or calculating forwarding information - albeit less so than in the case of a distributed approach; such embodiments are generally considered to fall under the centralized approach 1074, but may also be considered a hybrid approach).
[00151] While the above example uses the special-purpose network device 1002, the same centralized approach 1074 can be implemented with the general purpose network device 1004 (e.g., each of the VNEs 1060A-1060R performs its responsibility for controlling how data (e.g., packets) is to be routed (e.g., the next hop for the data and the outgoing physical NI for that data) by communicating with the centralized control plane 1076 to receive the forwarding information (and in some cases, the reachability information) from the centralized reachability and forwarding information module 1079; it should be understood that in some embodiments, the VNEs 1060A-1060R, in addition to communicating with the centralized control plane 1076, may also play some role in determining reachability and/or calculating forwarding information - albeit less so than in the case of a distributed approach) and the hybrid network device 1006. In fact, the use of SDN techniques can enhance the NFV techniques typically used in the general purpose network device 1004 or hybrid network device 1006 implementations as NFV is able to support SDN by providing an infrastructure upon which the SDN software can be run, and NFV and SDN both aim to make use of commodity server hardware and physical switches.
[00152] Figure 10D also shows that the centralized control plane 1076 has a north bound interface 1084 to an application layer 1086, in which resides application(s) 1088. The centralized control plane 1076 has the ability to form virtual networks 1092 (sometimes referred to as a logical forwarding plane, network services, or overlay networks (with the NEs 1070A- 1070H of the data plane 1080 being the underlay network)) for the application(s) 1088. Thus, the centralized control plane 1076 maintains a global view of all NDs and configured
NEs/VNEs, and it maps the virtual networks to the underlying NDs efficiently (including maintaining these mappings as the physical network changes either through hardware (ND, link, or ND component) failure, addition, or removal).
[00153] While Figure 10D shows the distributed approach 1072 separate from the centralized approach 1074, the effort of network control may be distributed differently or the two combined in certain embodiments. For example: 1) embodiments may generally use the centralized approach (e.g., SDN) 1074, but have certain functions delegated to the NEs (e.g., the distributed approach may be used to implement one or more of fault monitoring, performance monitoring, protection switching, and primitives for neighbor and/or topology discovery); or 2)
embodiments may perform neighbor discovery and topology discovery via both the centralized control plane and the distributed protocols, and the results compared to raise exceptions where they do not agree. Such embodiments are generally considered to fall under the centralized approach 1074, but may also be considered a hybrid approach. [00154] While Figure 10D illustrates the simple case where each of the NDs 1000A-1000H implements a single NE 1070A-1070H, it should be understood that the network control approaches described with reference to Figure 10D also work for networks where one or more of the NDs 1000A-1000H implement multiple VNEs (e.g., VNEs 1030A-1030R, VNEs 1060A- 1060R, those in the hybrid network device 1006). Alternatively or in addition, the network controller 1078 may also emulate the implementation of multiple VNEs in a single ND.
Specifically, instead of (or in addition to) implementing multiple VNEs in a single ND, the network controller 1078 may present the implementation of a VNE/NE in a single ND as multiple VNEs in the virtual networks 1092 (all in the same one of the virtual network(s) 1092, each in different ones of the virtual network(s) 1092, or some combination). For example, the network controller 1078 may cause an ND to implement a single VNE (a NE) in the underlay network, and then logically divide up the resources of that NE within the centralized control plane 1076 to present different VNEs in the virtual network(s) 1092 (where these different VNEs in the overlay networks are sharing the resources of the single VNE/NE implementation on the ND in the underlay network).
[00155] On the other hand, Figures 10E and 10F respectively illustrate exemplary abstractions of NEs and VNEs that the network controller 1078 may present as part of different ones of the virtual networks 1092. Figure 10E illustrates the simple case where each of the NDs 1000A-1000H implements a single NE 1070A-1070H (see Figure 10D), but the centralized control plane 1076 has abstracted multiple of the NEs in different NDs (the NEs 1070A-1070C and 1070G-1070H) into (to represent) a single NE 1070I in one of the virtual network(s) 1092 of Figure 10D, according to some embodiments. Figure 10E shows that in this virtual network, the NE 1070I is coupled to NE 1070D and 1070F, which are both still coupled to NE 1070E.
[00156] Figure 10F illustrates a case where multiple VNEs (VNE 1070A.1 and VNE 1070H.1) are implemented on different NDs (ND 1000A and ND 1000H) and are coupled to each other, and where the centralized control plane 1076 has abstracted these multiple VNEs such that they appear as a single VNE 1070T within one of the virtual networks 1092 of Figure 10D, according to some embodiments. Thus, the abstraction of a NE or VNE can span multiple NDs.
[00157] A network interface (NI) may be physical or virtual; and in the context of IP, an interface address is an IP address assigned to a NI, be it a physical NI or virtual NI. A virtual NI may be associated with a physical NI, with another virtual interface, or stand on its own (e.g., a loopback interface, a point-to-point protocol interface). A NI (physical or virtual) may be numbered (a NI with an IP address) or unnumbered (a NI without an IP address). A loopback interface (and its loopback address) is a specific type of virtual NI (and IP address) of a
NE/VNE (physical or virtual) often used for management purposes; where such an IP address is referred to as the nodal loopback address. The IP address(es) assigned to the NI(s) of a ND are referred to as IP addresses of that ND; at a more granular level, the IP address(es) assigned to NI(s) assigned to a NE/VNE implemented on a ND can be referred to as IP addresses of that NE/VNE.
[00158] Next hop selection by the routing system for a given destination may resolve to one path (that is, a routing protocol may generate one next hop on a shortest path); but if the routing system determines there are multiple viable next hops (that is, the routing protocol generated forwarding solution offers more than one next hop on a shortest path - multiple equal cost next hops), additional criteria are used - for instance, in a connectionless network, Equal Cost Multi Path (ECMP) (also known as Equal Cost Multi Pathing, multipath forwarding and IP multipath) may be used (e.g., typical implementations use as the criteria particular header fields to ensure that the packets of a particular packet flow are always forwarded on the same next hop to preserve packet flow ordering). For purposes of multipath forwarding, a packet flow is defined as a set of packets that share an ordering constraint. As an example, the set of packets in a particular TCP transfer sequence need to arrive in order, else the TCP logic will interpret the out of order delivery as congestion and slow the TCP transfer rate down.
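Purely as an illustrative aid, the following simplified C sketch shows one common way to realize such criteria: hashing a fixed set of header fields so that every packet of a flow maps to the same equal-cost next hop. The chosen field set and the FNV-1a hash are assumptions made for brevity, not requirements of any embodiment.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key; these fields stand in for the "particular
 * header fields" mentioned above and are not a mandated set. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a over a byte range; chosen only for brevity. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *)data;
    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

/* Hash the fields one at a time to avoid touching padding bytes. */
static uint32_t flow_hash(const struct flow_key *k)
{
    uint32_t h = 2166136261u;               /* FNV-1a offset basis */
    h = fnv1a(h, &k->src_ip, sizeof k->src_ip);
    h = fnv1a(h, &k->dst_ip, sizeof k->dst_ip);
    h = fnv1a(h, &k->src_port, sizeof k->src_port);
    h = fnv1a(h, &k->dst_port, sizeof k->dst_port);
    h = fnv1a(h, &k->proto, sizeof k->proto);
    return h;
}

/* Map the flow hash onto one of n equal-cost next hops; packets of
 * the same flow hash identically and so keep their ordering. */
static int ecmp_select(const struct flow_key *k, int n_next_hops)
{
    return (int)(flow_hash(k) % (uint32_t)n_next_hops);
}

Any deterministic hash serves here, since the only property needed is that packets sharing the ordering constraint always select the same next hop.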
[00159] Each VNE (e.g., a virtual router, or a virtual bridge (which may act as a virtual switch instance in a Virtual Private Local Area Network Service (VPLS))) is typically independently administrable. For example, in the case of multiple virtual routers, each of the virtual routers may share system resources but is separate from the other virtual routers regarding its management domain, AAA (authentication, authorization, and accounting) name space, IP address, and routing database(s). Multiple VNEs may be employed in an edge ND to provide direct network access and/or different classes of services for subscribers of service and/or content providers.
[00160] Some NDs provide support for implementing VPNs (Virtual Private Networks) (e.g., Layer 2 VPNs and/or Layer 3 VPNs). For example, the NDs where a provider's network and a customer's network are coupled are respectively referred to as PEs (Provider Edge) and CEs (Customer Edge). In a Layer 2 VPN, forwarding typically is performed on the CE(s) on either end of the VPN and traffic is sent across the network (e.g., through one or more PEs coupled by other NDs). Layer 2 circuits are configured between the CEs and PEs (e.g., an Ethernet port, an ATM permanent virtual circuit (PVC), a Frame Relay PVC). In a Layer 3 VPN, routing typically is performed by the PEs. By way of example, an edge ND that supports multiple VNEs may be deployed as a PE; and a VNE may be configured with a VPN protocol, and thus that VNE is referred to as a VPN VNE. [00161] While the flow diagrams in the figures show a particular order of operations performed by certain embodiments, it should be understood that such order is exemplary (e.g., alternative embodiments may perform the operations in a different order, combine certain operations, overlap certain operations, etc.).
[00162] Additionally, while the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described, and can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting.

Claims

What is claimed is:
1. A method in a packet forwarder implemented by a device (1000) for efficiently reconstructing a latency-bounded forwarding information base (FIB) (100/1034) to reflect a new, changed, or deleted route of a communications network, the method comprising:
determining (905), by the packet forwarder, that an update to the FIB utilized by the packet forwarder to make forwarding decisions is to be performed to reflect the new, changed, or deleted route, wherein the FIB comprises a data structure having a plurality of levels (102), wherein the data structure includes one or more hybrid nodes (104) each acting as a root (106) of a sub-tree (108) of the data structure and including a pointer to a child array (452) of nodes at a next level of the plurality of levels;
updating (910) a control trie (200) to reflect the new, changed, or deleted route by
adding, removing, or modifying at least one external node (204) in the control trie, wherein the control trie includes one or more external nodes each indicating routing information for one or more of a plurality of routes (206) of the network; identifying (915), within a util trie (300), a util node (302) that corresponds to a hybrid node of the FIB serving as a root of a subtree that is to be updated due to the new, changed, or deleted route, wherein the util trie also has the plurality of levels (301) and includes a plurality of util nodes, wherein each of the plurality of util nodes is located at a boundary (304) of one of the plurality of levels, and wherein each of the plurality of util nodes corresponds to one of the one or more hybrid nodes of the FIB;
obtaining (920) a point of harvest (POH) identifier (502/504/506) for each of one or more immediate descendant util nodes (408) of the identified util node within the util trie, wherein each POH identifier identifies, for the corresponding descendant util node, one of the nodes within the control trie that is at a corresponding location (520) within the control trie or is a next node within the control trie beneath the corresponding location (525) within the control trie;
obtaining (925), via a harvesting process (636/710), data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node, wherein the data further includes a POH identifier for each of the one or more nodes that are hybrid nodes; and responsive to determining (930) that the POH identifier of one of the one or more nodes obtained from the harvesting process matches the obtained POH identifier of one of the one or more immediate descendant util nodes, reusing the hybrid node from the child array while reconstructing the child array instead of regenerating the hybrid node and any of its descendant nodes.
2. The method of claim 1, further comprising:
after the identifying of the util node, inserting a dirty node in a dirty util trie at a
corresponding location of the dirty util trie as the location of the identified util node in the util trie, wherein the dirty node comprises a pointer to the identified util node of the util trie; and
at a later point in time, traversing the dirty util trie in a top-down breadth-first manner to identify those of the util nodes needing to have their corresponding child arrays reconstructed.
3. The method of claim 1, wherein the obtaining the POH identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie comprises:
caching, within the identified util node, each of the one or more immediate descendant util nodes,
wherein each of the one or more immediate descendant util nodes stores its
corresponding POH identifier.
4. The method of claim 1, wherein reconstructing the child array includes:
determining that one or more of the hybrid nodes of the child array can be reused; and generating a second child array, including copying each of the one or more of the hybrid nodes that can be reused to the second child array.
5. The method of claim 4, wherein reconstructing the child array further includes:
updating a pointer from the hybrid node corresponding to the identified util node to point to the second child array instead of the child array.
6. The method of claim 4, wherein at least one of the copied one or more of the hybrid nodes is placed at a different index within the second child array compared to its index within the child array, and wherein the relative order of the copied one or more hybrid nodes in the second child array is the same as the relative order of the one or more hybrid nodes within the child array.
7. The method of claim 4, wherein each of the copied one or more hybrid nodes is placed at the same index within the second child array as its index within the child array.
8. The method of claim 1, further comprising:
updating the POH identifier of one or more of the util nodes of the util trie responsive to the update of the control trie.
9. The method of claim 1, wherein:
the control trie stores route information for the plurality of routes and is indexed by a routing prefix of a route;
the control trie further includes one or more split nodes each identifying one or more bit locations of the routing prefix that can be utilized to determine how to traverse the control trie; and
the FIB further includes one or more leaf nodes that collectively store forwarding
information for the plurality of routes.
10. A non-transitory machine-readable storage medium (1018/1048) that provides instructions which, when executed by a processor (1012/1042) of a device, will cause said device to implement a packet forwarder to efficiently reconstruct a latency-bounded forwarding information base (FIB) to reflect a new, changed, or deleted route of a communications network by performing the method of any one of claims 1-9.
11. A computer program product (1018/1048) having computer program logic arranged to put into effect the method of any of claims 1-9.
12. A device (1000), comprising:
one or more processors (1012/1042); and
the non-transitory machine-readable storage medium of claim 10.
13. A device (1000) to implement a packet forwarder to efficiently reconstruct a latency- bounded forwarding information base (FIB) to reflect a new, changed, or deleted route of a communications network, the device comprising:
a module to determine that an update to the FIB utilized by the packet forwarder to make forwarding decisions is to be performed to reflect the new, changed, or deleted route, wherein the FIB comprises a data structure having a plurality of levels, wherein the data structure includes one or more hybrid nodes each acting as a root of a sub-tree of the data structure and including a pointer to a child array of nodes at a next level of the plurality of levels;
a module to update a control trie to reflect the new, changed, or deleted route by adding, removing, or modifying at least one external node in the control trie, wherein the control trie includes one or more external nodes each indicating routing information for one or more of a plurality of routes of the network; a module to identify, within a util trie, a util node that corresponds to a hybrid node of the FIB serving as a root of a subtree that is to be updated due to the new, changed, or deleted route, wherein the util trie also has the plurality of levels and includes a plurality of util nodes, wherein each of the plurality of util nodes is located at a boundary of one of the plurality of levels, and wherein each of the plurality of util nodes corresponds to one of the one or more hybrid nodes of the FIB;
a module to obtain a point of harvest (POH) identifier for each of one or more immediate descendant util nodes of the identified util node within the util trie, wherein each POH identifier identifies, for the corresponding descendant util node, one of the nodes within the control trie that is at a corresponding location within the control trie or is a next node within the control trie beneath the corresponding location within the control trie;
a module to obtain, via a harvesting process, data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node, wherein the data further includes a POH identifier for each of the one or more nodes that are hybrid nodes; and
a module to, responsive to determining that the POH identifier of one of the one or more nodes obtained from the harvesting process matches the obtained POH identifier of one of the one or more immediate descendant util nodes, reuse the hybrid node from the child array while reconstructing the child array instead of regenerating the hybrid node and any of its descendant nodes.
14. A method in a packet forwarder implemented by a device (1000) for efficiently reconstructing a latency-bounded forwarding information base (FIB) (100/1034) to reflect a new, changed, or deleted route of a communications network, the method comprising:
identifying, within a util trie (300), a util node (302) that corresponds to a hybrid node of a FIB that has a child array that is to be updated due to a new, changed, or deleted route, wherein the util trie has a plurality of levels (301) and includes a plurality of util nodes, wherein each of the plurality of util nodes is located at a boundary (304) of one of the plurality of levels, and wherein each of the plurality of util nodes corresponds to one of one or more hybrid nodes of the FIB, wherein the FIB comprises a data structure having a plurality of levels (102) and having one or more hybrid nodes (104) each acting as a root (106) of a sub-tree (108) of the data structure and each including a pointer to a child array (452) of nodes at a next level of the plurality of levels;
caching, using the util trie, one or more immediate descendant util nodes (408) of the identified util node within the identified util node, wherein each of the one or more immediate descendant util nodes includes a point of harvest (POH) identifier (502/504/506) that identifies, for the corresponding util node, one of the nodes within a control trie that is at a corresponding location (520) within the control trie or is a next node within the control trie beneath the corresponding location (525) within the control trie, wherein the control trie includes one or more external nodes each indicating routing information for one or more of a plurality of routes (206) of the network;
obtaining, via a harvesting process (636/710), data including one or more nodes that are to form the child array of the hybrid node corresponding to the identified util node, wherein the data further includes a POH identifier for each of the one or more nodes that are hybrid nodes;
determining that one of the POH identifiers returned via the harvesting process matches one of the POH identifiers of one of the one or more immediate descendant nodes cached within the identified util node; and
responsive to said determining, reusing a hybrid node from an existing version of the child array within the FIB instead of regenerating the hybrid node during a reconstruction of the child array.
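Purely as an illustrative aid, the following simplified C sketch shows one possible shape for the reuse determination recited in claims 1 and 14: the POH identifier reported by the harvesting process for a node is compared against the cached POH identifiers of the identified util node's immediate descendant util nodes, and a match allows the existing hybrid node - and with it the entire sub-tree it roots - to be carried into the reconstructed child array unchanged. All structure layouts and names (hybrid_node, util_node, harvested, resolve_child) are assumptions made for brevity, not the claimed implementation.

#include <stddef.h>
#include <stdint.h>

struct hybrid_node {                        /* FIB node rooting a sub-tree */
    struct hybrid_node *child_array;        /* nodes at the next level */
    size_t n_children;
};

struct util_node {                          /* mirrors one hybrid node */
    uint64_t poh_id;                        /* point of harvest identifier */
    struct util_node *desc;                 /* cached immediate descendants */
    size_t n_desc;
};

struct harvested {                          /* one candidate from harvesting */
    uint64_t poh_id;                        /* POH id reported for this node */
    struct hybrid_node *fresh;              /* newly generated replacement */
};

/* Core reuse test: if the POH identifier returned by the harvesting
 * process matches that of an immediate descendant util node, keep the
 * existing hybrid node (and thus its entire sub-tree) unchanged
 * instead of regenerating it and all of its descendants. */
static struct hybrid_node *
resolve_child(const struct util_node *parent,
              const struct harvested *h,
              struct hybrid_node *existing)
{
    for (size_t i = 0; i < parent->n_desc; i++) {
        if (parent->desc[i].poh_id == h->poh_id)
            return existing;                /* reuse: sub-tree untouched */
    }
    return h->fresh;                        /* rebuild this slot */
}

Under the dirty util trie of claim 2, such a test would be applied slot by slot as the dirty nodes are visited in top-down breadth-first order.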
Legal Events

121 (EP): The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 16766643; country of ref document: EP; kind code of ref document: A1.

NENP: Non-entry into the national phase. Ref country code: DE.

122 (EP): PCT application non-entry in European phase. Ref document number: 16766643; country of ref document: EP; kind code of ref document: A1.