WO2015141014A1 - Method of constructing software-defined pci express (pci-e) switch - Google Patents

Method of constructing software-defined PCI Express (PCI-E) switch

Info

Publication number
WO2015141014A1
WO2015141014A1 PCT/JP2014/058146 JP2014058146W WO2015141014A1 WO 2015141014 A1 WO2015141014 A1 WO 2015141014A1 JP 2014058146 W JP2014058146 W JP 2014058146W WO 2015141014 A1 WO2015141014 A1 WO 2015141014A1
Authority
WO
WIPO (PCT)
Prior art keywords
switch
node
packet
pci
tlp
Prior art date
Application number
PCT/JP2014/058146
Other languages
French (fr)
Inventor
Lei Sun
Takashi Yoshikawa
Masahiko Takahashi
Jun Suzuki
Akira Tsuji
Original Assignee
NEC Corporation
Priority date
Filing date
Publication date
Application filed by NEC Corporation
Priority to PCT/JP2014/058146 (WO2015141014A1)
Priority to JP2016555801A (JP2017511532A)
Publication of WO2015141014A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38: Information transfer, e.g. on bus
    • G06F13/40: Bus structure
    • G06F13/4004: Coupling between buses
    • G06F13/4022: Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network

Definitions

  • FIG. 4 is a block diagram depicting an embodiment of the system architecture of a downstream node. It may consist of a CPU, memory, hard disk and various devices. To explain the system architecture more clearly, the remaining components are omitted here.
  • A downstream node 401 consists of at least memory 402, an encapsulation/decapsulation (ENCAP/DECAP) module 403 and a network interface card (NIC) 404.
  • NIC 404 is a hardware component that connects the downstream node 401 to the network where the SDN box is located.
  • The encapsulation/decapsulation module 403 is in charge of encapsulating PCI-E packets and decapsulating the received encapsulated packets during communication.
  • The connecting method is not limited; it may be either a hardware method, e.g. the PCI-E bus protocol (but not limited to it), or any software method.
  • FIG. 5 is a block diagram depicting an embodiment of the system architecture of a software-defined (SD) switch.
  • An SD switch 501 consists of at least a RecvD module 502, a SendD module 503, a SendC (sending to control plane) module 504, a RecvC (reception from control plane) module 509, a decapsulation (DECAP) module 505, an encapsulation (ENCAP) module 506, a packet buffer (PKT BUFFER) module 507 and a slave transaction layer packet (TLP) routing table operation module 508.
  • The RecvD module 502 is in charge of reception from upstream nodes and downstream nodes.
  • The SendD module 503 is in charge of sending to upstream nodes and downstream nodes.
  • The RecvC module 509 is in charge of reception from the SDSW controller.
  • The SendC module 504 is in charge of sending to the SDSW controller.
  • The decapsulation module 505 decapsulates the encapsulated packets into PCI-E packets.
  • The encapsulation module 506 encapsulates PCI-E packets.
  • The packet buffer module 507 is in charge of buffering packets.
  • The slave TLP routing table operation module 508 can create and insert a new slave TLP routing table item and supports lookup of the slave TLP routing table.
  • FIG. 6 is a block diagram depicting an embodiment of the system architecture of a software-defined switch controller (SDSW controller).
  • An SDSW controller 601 consists of at least a RecvSW module 602, a SendSW module 603, a master TLP routing table operation module 604, and a msg-parser module 605.
  • The RecvSW module 602 is in charge of reception from the SD switch.
  • The SendSW module 603 is in charge of sending to the SD switch.
  • The master TLP routing table operation module 604 can create and insert a new master TLP routing table item and supports lookup of the master TLP routing table.
  • The msg-parser module 605 extracts from the received packets the information illustrated in FIG. 13, which includes: 1) Node ID (the unique ID of a node); 2) Node type (upstream node or downstream node); 3) Destination address of each node (either upstream or downstream); 4) the port number of the SD switch which is connected to the node (either upstream or downstream); 5) VLAN-tag used for the group (only necessary when the underlying network is Ethernet); and 6) TLP routing ID (the unique ID of TLP routing).
  • The SDN box consists of at least one SDSW controller and at least one SD switch.
  • The overall processing is as follows.
  • When an encapsulated packet (whose inner packet is a PCI-E packet) is received by the RecvD module 502 on an SD switch 501, it is decapsulated by the decapsulation module 505 into a PCI-E packet, and the TLP routing ID is extracted from it. Then the slave TLP routing table operation module 508 searches the slave TLP routing table based on the extracted TLP routing ID.
  • If no matching item is found, the packet is buffered at the packet buffer module 507, and a query packet (containing the TLP routing ID, node type, node ID, as well as other related information illustrated in FIG. 13) is sent to the SDSW controller 601 by the query (SendC) module 504.
  • The query is received by the reception (RecvSW) module 602.
  • The master TLP routing table on the SDSW controller 601 is then searched by the master TLP routing table operation module 604 based on the TLP routing ID.
  • If no matching item is found there either, the master TLP routing table operation module 604 on the SDSW controller 601 creates a new table item based on the information extracted by the msg-parser module 605, which includes: 1) Node ID (the unique ID of a node); 2) Node type (upstream node or downstream node); 3) Destination address of each node (either upstream or downstream); 4) the port number of the SD switch which is connected to the node (either upstream or downstream); 5) VLAN-tag used for the group (only necessary when the underlying network is Ethernet); and 6) TLP routing ID (the unique ID of TLP routing) illustrated in FIG. 13. The new item is then inserted into the master TLP routing table.
  • In that case, the sending (SendSW) module 603 on the SDSW controller 601 notifies the SD switch 501 to broadcast the PCI-E-over-Ethernet packet.
  • Otherwise, the sending (SendSW) module 603 on the SDSW controller 601 notifies the SD switch 501 to forward the encapsulated PCI-E packet according to the retrieved destination address.
  • The notification is processed at the RecvC module 509, and the same new table item is inserted into the slave TLP routing table by the slave TLP routing table operation module 508 on the SD switch 501.
  • The PCI-E packet is then encapsulated by the encapsulation module 506 with an encapsulation packet header whose destination address is the address returned by the slave TLP routing table operation module 508.
  • PCI-E packet: used during memory read/write, I/O read/write and config read/write operations.
  • DEVINFO packet: newly defined in this invention. It is used to: 1) start the communication between an upstream node and a downstream node; and 2) let the nodes periodically remind each other, at a certain interval, that they are still alive.
  • The DEVINFO packet format contains at least the following data fields (but is not limited to them): 1) Node ID (the unique ID of a node); 2) Node type (upstream node or downstream node); 3) Source address of the node (either upstream or downstream); and 4) TLP routing ID (the unique ID used in TLP routing).
  • FIG. 7 is a table depicting an embodiment of a possible definition of the packet format of the DEVINFO packet.
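The four DEVINFO fields listed above can be illustrated with a small serialization sketch. FIG. 7 is not reproduced in this text, so the field widths, the byte order, the node-type codes and the names `DevInfo`, `UPSTREAM` and `DOWNSTREAM` are all assumptions made for illustration. The TLP routing ID is assumed to be a bus/device/function triple packed into 16 bits (8/5/3 bits), and the source address is assumed to be a 6-byte Ethernet MAC address.

```python
import struct
from dataclasses import dataclass

UPSTREAM, DOWNSTREAM = 0, 1  # hypothetical node-type codes

# Hypothetical wire layout (network byte order, no padding):
# node ID (u16), node type (u8), source MAC (6 bytes), TLP routing ID (u16).
_DEVINFO = struct.Struct("!HB6sH")


@dataclass(frozen=True)
class DevInfo:
    node_id: int
    node_type: int
    src_mac: bytes
    bus: int
    dev: int
    func: int

    def pack(self) -> bytes:
        # Pack bus/device/function into one 16-bit routing ID (8/5/3 bits).
        routing_id = (self.bus << 8) | (self.dev << 3) | self.func
        return _DEVINFO.pack(self.node_id, self.node_type, self.src_mac, routing_id)

    @classmethod
    def unpack(cls, data: bytes) -> "DevInfo":
        node_id, node_type, src_mac, rid = _DEVINFO.unpack(data[:_DEVINFO.size])
        return cls(node_id, node_type, src_mac, rid >> 8, (rid >> 3) & 0x1F, rid & 0x7)
```

A node would send such a packet once to start communication and then repeat it at the keep-alive interval; the interval itself is not specified in this text.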
  • FIG. 8 is a sequence diagram of memory read/write between an upstream node 801, SDN box 802 and a downstream node 803. I/O read/write and config read/write follow the same sequence, so their diagrams are omitted.
  • The communication process is as follows.
  • When the upstream node 801 accesses memory (e.g. a memory read/write operation) of the downstream node 803, it sends PCI-E packets.
  • The PCI-E packets are encapsulated with a packet header in step 804 by the encapsulation/decapsulation module 303 and sent out from the NIC 304.
  • At the SDN box 802, the encapsulated packets are decapsulated in step 805, and then their inner information is extracted in step 806. Then the decapsulated packets are encapsulated in step 807 and forwarded to the downstream node 803.
  • Communication from the downstream node to the upstream node follows the same steps, so that diagram is omitted.
  • FIG. 9 is a flowchart depicting an embodiment of a method by which an upstream node or a downstream node sends out PCI-E packets.
  • The sending node is either an upstream node or a downstream node.
  • The destination address is carried in the encapsulation packet header.
  • In step 903, the destination address should be a broadcast address.
  • The destination address used is the result of step 903 and step 904.
  • The type of the outer packet header depends on the underlying network. For instance, if the underlying network is Ethernet, an Ethernet packet header is added as the outer packet header and the packet is sent out in step 905.
  • FIG. 10 is a flowchart depicting an embodiment of a method by which an upstream node or a downstream node receives encapsulated packets.
  • When a node receives an encapsulated packet in step 1001, the packet is decapsulated into a PCI-E packet in step 1002.
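The send and receive paths of FIG. 9 and FIG. 10 can be sketched as follows for the Ethernet case. This is a minimal illustration, not the patent's implementation: the EtherType value (0x88B5, one of the IEEE local-experimental EtherTypes), the function names, and the rule that a broadcast address is used when no destination is known are assumptions, since the condition attached to step 903 is not stated in this text.

```python
import struct
from typing import Optional

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ENCAP = 0x88B5  # assumption: IEEE local-experimental EtherType
_ETH_HEADER = struct.Struct("!6s6sH")  # destination MAC, source MAC, EtherType


def encapsulate(tlp: bytes, src_mac: bytes, dst_mac: Optional[bytes] = None) -> bytes:
    """Prepend an outer Ethernet header to a PCI-E packet.

    Assumption: when no destination address is known, the frame is broadcast.
    """
    dst = dst_mac if dst_mac is not None else BROADCAST_MAC
    return _ETH_HEADER.pack(dst, src_mac, ETHERTYPE_ENCAP) + tlp


def decapsulate(frame: bytes) -> Optional[bytes]:
    """Strip the outer header; return None for frames of any other kind."""
    if len(frame) < _ETH_HEADER.size:
        return None
    _dst, _src, ethertype = _ETH_HEADER.unpack_from(frame)
    if ethertype != ETHERTYPE_ENCAP:
        return None
    return frame[_ETH_HEADER.size:]
```

The same pair of functions would serve both node types, because the text describes identical encapsulation/decapsulation modules on upstream and downstream nodes.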
  • FIG. 11 is a flowchart depicting an embodiment of a method by which the SD switch handles encapsulated packets.
  • The process of the SD switch is as follows. If there is no packet from the SDSW controller in step 1101, the packet is checked in step 1104 to see whether it is one of the specific encapsulated packets. If it is not, the packet is processed further in the routine for other packets in step 1105.
  • If it is an encapsulated packet from an upstream node or a downstream node, its TLP routing ID is extracted and the slave TLP routing table is searched based on it in step 1106. If no item is found, the packet is buffered and a query request is sent to the SDSW controller in step 1108. If an item is found, the packet is encapsulated with an encapsulation header, in which the retrieved address is filled in as the destination address, and sent out in step 1107.
  • If there is a packet received from the SDSW controller in step 1101, the carried TLP routing information is extracted in step 1102, and the new table item is added to the slave routing table in step 1103. Then the previously buffered packet is processed further (the packet is encapsulated, the retrieved destination address is filled in, and the packet is sent out) in step 1107.
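The FIG. 11 flow on the SD switch can be sketched as a lookup table with a miss path that buffers the packet and queries the controller. The class and method names and the two callbacks are hypothetical, and the slave TLP routing table is reduced to a mapping from TLP routing ID to destination address; the real table holds the additional columns of FIG. 13.

```python
from collections import defaultdict


class SDSwitch:
    def __init__(self, send_query, send_frame):
        self.slave_table = {}               # TLP routing ID -> destination address
        self.buffered = defaultdict(list)   # TLP routing ID -> packets awaiting a route
        self.send_query = send_query        # hypothetical callback toward the SDSW controller
        self.send_frame = send_frame        # hypothetical callback toward the nodes

    def on_node_packet(self, routing_id, tlp):
        """Steps 1106 to 1108: search the slave table, then forward or buffer and query."""
        dst = self.slave_table.get(routing_id)
        if dst is None:
            self.buffered[routing_id].append(tlp)
            self.send_query(routing_id)
        else:
            self.send_frame(dst, tlp)

    def on_controller_packet(self, routing_id, dst):
        """Steps 1102, 1103 and 1107: install the item, then send buffered packets."""
        self.slave_table[routing_id] = dst
        for tlp in self.buffered.pop(routing_id, []):
            self.send_frame(dst, tlp)
```

Once an item is installed, later packets with the same TLP routing ID are forwarded without involving the controller, so the controller is consulted only on a table miss.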
  • FIG. 12 is a flowchart depicting an embodiment of a method by which the SDSW controller processes the query request from the SD switch, extracts the TLP information, updates the master TLP routing table, and finally notifies the SD switch to update the slave TLP routing table.
  • The process is as follows. If there is a query request packet received from the SD switch in step 1201, the TLP routing information is extracted in step 1202.
  • The master TLP routing table is searched based on the TLP routing information in step 1203. If nothing is found, the new item is added to the master TLP routing table in step 1204.
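The controller side of FIG. 12 amounts to a search of the master table followed by an insert on a miss. The sketch below assumes the query carries the routing information as a dictionary keyed by field name; the class name, the key `tlp_routing_id` and the `(item, created)` return convention are illustrative choices, not part of the source.

```python
class SDSWController:
    def __init__(self):
        self.master_table = {}  # TLP routing ID -> routing item (dict of FIG. 13 fields)

    def on_query(self, info):
        """Steps 1201 to 1204: extract the routing info, search, add the item if missing.

        Returns (item, created) so the caller can decide how to notify the SD switch.
        """
        routing_id = info["tlp_routing_id"]
        item = self.master_table.get(routing_id)
        if item is not None:
            return item, False
        item = dict(info)  # copy so later changes to the query do not alter the table
        self.master_table[routing_id] = item
        return item, True
```

The returned item is what the controller would send back so that the SD switch can insert the same item into its slave TLP routing table.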
  • FIG. 13 is a possible data structure of the TLP routing table, which is used on both the SD switch and the SDSW controller, when the underlying network is Ethernet.
  • The table consists of the following columns: 1) Node ID (the unique ID of a node); 2) Node type (upstream node or downstream node); 3) Destination address of each node (either upstream or downstream); 4) the port number of the SD switch which is connected to the node (either upstream or downstream); 5) VLAN-tag used for the group (only necessary when the underlying network is Ethernet); and 6) TLP routing ID (the unique ID of TLP routing).
  • FIG. 13 shows two groups of nodes, whose VLAN IDs are 1 and 2. The first three items belong to one group, because they share the same VLAN ID 1.
  • The node whose node ID is 1 is an upstream node, whose MAC address is MAC_00. It is connected to the 1st port of the SD switch, and its TLP ID is bus0/dev0/func0.
  • The node whose node ID is 2 is a downstream node, whose MAC address is MAC_01. It is connected to the 2nd port of the SD switch, and its TLP ID is bus1/dev1/func1.
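The six columns of the TLP routing table map naturally onto a record type. The sketch below encodes only the two rows that the text spells out (node IDs 1 and 2); the remaining rows of FIG. 13 are not given here and are left out rather than guessed. It assumes both rows belong to the VLAN ID 1 group, and the type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TlpRoutingItem:
    node_id: int
    node_type: str       # "upstream" or "downstream"
    dest_address: str    # MAC address when the underlying network is Ethernet
    switch_port: int     # SD switch port the node is connected to
    vlan_id: int         # VLAN tag identifying the group
    tlp_routing_id: str  # bus/device/function


# Only the rows described in the text; VLAN ID 1 for both is an assumption.
TABLE = [
    TlpRoutingItem(1, "upstream", "MAC_00", 1, 1, "bus0/dev0/func0"),
    TlpRoutingItem(2, "downstream", "MAC_01", 2, 1, "bus1/dev1/func1"),
]


def lookup(table, tlp_routing_id):
    """Return the item with the given TLP routing ID, or None if absent."""
    return next((i for i in table if i.tlp_routing_id == tlp_routing_id), None)


def group_members(table, vlan_id):
    """Return all items that share the given VLAN ID, i.e. one group."""
    return [i for i in table if i.vlan_id == vlan_id]
```

The same record type could back both the slave table on the SD switch and the master table on the SDSW controller, since the text states that one data structure is used for both.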
  • FIG. 14 is a sequence diagram of the failure detection and hand-over between an upstream node 1401, an SD switch 1402 (e.g. an OpenFlow switch, but not limited to OpenFlow), an SDSW controller 1403 (e.g. an OpenFlow controller, but not limited to OpenFlow) and a downstream node (omitted from the diagram for clarity).
  • The process of failure detection and hand-over is as follows.
  • In step 1404, the link-down network signal is sent to the SD switch 1402.
  • When the SD switch 1402 receives the link-down network signal, it notifies the SDSW controller 1403 in step 1405. For instance, in OpenFlow, the OpenFlow switch sends an OFPPS_LINK_DOWN message to the OpenFlow controller.
  • The SDSW controller 1403 then finds another available upstream node to hand over to in step 1406, modifies the master TLP routing table items of the connected downstream nodes in step 1407, and then notifies the SD switch in step 1408.
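Steps 1406 and 1407 can be sketched as an update of the master TLP routing table. The sketch assumes that the group is identified by the VLAN ID column, that each item carries a hypothetical `alive` flag, and that the first available upstream node is chosen; the source does not specify a selection policy, and the function name `hand_over` is illustrative.

```python
def hand_over(master_table, failed_upstream_id):
    """Reassign the failed upstream node's downstream nodes to another upstream node.

    master_table: list of dicts with keys node_id, node_type, vlan_id, alive.
    Returns the node IDs of the downstream nodes that were reassigned.
    """
    by_id = {item["node_id"]: item for item in master_table}
    failed = by_id[failed_upstream_id]
    failed["alive"] = False

    # Step 1406: find another available upstream node (first match; assumed policy).
    candidates = [
        i for i in master_table
        if i["node_type"] == "upstream" and i["alive"]
    ]
    if not candidates:
        return []
    target = candidates[0]

    # Step 1407: change the group ID of the downstream nodes that were connected.
    moved = []
    for item in master_table:
        if item["node_type"] == "downstream" and item["vlan_id"] == failed["vlan_id"]:
            item["vlan_id"] = target["vlan_id"]
            moved.append(item["node_id"])
    return moved  # step 1408 would notify the SD switch about these items
```

Because detection is driven by the link-down notification rather than by a keep-alive timeout, the hand-over can start as soon as the controller is notified.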

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Bus Control (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention generally relates to the technical field of networking and computer architecture, and more specifically relates to approaches of constructing a system which includes a software-defined network (SDN) box, at least one upstream node and one or more downstream nodes. The nodes may be divided into several groups. In each group, there is a single upstream node and at least one downstream node. All packets between the upstream node and downstream nodes are forwarded by the SDN box. The SDN box acts as a 'PCI-E switch', which interconnects the upstream node to each downstream node and forwards the encapsulated packets (whose inner packets are PCI-E packets). During the whole communication process, the upstream node treats downstream nodes as ordinary local PCI-E devices.

Description

DESCRIPTION
METHOD OF CONSTRUCTING SOFTWARE-DEFINED PCI EXPRESS (PCI-E) SWITCH
TECHNICAL FIELD
The invention generally relates to the technical field of networking and computer architecture, and more specifically relates to approaches of constructing a system which consists of a software-defined network (SDN) box, upstream nodes and downstream nodes.
BACKGROUND ART
PCI-E (Peripheral Component Interconnect-Express) is the third-generation high-performance I/O bus used to interconnect peripheral devices in computer systems. PCI-E employs the same usage model as the previous-generation I/O bus, PCI. It supports familiar transactions such as memory read/write, I/O read/write and configuration read/write transactions. Existing OS and device drivers can run in a PCI-E system without any modifications (for PCI-E in detail, refer to "PCI Express System Architecture" by Ravi Budruk, Don Anderson and Tom Shanley, Addison-Wesley Professional, 2004).
The number of PCI-E slots is usually limited because space in servers is limited. ExpEther (Express Ethernet; see http://www.expether.org/) was proposed to address this problem. ExpEther extends PCI-E over Ethernet. PCI-E packets are encapsulated into PCI-E-over-Ethernet packets at the sender side; they are then forwarded by Ethernet switches, and when they reach the destination, they are decapsulated into PCI-E packets (see Japanese Unexamined Patent Application, First Publication No. 2007-219873 A).
In ExpEther, a certain type of packet is broadcast to act as a keep-alive message, and a timeout value is associated with it. If the timeout expires and no such keep-alive packet has been received, the upstream node is concluded to be out of service. Once a failure is detected, the connected downstream nodes should be handed over to another available upstream node as soon as possible.
DISCLOSURE OF INVENTION
In current ExpEther, it is difficult to achieve fast failure detection, because faster failure detection requires a shorter keep-alive timeout, and a shorter timeout results in more broadcast traffic. More broadcast traffic increases the workload of network devices.
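The trade-off between detection time and broadcast load can be made concrete with a rough calculation. The model below is an assumption for illustration only, not a description of ExpEther's actual timers: every node broadcasts one keep-alive per interval, and a node is declared out of service after a fixed number of consecutive missed keep-alives, so the interval is the timeout divided by that number. The function and parameter names are hypothetical.

```python
def keepalive_broadcast_rate(num_nodes: int, timeout_s: float, missed_allowed: int = 3) -> float:
    """Broadcast keep-alive frames per second under the assumed model.

    interval = timeout / missed_allowed, and each node sends one frame per interval.
    """
    interval_s = timeout_s / missed_allowed
    return num_nodes / interval_s
```

Under this model the broadcast rate is inversely proportional to the timeout, so cutting the detection time by a factor of ten multiplies the keep-alive broadcast traffic by ten.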
The invention is proposed to solve the above fast failure detection and hand-over problem of ExpEther. It provides a method of constructing a system based on PCI-E, Ethernet and SDN. The proposed system consists of upstream nodes, downstream nodes and an SDN box.
In the proposed system of this invention, there is an encapsulation/decapsulation module on both the upstream node and the downstream node. The upstream node (which may consist of a CPU, memory, hard disk and various I/O devices) behaves as a computing system, and the downstream node behaves as a PCI-E device; they are connected by the SDN box. The upstream node accesses the downstream node by PCI-E packets. For both the upstream node and the downstream node, the communication process is as follows.
When sending a PCI-E packet, the node encapsulates it with a specific packet header and sends it out via the network (e.g. Ethernet, but not limited to Ethernet). When receiving an encapsulated packet, the node removes the specific packet header, decapsulating it into a PCI-E packet. The SDN box acts as a 'PCI-E switch', which interconnects the upstream node to the downstream node via the specific network and forwards the encapsulated packets.
In the proposed system of this invention, when the upstream node is out of service, a notification will be delivered to the SDN box. The SDN box can be implemented by OpenFlow (see https://www.opennetworking.org/sdn-resources/onf-specifications/openflow), but is not limited to it. If OpenFlow is used, an OFPPS_LINK_DOWN message will be sent from the OpenFlow switch to the OpenFlow controller when the upstream node is out of service. Moreover, because the PCI-E routing table is maintained on the SDN box, the hand-over can also be achieved by modifying the group ID of the connected downstream nodes.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the foregoing and other exemplary purposes, aspects, and advantages, we use the following detailed description of an exemplary embodiment of the invention with reference to the drawings, in which:
FIG. 1 is a block diagram depicting an embodiment of the system architecture of a computing system with an SDN box, at least one upstream node and at least one downstream node;
FIG. 2A and FIG. 2B are block diagrams depicting an embodiment of the system architecture of one possible implementation based on an SDN box, which consists of at least one software-defined switch (SD switch) and at least one software-defined switch controller (SDSW controller);
FIG. 3 is a block diagram depicting an embodiment of the system architecture of an upstream node;
FIG. 4 is a block diagram depicting an embodiment of the system architecture of a downstream node;
FIG. 5 is a block diagram depicting an embodiment of the system architecture of a software-defined (SD) switch;
FIG. 6 is a block diagram depicting an embodiment of the system architecture of a software-defined switch controller (SDSW controller);
FIG. 7 is a table depicting an embodiment of a possible packet format of the DEVINFO packet;
FIG. 8 is a sequence diagram of the memory read/write between an upstream node, SDN box and a downstream node;
FIG. 9 is a flowchart depicting an embodiment of a method by which an upstream node or a downstream node sends out PCI-E packets;
FIG. 10 is a flowchart depicting an embodiment of a method by which an upstream node or a downstream node receives PCI-E packets;
FIG. 11 is a flowchart depicting an embodiment of a method by which the SD switch handles Ethernet packets;
FIG. 12 is a flowchart depicting an embodiment of a method by which the SDSW controller sets up the master transaction layer packet (TLP) routing table during system initiation and communicates with the SD switch during packet forwarding;
FIG. 13 is a table depicting an embodiment of a possible data structure of the TLP routing table, which is used as the slave TLP routing table on the SD switch and the master TLP routing table on the SDSW controller, when the underlying network is Ethernet; and
FIG. 14 is a sequence diagram of the failure detection and hand-over between upstream nodes, SDN box and downstream nodes.
EMBODIMENTS FOR CARRYING OUT THE INVENTION
In the following description, a preferred embodiment of the invention is described with regard to preferred process steps and data structures.
<System components>
FIG. 1 is a block diagram depicting an embodiment of the system architecture of a computing system with a software-defined network (SDN) box 106, upstream nodes 101 and 104 and downstream nodes 102, 103 and 105. The components are connected by a network (e.g. Ethernet, but not limited to it). Generally there is at least one group in the system. In FIG. 1 there are two groups. One group consists of upstream node 101 and downstream nodes 102 and 103. The other group consists of upstream node 104 and downstream node 105. In each group, there is a single upstream node and at least one downstream node. All packets between the upstream node and downstream nodes are forwarded by the SDN box.
<System architecture of a possible implementation>
FIG. 2A and FIG. 2B are block diagrams depicting an embodiment of system architecture of a possible implementation, where the SDN box consists of at least one software-defined switch (SD switch) and at least one software-defined switch controller (SDSW controller).
In FIG. 2A, the SDN box 206 consists of a SD switch 208 and a SDSW controller 207. There is a communication channel between the SD switch 208 and the SDSW controller 207. The upstream nodes and downstream nodes can be divided into two groups, and each group consists of a single upstream node and at least one downstream node. The upstream node 201 and two downstream nodes 202 and 203 are connected to the first (#1), second (#2) and third (#3) ports of SD switch 208 respectively. The upstream node 204 and the downstream nodes 205 are connected to the fourth (#4) and fifth (#5) ports of SD switch 208 respectively.
In FIG. 2B, the SDN box 206 consists of SD switches 208, 209, 211, and 212 and SDSW controllers 207 and 210. There is a communication channel between SDSW controller 207 and SD switch 208 and 209 respectively. Additionally, there is also a communication channel between the SDSW controller 210 and SD switch 211 and 212 respectively. The upstream nodes and downstream nodes can be divided into two groups, and each group consists of a single upstream node and at least one downstream node. The upstream node 201 and two downstream nodes 202 and 203 belong to one group and the upstream node 204 and the downstream node 205 belong to another group.
In FIG. 2A and FIG. 2B, all SD switches are connected by a specific network, e.g. Ethernet (but not limited to it). The SDSW controller and the SD switches communicate using a specific protocol, e.g. OpenFlow (but not limited to it).
FIG. 3 is a block diagram depicting an embodiment of the system architecture of an upstream node. In general, a computer system consists of a CPU, memory, a hard disk and various I/O devices. To explain the system architecture more clearly, the remaining components are omitted here. An upstream node 301 consists of at least a central processing unit (CPU) 302, an encapsulation/decapsulation (ENCAP/DECAP) module 303 and a network interface card (NIC) 304. CPU 302 is a hardware component that carries out the instructions of a computer program by performing the basic arithmetical, logical, and input/output operations of the system. NIC 304 is a hardware component that connects the upstream node 301 to the network where the SDN box is located. The encapsulation/decapsulation module 303 is in charge of encapsulating PCI-E packets and decapsulating received encapsulated packets during communication. CPU 302 and the encapsulation/decapsulation module 303 are logically connected, as are the encapsulation/decapsulation module 303 and NIC 304. In other words, the connection method is not limited: it may be a hardware method, e.g. the PCI-E bus protocol (but not limited to it), or any software method.
FIG. 4 is a block diagram depicting an embodiment of the system architecture of a downstream node. It may consist of a CPU, memory, a hard disk and various devices. To explain the system architecture more clearly, the remaining components are omitted here. A downstream node 401 consists of at least memory 402, an encapsulation/decapsulation (ENCAP/DECAP) module 403 and a network interface card (NIC) 404. NIC 404 is a hardware component that connects the downstream node 401 to the network where the SDN box is located. The encapsulation/decapsulation module 403 is in charge of encapsulating PCI-E packets and decapsulating received encapsulated packets during communication. The memory 402 and the encapsulation/decapsulation module 403 are logically connected, as are the encapsulation/decapsulation module 403 and NIC 404. In other words, the connection method is not limited: it may be a hardware method, e.g. the PCI-E bus protocol (but not limited to it), or any software method.
<Packet processing inside a SDN box>
FIG. 5 is a block diagram depicting an embodiment of the system architecture of a software-defined (SD) switch. A SD switch 501 consists of at least a RecvD module 502, a SendD module 503, a SendC (sending to control plane) module 504, a RecvC (reception from control plane) module 509, a decapsulation (DECAP) module 505, an encapsulation (ENCAP) module 506, a packet buffer (PKT BUFFER) module 507 and a slave transaction layer packet (TLP) routing table operation module 508. The RecvD module 502 is in charge of reception from upstream nodes and downstream nodes. The SendD module 503 is in charge of sending to upstream nodes and downstream nodes. The RecvC module 509 is in charge of reception from the SDSW controller. The SendC module 504 is in charge of sending to the SDSW controller. The decapsulation module 505 decapsulates encapsulated packets into PCI-E packets. The encapsulation module 506 encapsulates PCI-E packets. The packet buffer module 507 is in charge of buffering packets. The slave TLP routing table operation module 508 can create and insert a new slave TLP routing table item and supports lookups in the slave TLP routing table.
FIG. 6 is a block diagram depicting an embodiment of the system architecture of a software-defined switch controller (SDSW controller). A SDSW controller consists of at least a RecvSW module 602, a SendSW module 603, a master TLP routing table operation module 604, and a msg-parser module 605. The RecvSW module 602 is in charge of reception from the SD switch. The SendSW module 603 is in charge of sending to the SD switch. The master TLP routing table operation module 604 can create and insert a new master TLP routing table item and supports lookups in the master TLP routing table. The msg-parser module 605 extracts from the received packets the information illustrated in FIG. 13, which includes: 1) Node ID (the unique ID of a node); 2) Node type (upstream node or downstream node); 3) Destination address of each node (either upstream or downstream); 4) the port number of the SD switch which is connected to the node (either upstream or downstream); 5) VLAN-tag used for the group (only necessary when the underlying network is Ethernet); and 6) TLP routing ID (the unique ID of TLP routing).
The SDN box consists of at least one SDSW controller and at least one SD switch. The overall processing is as follows. When an encapsulated packet (whose inner packet is a PCI-E packet) is received by the RecvD module 502 on a SD switch 501, it is decapsulated by the decapsulation module 505 into a PCI-E packet, and the TLP routing ID is extracted from it. The slave TLP routing table operation module 508 then searches the slave TLP routing table for the extracted TLP routing ID.
If the TLP routing ID is not found in the slave TLP routing table, the packet is buffered at the packet buffer module 507 and a query packet (containing the TLP routing ID, node type, node ID, as well as other related information illustrated in FIG. 13) is sent to the SDSW controller 601 by the SendC module 504. At the SDSW controller 601, the query is received by the RecvSW module 602. The master TLP routing table operation module 604 then searches the master TLP routing table on the SDSW controller 601 for the TLP routing ID.
If the TLP routing ID is not found in the master TLP routing table either, the master TLP routing table operation module 604 on the SDSW controller 601 creates a new table item based on the information extracted by the msg-parser module 605, which includes: 1) Node ID (the unique ID of a node); 2) Node type (upstream node or downstream node); 3) Destination address of each node (either upstream or downstream); 4) the port number of the SD switch which is connected to the node (either upstream or downstream); 5) VLAN-tag used for the group (only necessary when the underlying network is Ethernet); and 6) TLP routing ID (the unique ID of TLP routing), as illustrated in FIG. 13. The new item is then inserted into the master TLP routing table. The SendSW module 603 on the SDSW controller 601 notifies the SD switch 501 to broadcast the PCI-E-over-Ethernet packet.
If the TLP routing ID is found in the master TLP routing table, the SendSW module 603 on the SDSW controller 601 notifies the SD switch 501 to forward the encapsulated PCI-E packet according to the retrieved destination address. The notification is processed at the RecvC module 509, and the same table item is inserted into the slave TLP routing table by the slave TLP routing table operation module 508 on the SD switch 501.
If the TLP routing ID is found in the slave TLP routing table, the PCI-E packet is encapsulated by the encapsulation module 506 with an encapsulation packet header whose destination address is the address returned by the slave TLP routing table operation module 508.
<DEVINFO packet format>
During the whole communication process between an upstream node and a downstream node, two kinds of packets are used. One kind is the PCI-E packet, which is used for memory read/write, I/O read/write and config read/write operations. The other kind is called the DEVINFO packet, which is newly defined in this invention. It is used to: 1) start the communication between an upstream node and a downstream node; and 2) let the nodes periodically notify each other, at a certain interval, that they are still alive. The DEVINFO packet format contains at least the following data fields (but is not limited to them): 1) Node ID (the unique ID of a node); 2) Node type (upstream node or downstream node); 3) Source address of the node (either upstream or downstream); and 4) TLP routing ID (the unique ID used in TLP routing). FIG. 7 is a table depicting an embodiment of a possible definition of the packet format of the DEVINFO packet.
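As an illustration only, the four DEVINFO fields listed above might be serialized as follows. This is a minimal sketch in Python. The field widths, the field order, the numeric node-type codes and the names DevInfo, UPSTREAM and DOWNSTREAM are assumptions made for the example; the actual layout is the one defined in FIG. 7, which is not reproduced here.

```python
import struct
from dataclasses import dataclass

UPSTREAM, DOWNSTREAM = 0, 1  # hypothetical numeric codes for the node type

# Hypothetical wire layout in network byte order, without padding:
# node ID (2 bytes), node type (1 byte), source MAC (6 bytes),
# TLP routing ID (2 bytes).
_DEVINFO = struct.Struct("!HB6sH")

@dataclass(frozen=True)
class DevInfo:
    node_id: int
    node_type: int
    src_addr: bytes      # 6-byte MAC address, assuming Ethernet
    tlp_routing_id: int  # assumed here to fit in 16 bits

    def pack(self) -> bytes:
        """Serialize the fields into the assumed 11-byte layout."""
        return _DEVINFO.pack(self.node_id, self.node_type,
                             self.src_addr, self.tlp_routing_id)

    @classmethod
    def unpack(cls, data: bytes) -> "DevInfo":
        """Parse the assumed layout from the start of a payload."""
        return cls(*_DEVINFO.unpack(data[:_DEVINFO.size]))
```

A node would send such a payload once to start communication and then repeat it at the keep-alive interval; the interval itself is not specified in the description.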
<Communication between an upstream node and a downstream node>
FIG. 8 is a sequence diagram of the memory read/write between an upstream node 801, SDN box 802 and a downstream node 803 (I/O read/write and config read/write follow the same sequence, so their diagrams are omitted). The communication process is as follows.
(1) When the upstream node 801 accesses the memory of the downstream node 803 (e.g. a memory read/write operation), it sends PCI-E packets. The PCI-E packets are encapsulated with a packet header in step 804 by the encapsulation/decapsulation module 303 and sent out from the NIC 304.
(2) The encapsulated packets arrive at a certain port of a SD switch in the SDN box 802.
(3) The encapsulated packets are decapsulated in step 805, and their inner information is extracted in step 806. The decapsulated packets are then encapsulated again in step 807 and forwarded to the downstream node 803.
(4) When the encapsulated packets are received at the downstream node 803, they are decapsulated into PCI-E packets in step 808.
Communication from the downstream node to the upstream node (e.g. replying with the result of a memory read) follows the same steps; its diagram is omitted.
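As an illustration only, the encapsulation and decapsulation steps of FIG. 8 might look as follows when the underlying network is Ethernet. This is a minimal sketch in Python that treats the PCI-E packet as opaque bytes. The function names and the EtherType value are assumptions: 0x88B5 is an IEEE local experimental EtherType used here as a placeholder, not a value taken from the embodiment.

```python
import struct

ETHERTYPE_PCIE = 0x88B5  # placeholder: IEEE local experimental EtherType
_ETH_HEADER = struct.Struct("!6s6sH")  # destination MAC, source MAC, EtherType

def encapsulate(dst_mac: bytes, src_mac: bytes, tlp: bytes) -> bytes:
    """Prepend an Ethernet header to a raw PCI-E packet (as in steps 804 and 807)."""
    return _ETH_HEADER.pack(dst_mac, src_mac, ETHERTYPE_PCIE) + tlp

def decapsulate(frame: bytes) -> bytes:
    """Strip the Ethernet header and return the inner PCI-E packet
    (as in steps 805 and 808)."""
    _dst, _src, ethertype = _ETH_HEADER.unpack(frame[:_ETH_HEADER.size])
    if ethertype != ETHERTYPE_PCIE:
        raise ValueError("not an encapsulated PCI-E packet")
    return frame[_ETH_HEADER.size:]
```

The SD switch would call decapsulate on arrival, inspect the inner packet, and call encapsulate again with the destination address it retrieved before forwarding.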
FIG. 9 is a flowchart depicting an embodiment of a method by which an upstream node or a downstream node sends out PCI-E packets.
(1) When a node (either an upstream node or a downstream node) wants to send out a packet in step 901, and the packet is a DEVINFO packet in step 902, the destination address (in the encapsulation packet header) should be a broadcast address in step 903.
(2) Otherwise the packet must be a PCI-E packet, and the destination address (in the encapsulation packet header) should be in a predefined format that can be recognized by the SD switch in step 904.
(3) Finally, the packet is encapsulated with the outer packet header, whose destination address is the result of step 903 or step 904, and sent out in step 905. The type of the outer packet header depends on the underlying network. For instance, if the underlying network is Ethernet, an Ethernet packet header is added as the outer packet header.
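As an illustration only, the destination address selection of FIG. 9 might be written as follows for Ethernet. This is a minimal sketch in Python. The description does not define the predefined format, so the one used here (a locally administered MAC address whose last two bytes carry a 16-bit TLP routing ID) and the function names are purely hypothetical.

```python
BROADCAST = b"\xff" * 6  # Ethernet broadcast address

def predefined_destination(tlp_routing_id: int) -> bytes:
    """Hypothetical predefined format: a locally administered MAC address
    (first byte 0x02) whose last two bytes carry the TLP routing ID."""
    return bytes([0x02, 0x00, 0x00, 0x00]) + tlp_routing_id.to_bytes(2, "big")

def destination_for(is_devinfo: bool, tlp_routing_id: int) -> bytes:
    """Pick the outer destination address: broadcast for a DEVINFO packet
    (step 903), the predefined format for a PCI-E packet (step 904)."""
    if is_devinfo:
        return BROADCAST
    return predefined_destination(tlp_routing_id)
```

Whatever format is chosen, the only requirement stated in the description is that the SD switch can recognize it, so that the switch can replace it with the address retrieved from its routing table.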
FIG. 10 is a flowchart depicting an embodiment of a method by which an upstream node or a downstream node receives encapsulated packets.
(1) When a node receives an encapsulated packet in step 1001, the packet is decapsulated into a PCI-E packet in step 1002.
FIG. 11 is a flowchart depicting an embodiment of a method by which the SD switch handles encapsulated packets. The process of the SD switch is as follows.
(1) If the received packet is not from the SDSW controller in step 1101, it is checked in step 1104 to determine whether it is one of the specific encapsulated packets. If it is not, the packet is processed further in the routine for other packets in step 1105.
(2) If it is an encapsulated packet from an upstream node or a downstream node, its TLP routing ID is extracted and the slave TLP routing table is searched for it in step 1106. If it is not found, the packet is buffered and a query request is sent to the SDSW controller in step 1108. If it is found, the packet is encapsulated with an encapsulation header, in which the retrieved address is filled in as the destination address, and sent out in step 1107.
(3) If a packet is received from the SDSW controller in step 1101, the carried TLP routing information is extracted in step 1102, and the new table item is added to the slave TLP routing table in step 1103. The previously buffered packet is then processed further (it is encapsulated, the retrieved destination address is filled in, and it is sent out) in step 1107.
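As an illustration only, the lookup, buffering and query behavior of FIG. 11 might be sketched as follows. This is a minimal sketch in Python, not the embodiment itself: the class and attribute names are assumptions, the slave TLP routing table is reduced to a mapping from TLP routing ID to destination address, and plain lists stand in for the SendC and SendD modules.

```python
from dataclasses import dataclass, field

@dataclass
class SDSwitch:
    slave_table: dict = field(default_factory=dict)  # TLP routing ID -> destination address
    buffered: dict = field(default_factory=dict)     # TLP routing ID -> waiting packets
    queries: list = field(default_factory=list)      # stand-in for the SendC module
    sent: list = field(default_factory=list)         # stand-in for the SendD module

    def on_data_packet(self, routing_id, tlp):
        """Handle a decapsulated packet from an upstream or downstream node."""
        dst = self.slave_table.get(routing_id)        # step 1106
        if dst is None:
            # Miss: buffer the packet and query the controller (step 1108).
            self.buffered.setdefault(routing_id, []).append(tlp)
            self.queries.append(routing_id)
        else:
            # Hit: fill in the retrieved address and send out (step 1107).
            self.sent.append((dst, tlp))

    def on_controller_packet(self, routing_id, dst):
        """Handle a notification from the SDSW controller."""
        self.slave_table[routing_id] = dst            # steps 1102 and 1103
        for tlp in self.buffered.pop(routing_id, []):
            self.sent.append((dst, tlp))              # step 1107
```

In this sketch the first packet for an unknown TLP routing ID waits in the buffer until the controller answers, and later packets for the same ID are forwarded directly from the slave table.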
FIG. 12 is a flowchart depicting an embodiment of a method by which the SDSW controller processes the query request from the SD switch, extracts the TLP routing information, updates the master TLP routing table, and finally notifies the SD switch to update the slave TLP routing table. The process is as follows.
(1) If a query request packet is received from the SD switch in step 1201, the TLP routing information is extracted in step 1202.
(2) The master TLP routing table is searched based on the TLP routing information in step 1203. If nothing is found, the new item is added to the master TLP routing table in step 1204.
(3) Finally, the result of the retrieval is sent out to notify the SD switch in step 1205.
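As an illustration only, the query handling of FIG. 12 might be sketched as follows. This is a minimal sketch in Python: the class name, the method name and the notification kinds "broadcast" and "forward" are assumptions chosen to mirror the two outcomes described for the SDSW controller (broadcast when a new item had to be created, forward when an existing item was retrieved).

```python
from dataclasses import dataclass, field

@dataclass
class SDSWController:
    master_table: dict = field(default_factory=dict)  # TLP routing ID -> table item

    def on_query(self, routing_id, item):
        """Process a query from the SD switch and return the notification
        to send back as a (kind, table item) pair."""
        found = self.master_table.get(routing_id)     # step 1203
        if found is None:
            # Miss: add the item parsed from the query (step 1204) and
            # tell the switch to broadcast the packet.
            self.master_table[routing_id] = item
            return ("broadcast", item)
        # Hit: tell the switch to forward using the retrieved item (step 1205).
        return ("forward", found)
```

The SD switch would insert the returned item into its slave TLP routing table, so that later packets with the same TLP routing ID no longer require a query.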
FIG. 13 shows a possible data structure of the TLP routing table, which is used on both the SD switch and the SDSW controller, when the underlying network is Ethernet. The table consists of the following columns: 1) Node ID (the unique ID of a node); 2) Node type (upstream node or downstream node); 3) Destination address of each node (either upstream or downstream); 4) the port number of the SD switch which is connected to the node (either upstream or downstream); 5) VLAN-tag used for the group (only necessary when the underlying network is Ethernet); and 6) TLP routing ID (the unique ID of TLP routing). FIG. 13 shows two groups of nodes, whose VLAN IDs are 1 and 2. The first three items belong to one group because they share the same VLAN ID 1. The node whose node ID is 1 is an upstream node whose MAC address is MAC_00. It is connected to the 1st port of the SD switch, and its TLP ID is bus0/dev0/func0. The node whose node ID is 2 is a downstream node whose MAC address is MAC_01. It is connected to the 2nd port of the SD switch, and its TLP ID is bus1/dev1/func1.
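As an illustration only, the six columns of the TLP routing table and the two rows described above might be represented as follows. This is a minimal sketch in Python; the type and function names are assumptions, and only the two rows whose contents are stated in the description are included.

```python
from typing import List, NamedTuple, Optional

class RoutingEntry(NamedTuple):
    node_id: int
    node_type: str       # "upstream" or "downstream"
    dest_addr: str       # MAC address when the underlying network is Ethernet
    port: int            # port number of the SD switch
    vlan: int            # VLAN tag identifying the group
    tlp_routing_id: str

TABLE = [
    RoutingEntry(1, "upstream", "MAC_00", 1, 1, "bus0/dev0/func0"),
    RoutingEntry(2, "downstream", "MAC_01", 2, 1, "bus1/dev1/func1"),
]

def lookup(table: List[RoutingEntry], tlp_routing_id: str) -> Optional[RoutingEntry]:
    """Return the entry with the given TLP routing ID, or None if absent."""
    return next((e for e in table if e.tlp_routing_id == tlp_routing_id), None)

def group_members(table: List[RoutingEntry], vlan: int) -> List[int]:
    """Return the node IDs that share the given VLAN tag, i.e. one group."""
    return [e.node_id for e in table if e.vlan == vlan]
```

A lookup that returns None corresponds to the not-found case, in which the SD switch buffers the packet and queries the SDSW controller.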
<Failure detection and hand-over>
FIG. 14 is a sequence diagram of the failure detection and hand-over between an upstream node 1401, a SD switch (e.g. an OpenFlow switch, but not limited to OpenFlow) 1402, a SDSW controller (e.g. an OpenFlow controller, but not limited to OpenFlow) 1403 and a downstream node (omitted for clarity). The process of failure detection and hand-over is as follows.
(1) Once a failure occurs on the upstream node 1401 in step 1404, a link-down network signal is sent to the SD switch 1402.
(2) When the SD switch 1402 receives the link-down network signal, it notifies the SDSW controller 1403 in step 1405. For instance, in OpenFlow, the OpenFlow switch sends an OFPPS_LINK_DOWN message to the OpenFlow controller.
(3) When the SDSW controller 1403 receives the notification from the SD switch 1402, it finds another available upstream node to hand over to in step 1406, modifies the master TLP routing table entries of the connected downstream nodes in step 1407, and then notifies the SD switch in step 1408.
(4) When the SD switch 1402 receives the notification from the SDSW controller 1403, it modifies the slave TLP routing table in step 1409, so that the connected downstream nodes are handed over to an upstream node with a new group ID.
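As an illustration only, the table modifications of steps 1406 to 1409 might be sketched as follows. This is a minimal sketch in Python under several assumptions: table items are plain dictionaries, the VLAN tag serves as the group ID, and the replacement upstream node is simply the first other upstream node in the master table. The description does not state how an available upstream node is selected.

```python
def hand_over(master_table, failed_node_id):
    """Move the downstream nodes of the failed upstream node's group into
    the group of another upstream node; return the modified items."""
    failed = next(e for e in master_table if e["node_id"] == failed_node_id)
    # Step 1406: pick another upstream node (selection policy is hypothetical).
    target = next(e for e in master_table
                  if e["node_type"] == "upstream" and e["node_id"] != failed_node_id)
    old_group, updates = failed["vlan"], []
    for e in master_table:                            # step 1407
        if e["node_type"] == "downstream" and e["vlan"] == old_group:
            e["vlan"] = target["vlan"]
            updates.append(dict(e))
    return updates                                    # sent to the switch in step 1408

def apply_updates(slave_table, updates):
    """Step 1409: apply the controller's modifications to the slave table."""
    by_id = {u["node_id"]: u for u in updates}
    for e in slave_table:
        if e["node_id"] in by_id:
            e["vlan"] = by_id[e["node_id"]]["vlan"]
```

If no other upstream node exists, next() raises StopIteration in this sketch; a real controller would need an explicit policy for that case.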
While preferred embodiments of the invention have been described and illustrated above, it should be understood that these are exemplary of the invention and are not to be considered as limiting. Additions, omissions, substitutions, and other modifications can be made without departing from the scope of the present invention. Accordingly, the invention is not to be considered as being limited by the foregoing description, and is only limited by the scope of the appended claims.

Claims

1. A method of using a software-defined network (SDN) box as a peripheral component interconnect-express (PCI-E) switch over a network including at least one upstream node, at least one downstream node and a SDN box, the method comprising:
interconnecting the upstream node and the downstream node via the network by the SDN box;
sending a PCI-E packet at one side of the upstream and downstream nodes;
encapsulating the PCI-E packet with a specific packet header;
sending to the network the encapsulated packets based on transaction layer packet (TLP) routing identification (ID) carried with inner PCI-E packet by the SDN box;
receiving the encapsulated packet at the other side of the upstream and downstream nodes;
removing the specific packet header from a received packet; and
decapsulating the received packet into the PCI-E packet.
2. The method according to claim 1, wherein:
the upstream node includes a computing unit (CPU), a network interface card (NIC) and an encapsulation/decapsulation module; and
the downstream node includes an I/O device, a network interface card (NIC) and an encapsulation/decapsulation module.
3. The method according to claim 1, wherein the upstream node and the downstream node belong to at least one group including a single upstream node and at least one downstream node.
4. The method according to claim 1, wherein the TLP routing ID is the identification of TLP routing methods in the PCI-E including address routing, ID routing and implicit routing.
5. The method according to claim 1, wherein the SDN box is implemented as the format of a software-defined (SD) switch and a software-defined switch (SDSW) controller corresponding to the SD switch, the SD switch including a slave TLP routing table and the SDSW controller including a master TLP routing table.
6. The method according to claim 5, comprising:
with the SD switch, decapsulating the received packets from the one side of the upstream node and downstream node, extracting the TLP routing ID, retrieving the slave TLP routing table based on the TLP routing ID,
if a destination address is found in the TLP routing table, encapsulating the packet with a packet header, filling the destination address and sending out,
if a destination address is not found in the TLP routing table, buffering the packet and sending out a query to the SDSW controller; and
with the SDSW controller, parsing the query request from the SD switch, extracting TLP routing information, adding the TLP routing information to the master TLP routing table as a new table item and notifying the SD switch to update the slave TLP routing table.
7. The method according to claim 5, wherein the SD switch and the SDSW controller corresponding to the SD switch are connected with a communication channel, by which query and notify messages are transferred, the communication channel being either a remote communication channel including Ethernet or a local communication channel including a UNIX domain socket.
8. The method defined in claim 1, wherein, in the step of encapsulating of PCI-E packets, the destination address is kept in a pre-defined format which can be recognized by the SDN box.
9. The method according to claim 5, wherein, when a failure occurs, a built-in notification of the SDN box triggers modification of the TLP routing tables at the SD switch and the SDSW controller, which achieves hand-over of the downstream node.
PCT/JP2014/058146 2014-03-18 2014-03-18 Method of constructing software-defined pci express (pci-e) switch WO2015141014A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2014/058146 WO2015141014A1 (en) 2014-03-18 2014-03-18 Method of constructing software-defined pci express (pci-e) switch
JP2016555801A JP2017511532A (en) 2014-03-18 2014-03-18 Method for configuring a software defined PCI Express (PCI-E) switch

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2014/058146 WO2015141014A1 (en) 2014-03-18 2014-03-18 Method of constructing software-defined pci express (pci-e) switch

Publications (1)

Publication Number Publication Date
WO2015141014A1 true WO2015141014A1 (en) 2015-09-24

Family

ID=54144014

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2014/058146 WO2015141014A1 (en) 2014-03-18 2014-03-18 Method of constructing software-defined pci express (pci-e) switch

Country Status (2)

Country Link
JP (1) JP2017511532A (en)
WO (1) WO2015141014A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007219873A (en) * 2006-02-17 2007-08-30 Nec Corp Switch and network bridge device
WO2013051386A1 (en) * 2011-10-05 2013-04-11 日本電気株式会社 Load reduction system, and load reduction method
JP2014003392A (en) * 2012-06-15 2014-01-09 Ntt Docomo Inc Control node and communication control method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017112329A1 (en) * 2015-12-26 2017-06-29 Intel Corporation Platform environment control interface tunneling via enhanced serial peripheral interface
US11086812B2 (en) 2015-12-26 2021-08-10 Intel Corporation Platform environment control interface tunneling via enhanced serial peripheral interface
CN108512758A (en) * 2018-03-07 2018-09-07 华为技术有限公司 Message processing method, controller and forwarding unit
WO2019170083A1 (en) * 2018-03-07 2019-09-12 华为技术有限公司 Message processing method, controller, and forwarding device
US11546255B2 (en) 2018-03-07 2023-01-03 Huawei Technologies Co., Ltd. Packet processing method, controller, and forwarding device
US11005782B2 (en) 2019-04-26 2021-05-11 Dell Products L.P. Multi-endpoint adapter/multi-processor packet routing system

Also Published As

Publication number Publication date
JP2017511532A (en) 2017-04-20

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14886499

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016555801

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14886499

Country of ref document: EP

Kind code of ref document: A1