WO2019165355A1 - Technologies for nic port reduction with accelerated switching - Google Patents

Technologies for nic port reduction with accelerated switching

Info

Publication number
WO2019165355A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
computing device
accelerator
virtual
network traffic
Prior art date
Application number
PCT/US2019/019377
Other languages
French (fr)
Inventor
Gerald Rogers
Stephen T. Palermo
Shih-Wei Chien
Namakkal N. Venkatesan
Irene Liew
Deepal Mehta
Rajesh Gadiyar
Original Assignee
Intel Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corporation filed Critical Intel Corporation
Priority to DE112019000965.6T priority Critical patent/DE112019000965T5/en
Priority to CN201980006768.0A priority patent/CN111492628A/en
Publication of WO2019165355A1 publication Critical patent/WO2019165355A1/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00: Data switching networks
    • H04L 12/28: Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L 12/46: Interconnection of networks
    • H04L 12/4633: Interconnection of networks using encapsulation techniques, e.g. tunneling

Definitions

  • Modern computing devices may include general-purpose processor cores as well as a variety of hardware accelerators for performing specialized tasks.
  • Certain computing devices may include one or more field-programmable gate arrays (FPGAs), which may include programmable digital logic resources that may be configured by the end user or system integrator.
  • an FPGA may be used to perform network packet processing tasks instead of using general-purpose compute cores.
  • FIG. 1 is a simplified block diagram of at least one embodiment of a system for network acceleration
  • FIG. 2 is a simplified block diagram of at least one embodiment of a computing device of the system of FIG. 1;
  • FIG. 3 is a simplified block diagram of at least one embodiment of an environment of the computing device of FIGS. 1 and 2;
  • FIG. 4 is a simplified block diagram of at least one embodiment of a virtual switch application function unit of the computing device of FIGS. 1-3;
  • FIG. 5 is a simplified flow diagram of at least one embodiment of a method for network acceleration that may be executed by the computing device of FIGS. 1-4;
  • FIG. 6 is a chart illustrating exemplary test results that may be achieved with the system of FIGS. 1-4.
  • FIG. 7 is a simplified block diagram of a typical system.
  • references in the specification to “one embodiment,” “an embodiment,” “an illustrative embodiment,” etc. indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • items included in a list in the form of “at least one A, B, and C” can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C).
  • items listed in the form of “at least one of A, B, or C” can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C).
  • the disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof.
  • the disclosed embodiments may also be implemented as instructions carried by or stored on a transitory or non-transitory machine-readable (e.g., computer-readable) storage medium, which may be read and executed by one or more processors.
  • a machine-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a machine (e.g., a volatile or non-volatile memory, a media disc, or other media device).
  • a typical system 700 for network processing may require one or more dedicated network cards assigned to each virtual network function.
  • a computing device 702 includes a processor 720 and multiple network interface controllers (NICs) 726.
  • the processor 720 executes multiple virtual network functions (VNFs) 722.
  • Each VNF 722 of the computing device 702 is assigned to one or more dedicated ports in a traditional NIC 726.
  • Each VNF 722 may have direct access to the NIC 726 (or a part of the NIC 726 such as a PCI virtual function) using a hardware interface such as Single-Root I/O Virtualization (SR-IOV).
  • each VNF 722 accesses a dedicated NIC 726 using Intel® VT-d technology 724 provided by the processor 720.
  • each illustrative NIC 726 includes two network ports, and each of those network ports is coupled to a corresponding port 742 of a network switch 704.
  • the computing device 702 occupies eight ports 742 of the switch 704.
  • a system 100 for accelerated networking includes multiple computing devices 102 in communication over a network 104.
  • Each computing device 102 has a processor 120 and an accelerator 128, such as a field-programmable gate array (FPGA) 128. The processor 120 and the accelerator 128 are coupled by a coherent interconnect 124 and a non-coherent interconnect 126.
  • a computing device 102 executes one or more virtual network functions (VNFs) or other virtual machines (VMs).
  • Network traffic associated with the VNFs is processed by a virtual switch (vSwitch) of the accelerator 128.
  • the accelerator 128 includes one or more ports or other physical interfaces that are coupled to a switch 106 of the network 104.
  • Each VNF does not require dedicated ports on the switch 106.
  • the system 100 may execute high throughput, scalable network workloads with reduced top of rack (ToR) switch port consumption as compared to conventional systems that require traditional NICs and dedicated ports for each VNF.
  • each computing device 102 may require fewer NICs, which may reduce cost and power consumption. Additionally, reducing the number of required NICs may overcome server form factor limits on the number of physical NIC expansion cards, chassis space, and/or other physical resources of the computing device 102. Further, flexibility for users or tenants of the system 100 may be improved, because the user is not required to purchase and install a predetermined number of NICs in each server, and performance is not limited to the capacity that those NICs provide. Rather, performance for the system 100 may scale with the overall throughput capability of the particular server platform and network fabric.
  • tests have shown that the disclosed system 100 may achieve performance that is comparable to single-root I/O virtualization (SR-IOV) implementations, without using standard NICs and using fewer switch ports. Additionally, tests have shown that the system 100 may achieve better performance than software-based systems.
  • chart 600 illustrates test results that may be achieved by the system 100 as compared to typical systems.
  • Bar 602 illustrates throughput achieved by a system using SR-IOV/VT-d PCI passthrough virtualization, similar to the system 700 of FIG. 7.
  • Bar 604 illustrates throughput that may be achieved by a system 100 with an FPGA accelerator 128, as disclosed herein.
  • Bar 606 illustrates throughput achieved by a system using the Intel Data Plane Development Kit (DPDK), which is a high-performance software packet processing framework.
  • the SR-IOV system 700 achieves about 40 Gbps
  • the FPGA system 100 achieves about 36.4 Gbps
  • the DPDK system achieves about 15 Gbps.
  • the SR-IOV system 700 achieves about 2.67 times the throughput of the DPDK (software) system, and about 1.1 times the throughput of the FPGA system 100.
  • the FPGA system 100 achieves about 2.4 times the throughput of the DPDK (software) system.
  • Curve 608 illustrates switch ports used by each system. As shown, the SR-IOV system 700 uses four ports, the FPGA system 100 uses two ports, and the DPDK system uses two ports.
  • Curve 610 illustrates processor cores used by each system. As shown, the SR-IOV system 700 uses zero cores, the FPGA system 100 uses zero cores, and the DPDK system uses six cores. Thus, as shown in the chart 600, the FPGA system 100 provides throughput performance comparable to typical SR-IOV systems 700 with reduced NIC port usage and without using additional processor cores.
  • each computing device 102 may be embodied as any type of computation or computer device capable of performing the functions described herein, including, without limitation, a computer, a server, a workstation, a desktop computer, a laptop computer, a notebook computer, a tablet computer, a mobile computing device, a wearable computing device, a network appliance, a web appliance, a distributed computing system, a processor-based system, and/or a consumer electronic device.
  • as shown in FIG. 1, the computing device 102 illustratively includes the processor 120, the accelerator 128, an input/output subsystem 130, a memory 132, a data storage device 134, a communication subsystem 136, and/or other components and devices commonly found in a server or similar computing device.
  • the computing device 102 may include other or additional components, such as those commonly found in a server computer (e.g., various input/output devices), in other embodiments.
  • one or more of the illustrative components may be incorporated in, or otherwise form a portion of, another component.
  • the memory 132, or portions thereof may be incorporated in the processor 120 in some embodiments.
  • the processor 120 may be embodied as any type of processor capable of performing the functions described herein.
  • the processor 120 is a multi-core processor 120 having two processor cores 122.
  • the processor 120 may be embodied as a single or multi-core processor(s), digital signal processor, microcontroller, or other processor or processing/controlling circuit.
  • the memory 132 may be embodied as any type of volatile or non-volatile memory or data storage capable of performing the functions described herein. In operation, the memory 132 may store various data and software used during operation of the computing device 102 such as operating systems, applications, programs, libraries, and drivers.
  • the memory 132 is communicatively coupled to the processor 120 via the I/O subsystem 130, which may be embodied as circuitry and/or components to facilitate input/output operations with the processor 120, the accelerator 128, the memory 132, and other components of the computing device 102.
  • the I/O subsystem 130 may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, sensor hubs, firmware devices, communication links (i.e., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.) and/or other components and subsystems to facilitate the input/output operations.
  • the I/O subsystem 130 may form a portion of a system-on-a-chip (SoC) and be incorporated, along with the processor 120, the memory 132, and other components of the computing device 102, on a single integrated circuit chip.
  • the data storage device 134 may be embodied as any type of device or devices configured for short-term or long-term storage of data such as, for example, memory devices and circuits, memory cards, hard disk drives, solid-state drives, non-volatile flash memory, or other data storage devices.
  • the computing device 102 also includes the communication subsystem 136, which may be embodied as any communication circuit, device, or collection thereof, capable of enabling communications between the computing device 102 and other remote devices over the computer network 104.
  • the communication subsystem 136 may be embodied as or otherwise include a network interface controller (NIC) for sending and/or receiving network data with remote devices.
  • the communication subsystem 136 may be configured to use any one or more communication technology (e.g., wired or wireless communications) and associated protocols (e.g., Ethernet, InfiniBand®, Bluetooth®, Wi-Fi®, WiMAX, 3G, 4G LTE, etc.) to effect such communication.
  • the computing device 102 includes an accelerator 128.
  • the accelerator 128 may be embodied as a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), an artificial intelligence (AI) accelerator, a coprocessor, or other digital logic device capable of performing accelerated network functions.
  • the accelerator 128 is an FPGA included in a multi-chip package with the processor 120.
  • the accelerator 128 may be coupled to the processor 120 via multiple high-speed connection interfaces including the coherent interconnect 124 and one or more non-coherent interconnects 126.
  • the coherent interconnect 124 may be embodied as a high-speed data interconnect capable of maintaining data coherency between a last-level cache of the processor 120, any cache or other local memory of the accelerator 128, and the memory 132.
  • the coherent interconnect 124 may be embodied as an in-die interconnect (IDI), Intel UltraPath Interconnect (UPI), QuickPath Interconnect (QPI), Intel Accelerator Link (IAL), or other coherent interconnect.
  • the non-coherent interconnect 126 may be embodied as a high-speed data interconnect that does not provide data coherency, such as a peripheral bus (e.g., a PCI Express bus), a fabric interconnect such as Intel Omni-Path Architecture, or other non-coherent interconnect. Additionally or alternatively, it should be understood that in some embodiments, the coherent interconnect 124 and/or the non-coherent interconnect 126 may be merged to form an interconnect that is capable of serving both functions. In some embodiments, the computing device 102 may include multiple coherent interconnects 124, multiple non-coherent interconnects 126, and/or multiple merged interconnects.
  • the computing device 102 may further include one or more peripheral devices 138.
  • the peripheral devices 138 may include any number of additional input/output devices, interface devices, and/or other peripheral devices.
  • the peripheral devices 138 may include a touch screen, graphics circuitry, a graphical processing unit (GPU) and/or processor graphics, an audio device, a microphone, a camera, a keyboard, a mouse, a network interface, and/or other input/output devices, interface devices, and/or peripheral devices.
  • the computing devices 102 may be configured to transmit and receive data with each other and/or other devices of the system 100 over the network 104.
  • the network 104 may be embodied as any number of various wired and/or wireless networks.
  • the network 104 may be embodied as, or otherwise include, a wired or wireless local area network (LAN), and/or a wired or wireless wide area network (WAN).
  • the network 104 may include any number of additional devices, such as additional computers, routers, and switches, to facilitate communications among the devices of the system 100.
  • the network 104 is embodied as a local Ethernet network.
  • the network 104 includes an illustrative switch 106, which may be embodied as a top-of-rack (ToR) switch, a middle-of-rack (MoR) switch, or other switch.
  • the network 104 may include multiple switches 106 and other network devices.
  • diagram 200 illustrates one potential embodiment of a computing device 102.
  • the computing device 102 includes a multi-chip package (MCP) 202.
  • the MCP 202 includes the processor 120 and the accelerator 128, as well as the coherent interconnect 124 and the non-coherent interconnect 126.
  • the accelerator 128 is an FPGA, which may be embodied as an integrated circuit including programmable digital logic resources that may be configured after manufacture.
  • the FPGA 128 may include, for example, a configurable array of logic blocks in communication over a configurable data interchange.
  • the computing device 102 further includes the memory 132 and the communication subsystem 136.
  • the FPGA 128 is coupled to the communication subsystem 136 and thus may send and/or receive network data. Additionally, although illustrated in FIG. 2 as discrete components separate from the MCP 202, it should be understood that in some embodiments the memory 132 and/or the communication subsystem 136 may also be incorporated in the MCP 202.
  • the FPGA 128 includes an FPGA interface unit (FIU) 204, which may be embodied as digital logic resources that are configured by a manufacturer, vendor, or other entity associated with the computing device 102.
  • the FIU 204 implements the interface protocols and manageability for links between the processor 120 and the FPGA 128.
  • the FIU 204 may also provide platform capabilities, such as Intel Virtualization Technology for directed I/O (Intel VT-d), security, error monitoring, performance monitoring, power and thermal management, partial reconfiguration, etc.
  • the FIU 204 further includes an UltraPath Interconnect (UPI) block 206 coupled to the coherent interconnect 124 and a PCI Express (PCIe) block 208 coupled to the non-coherent interconnect 126.
  • the UPI block 206 and the PCIe block 208 may be embodied as digital logic configured to transport data between the FPGA 128 and the processor 120 over the physical interconnects 124, 126, respectively.
  • the physical coherent UPI block 206 and the physical non-coherent block 208 and their associated interconnects may be multiplexed as a set of virtual channels (VCs) connected to a VC steering block.
  • the FPGA 128 further includes one or more accelerated function units (AFUs) 210.
  • Each AFU 210 may be embodied as digital logic configured to perform one or more accelerated networking functions.
  • each AFU 210 may be embodied as smart NIC logic, smart vSwitch logic, or other logic that performs one or more network workloads (e.g., user-designed custom data path logic such as forwarding, classification, packet steering, encapsulation, security, quality-of-service, etc.).
  • each AFU 210 may be configured by a user of the computing device 102.
  • Each AFU 210 may access data in the memory 132 using one or more virtual channels (VCs) that are backed by the coherent interconnect 124 and/or the non-coherent interconnect 126 using the FIU 204.
  • the accelerator 128 may be embodied as an ASIC, coprocessor, or other accelerator 128 that also includes one or more AFUs 210 to provide accelerated networking functions.
  • the AFUs 210 may be fixed-function or otherwise not user configurable.
  • in an illustrative embodiment, the computing device 102 establishes an environment 300 during operation.
  • the illustrative environment 300 includes one or more virtual network functions (VNFs) 302, a virtual machine monitor (VMM) 304, a virtual I/O block 306, a vSwitch 308, and physical interfaces 310.
  • the various components of the environment 300 may be embodied as hardware, firmware, software, or a combination thereof.
  • one or more of the components of the environment 300 may be embodied as circuitry or collection of electrical devices (e.g., VNF circuitry 302, VMM circuitry 304, virtual I/O block circuitry 306, vSwitch circuitry 308, and/or physical interface circuitry 310).
  • one or more of the VNF circuitry 302, the VMM circuitry 304, the virtual I/O block circuitry 306, the vSwitch circuitry 308, and/or the physical interface circuitry 310 may form a portion of the processor 120, the accelerator 128, the I/O subsystem 130, and/or other components of the computing device 102. Additionally, in some embodiments, one or more of the illustrative components may form a portion of another component and/or one or more of the illustrative components may be independent of one another.
  • the VMM 304 may be embodied as any virtual machine monitor, hypervisor, or other component that allows virtualized workloads to be executed on the computing device 102.
  • the VMM 304 may have complete control over the computing device 102, for example by executing in a non-virtualized host mode, such as ring level 0 and/or VMX-root mode.
  • Each VNF 302 may be embodied as any guest virtual machine, guest operating system, or other guest software configured to perform a virtualized workload on the computing device 102.
  • each VNF 302 may be embodied as a virtual network function (VNF) or other network workload (e.g., user-designed custom data path logic such as forwarding, classification, packet steering, encapsulation, security, quality-of-service, etc.).
  • the VMM 304 may enforce isolation between the VNFs 302 and otherwise enforce platform security.
  • the computing device 102 may host guests executed by multiple users or other tenants.
  • the VNFs 302 and the VMM 304 are executed by the processor 120.
  • the VMM 304 may be configured to configure the accelerator 128 with the vSwitch 308.
  • the virtual I/O block 306 may be embodied as one or more I/O ports, queues, or other I/O interfaces that may be accessed by the VNFs 302.
  • the virtual I/O block 306 may be coupled with, embodied as, or otherwise include one or more paravirtualized drivers, which may provide high performance I/O for the VNFs 302.
  • the virtual I/O block 306 may be embodied as one or more virtio queues, drivers, and/or other associated components.
  • the VMM 304 may be further configured to couple each VNF 302 to a paravirtualization interface provided by the virtual I/O block 306.
  • Each physical interface 310 may be embodied as an Ethernet PHY, MAC, or other physical interface. Each physical interface 310 is coupled to a port 312 of an external switch 106 (e.g., a ToR switch) by a network link.
  • the network link may include one or more communications lanes, wires, backplane, optical links, communication channels, and/or other communication components.
  • the vSwitch 308 is configured to process network traffic associated with the VNFs 302.
  • the network traffic may be accessed by the accelerator 128 and/or the processor 120 via the virtual I/O block 306.
  • the network traffic may be processed within a coherency domain shared by the accelerator 128 and the processor 120.
  • the network traffic may be communicated between the processor 120 and the accelerator 128 via the coherent interconnect 124.
  • the vSwitch 308 may forward network traffic from the VNFs 302 to the switch 106 and/or from the switch 106 to the VNFs 302 via a physical interface 310 and the corresponding port 312.
  • the vSwitch 308 may also forward network traffic between multiple VNFs 302.
  • the illustrated accelerator 128 may include the virtual I/O block 306, the vSwitch 308, and/or the physical interface 310 as described above. Those components of the accelerator 128 may be embodied as, for example, an AFU 210 of the accelerator 128. As shown, the accelerator 128 includes a full packet processing pipeline.
  • the illustrative accelerator 128 includes a gDMA block 402, a pipeline configuration block 404, an Open vSwitch (OVS) virtio handler block 406, an I/O configuration block 408, a retimer card 410, a 10 Gb MAC block 412, an ingress rate limiter 414, a tunneling block 416 with VxLAN de-encapsulation block 418 and Network Virtualization using Generic Routing Encapsulation (NVGRE) de-encapsulation block 420, an OpenFlow classifier 422, a forwarding information base (FIB) and action block 424 with exact match block 426, megaflow block 428, and OpenFlow action block 430, a packet infrastructure 432, a tunneling block 434 with VxLAN encapsulation block 436 and NVGRE encapsulation block 438, a crossbar switch 440, and an egress QoS/traffic shaping block 442 (a simplified software sketch of this pipeline ordering appears after the numbered examples below).
  • the physical interface 310 may be embodied as the MAC 412 and/or retimer 410. As described above, those components are coupled to a port 312 of an external switch 106 via a network link.
  • the virtual I/O block 306 may be embodied as the gDMA 402 and/or the virtio handler 406.
  • the vSwitch 308 may be embodied as the remaining components of the accelerator 128.
  • the illustrative accelerator 128 may receive incoming network traffic via the retimer 410 and MAC 412 and provide that data to the tunnels 416.
  • the accelerator may receive network traffic generated by the VNFs 302 via the gDMA 402, virtio handler 406, and ingress rate limiter 414 and provide that data to the tunnels 416.
  • the accelerator 128 processes the network traffic using the tunnels 416, the packet infrastructure 432 including the OF classifier 422 and FIB/action tables 424, and the tunnels 434. After processing, the network traffic is provided to the crossbar switch 440 and the egress QoS/traffic shaping 442. Network traffic destined for the switch 106 is provided to the MAC 412 for egress. Network traffic destined for the VNFs 302 is provided to the virtio handler 406.
  • the computing device 102 may execute a method 500 for accelerated network processing. It should be appreciated that, in some embodiments, the operations of the method 500 may be performed by one or more components of the environment 300 of the computing device 102 as shown in FIG. 3.
  • the method 500 begins in block 502, in which the computing device 102 configures an AFU 210 of the accelerator 128 for vSwitch 308 operations.
  • the computing device 102 may, for example, perform configuration or partial configuration of the FPGA 128 with bitstream or other code for the vSwitch 308 functions.
  • the computing device 102 may also configure network routing rules, flow rules, actions, QoS, and other network configuration of the vSwitch 308.
  • the computing device 102 binds one or more ports or other physical interfaces 310 of the accelerator 128 to an external switch 106.
  • the computing device 102 may bind one or more MACs, PHYs, or other Ethernet ports of the accelerator 128 to corresponding port(s) of the external switch 106.
  • the ports of the accelerator 128 may be embodied as fixed-function hardware ports or reconfigurable “soft” ports. In some embodiments, the ports of the accelerator 128 may be pre-configured or otherwise provided by a manufacturer or other entity associated with the accelerator 128.
  • the computing device 102 configures a VNF 302 for network processing.
  • the computing device 102 may, for example, load the VNF 302 or otherwise initialize the VNF 302.
  • the VNF 302 may be provided by a tenant or other user of the computing device 102.
  • the computing device 102 binds the VNF 302 to a virtio queue of the accelerator 128 or other paravirtualized interface of the accelerator 128.
  • the computing device 102 may, for example, configure the VNF 302 with one or more paravirtualized drivers, queues, buffers, or other interfaces to the accelerator 128.
  • the computing device 102 determines whether additional VNFs 302 remain to be configured; if so, the method 500 loops back to block 506 to load additional VNFs 302.
  • Each additional VNF 302 may be bound to one or more dedicated virtio queues or other interfaces of the accelerator 128.
  • the accelerator 128 need not be bound to additional ports of the switch 106. Referring back to block 510, if no additional VNFs 302 remain to be configured, the method 500 advances to block 512.
  • the computing device 102 processes network workloads with the VNFs 302 and processes network traffic with the virtual switch 308 of the accelerator 128.
  • Each of the VNFs 302 may generate and/or receive network traffic (e.g., packet frames).
  • each VNF 302 may read and/or write network packet data into buffers in system memory 132 corresponding to virtual I/O queues.
  • the vSwitch 308 may perform full packet processing pipeline operations on that network traffic data.
  • the computing device 102 processes the network traffic using the VNFs 302 and the vSwitch 308 in the same coherency domain.
  • the computing device 102 may transfer data (e.g., virtio queue data) between the processor 120 and the accelerator 128 via the coherent interconnect 124.
  • the VNFs 302 and the vSwitch 308 may process network data concurrently, simultaneously, or otherwise, with the coherent interconnect 124 providing data coherency between the last-level cache of the processor 120, cache or other local memory of the accelerator 128, and the memory 132. In some embodiments, full packet frames may be transferred between the processor 120 and the accelerator 128, so that multiple switching actions can happen simultaneously.
  • the method 500 loops back to block 512 to continue processing network traffic with the VNFs 302 and the vSwitch 308.
  • the computing device 102 may dynamically load and/or unload VNFs 302 during execution of the method 500.
  • the method 500 may be embodied as various instructions stored on computer-readable media, which may be executed by the processor 120, the accelerator 128, and/or other components of the computing device 102 to cause the computing device 102 to perform the method 500.
  • the computer-readable media may be embodied as any type of media capable of being read by the computing device 102 including, but not limited to, the memory 132, the data storage device 134, firmware devices, other memory or data storage devices of the computing device 102, portable media readable by a peripheral device 138 of the computing device 102, and/or other media.
  • An embodiment of the technologies may include any one or more, and any combination of, the examples described below.
  • Example 1 includes a computing device for accelerated network processing, the computing device comprising: an accelerator to couple a first network port of a virtual switch of the accelerator with a second network port of a network switch via a network link; and a processor to execute a plurality of virtual network functions in response to coupling of the first network port with the second network port; wherein the virtual switch is to process network traffic associated with the plurality of virtual network functions in response to execution of the plurality of virtual network functions.
  • Example 2 includes the subject matter of Example 1, and further comprising a virtual machine monitor to configure the accelerator with the virtual switch.
  • Example 3 includes the subject matter of any of Examples 1 and 2, and wherein the accelerator comprises a field-programmable gate array and wherein the virtual switch comprises an application function unit of the field-programmable gate array.
  • Example 4 includes the subject matter of any of Examples 1-3, and wherein to process the network traffic comprises to process the network traffic within a coherency domain shared by the accelerator and the processor.
  • Example 5 includes the subject matter of any of Examples 1-4, and further comprising: a coherent interconnect that couples the processor and the accelerator; wherein to process the network traffic comprises to communicate the network traffic between the processor and the accelerator via the coherent interconnect.
  • Example 6 includes the subject matter of any of Examples 1-5, and further comprising: a virtual machine monitor to couple each of the virtual network functions to a paravirtualization interface of the accelerator; wherein to process the network traffic comprises to process network traffic associated with the paravirtualization interface.
  • Example 7 includes the subject matter of any of Examples 1-6, and wherein to process the network traffic comprises to forward network traffic from the plurality of network functions to the network switch via the first network port and the second network port.
  • Example 8 includes the subject matter of any of Examples 1-7, and wherein to process the network traffic comprises to forward network traffic received from the network switch via the first network port and the second network port to the plurality of network functions.
  • Example 9 includes the subject matter of any of Examples 1-8, and wherein to process the network traffic comprises to forward network traffic between a first virtual network function and a second virtual network function.
  • Example 10 includes the subject matter of any of Examples 1-9, and wherein each of the virtual network functions comprises a virtual machine.
  • Example 11 includes the subject matter of any of Examples 1-10, and wherein the accelerator comprises an application-specific integrated circuit.
  • Example 12 includes the subject matter of any of Examples 1-11, and wherein the processor and the accelerator are included in a multi-chip package of the computing device.
  • Example 13 includes a method for accelerated network processing, the method comprising: coupling, by a computing device, a first network port of a virtual switch of an accelerator of the computing device with a second network port of a network switch via a network link; executing, by the computing device, a plurality of virtual network functions with a processor of the computing device in response to coupling the first network port with the second network port; and processing, by the computing device with the virtual switch of the accelerator, network traffic associated with the plurality of virtual network functions in response to executing the plurality of virtual network functions.
  • Example 14 includes the subject matter of Example 13, and further comprising configuring, by the computing device, the accelerator with the virtual switch.
  • Example 15 includes the subject matter of any of Examples 13 and 14, and wherein the accelerator comprises a field-programmable gate array and wherein the virtual switch comprises an application function unit of the field-programmable gate array.
  • Example 16 includes the subject matter of any of Examples 13-15, and wherein processing the network traffic comprises processing the network traffic within a coherency domain shared by the accelerator and the processor.
  • Example 17 includes the subject matter of any of Examples 13-16, and wherein processing the network traffic comprises communicating the network traffic between the processor and the accelerator via a coherent interconnect of the computing device.
  • Example 18 includes the subject matter of any of Examples 13-17, and further comprising: coupling, by the computing device, each of the virtual network functions to a paravirtualization interface of the accelerator; wherein processing the network traffic comprises processing network traffic associated with the paravirtualization interface.
  • Example 19 includes the subject matter of any of Examples 13-18, and wherein processing the network traffic comprises forwarding network traffic from the plurality of network functions to the network switch via the first network port and the second network port.
  • Example 20 includes the subject matter of any of Examples 13-19, and wherein processing the network traffic comprises forwarding network traffic received from the network switch via the first network port and the second network port to the plurality of network functions.
  • Example 21 includes the subject matter of any of Examples 13-20, and wherein processing the network traffic comprises forwarding network traffic between a first virtual network function and a second virtual network function.
  • Example 22 includes the subject matter of any of Examples 13-21, and wherein each of the virtual network functions comprises a virtual machine.
  • Example 23 includes the subject matter of any of Examples 13-22, and wherein the accelerator comprises an application-specific integrated circuit.
  • Example 24 includes the subject matter of any of Examples 13-23, and wherein the processor and the accelerator are included in a multi-chip package of the computing device.
  • Example 25 includes a computing device comprising: a processor; and a memory having stored therein a plurality of instructions that when executed by the processor cause the computing device to perform the method of any of Examples 13-24.
  • Example 26 includes one or more non-transitory, computer readable storage media comprising a plurality of instructions stored thereon that in response to being executed result in a computing device performing the method of any of Examples 13-24.
  • Example 27 includes a computing device comprising means for performing the method of any of Examples 13-24.
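
As referenced in the FIG. 4 bullet above, the ordering of the accelerator's packet-processing stages can be pictured with a short software sketch. This is a rough illustration only, assuming a dict-based packet model: the stage bodies are placeholders and do not reflect the accelerator's actual FPGA logic.

```python
# Rough, illustrative ordering of the FIG. 4 pipeline stages: ingress ->
# tunnel de-encapsulation -> OpenFlow classification -> FIB lookup/actions ->
# tunnel encapsulation -> crossbar -> egress QoS. The dict-based packet and
# placeholder stage bodies are assumptions for illustration only.

def rate_limit(pkt):   return pkt                          # block 414 (VNF ingress only)
def tunnel_decap(pkt): return {**pkt, "outer": None}       # blocks 416/418/420 (VxLAN/NVGRE)
def classify(pkt):     return ("exact-match", pkt["dst"])  # OpenFlow classifier, block 422
def fib_actions(pkt, flow):                                # FIB/action tables, blocks 424-430
    out = "MAC 412" if flow[1] == "remote" else "virtio handler 406"
    return {**pkt, "out": out}
def tunnel_encap(pkt): return pkt                          # blocks 434/436/438
def crossbar(pkt):     return pkt                          # block 440
def egress_qos(pkt):   return pkt                          # block 442

def pipeline(pkt, from_physical_port):
    if not from_physical_port:      # traffic from the VNFs enters via gDMA/virtio
        pkt = rate_limit(pkt)
    pkt = tunnel_decap(pkt)
    pkt = fib_actions(pkt, classify(pkt))
    return egress_qos(crossbar(tunnel_encap(pkt)))

print(pipeline({"dst": "remote", "outer": "vxlan"}, from_physical_port=False))
# {'dst': 'remote', 'outer': None, 'out': 'MAC 412'}
```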

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multi Processors (AREA)

Abstract

Technologies for accelerated network processing include a computing device having a processor and an accelerator. The accelerator may be a field-programmable gate array (FPGA). The accelerator includes a virtual switch and a network port, such as an Ethernet physical interface. The network port of the accelerator is coupled to a network port of an external switch. The processor executes multiple virtual network functions, and the virtual switch processes network traffic associated with the virtual network functions. For example, the virtual switch may forward traffic generated by the virtual network functions to the switch via the port of the accelerator and the port of the switch. Each virtual network function may be coupled to a paravirtualization interface of the accelerator, such as a virtual I/O queue. The network traffic may be processed within a coherency domain shared by the processor and the accelerator. Other embodiments are described and claimed.

Description

TECHNOLOGIES FOR NIC PORT REDUCTION WITH ACCELERATED SWITCHING
CROSS-REFERENCE TO RELATED APPLICATION
[0001] The present application claims the benefit of U.S. Provisional Patent Application
No. 62/634,874, filed February 25, 2018.
BACKGROUND
[0002] Modern computing devices may include general-purpose processor cores as well as a variety of hardware accelerators for performing specialized tasks. Certain computing devices may include one or more field-programmable gate arrays (FPGAs), which may include programmable digital logic resources that may be configured by the end user or system integrator. In some computing devices, an FPGA may be used to perform network packet processing tasks instead of using general-purpose compute cores.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] The concepts described herein are illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. Where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements.
[0004] FIG. 1 is a simplified block diagram of at least one embodiment of a system for network acceleration;
[0005] FIG. 2 is a simplified block diagram of at least one embodiment of a computing device of the system of FIG. 1;
[0006] FIG. 3 is a simplified block diagram of at least one embodiment of an environment of the computing device of FIGS. 1 and 2;
[0007] FIG. 4 is a simplified block diagram of at least one embodiment of a virtual switch application function unit of the computing device of FIGS. 1-3;
[0008] FIG. 5 is a simplified flow diagram of at least one embodiment of a method for network acceleration that may be executed by the computing device of FIGS. 1-4;
[0009] FIG. 6 is a chart illustrating exemplary test results that may be achieved with the system of FIGS. 1-4; and
[0010] FIG. 7 is a simplified block diagram of a typical system.

DETAILED DESCRIPTION OF THE DRAWINGS
[0011] While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will be described herein in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives consistent with the present disclosure and the appended claims.
[0012] References in the specification to “one embodiment,” “an embodiment,” “an illustrative embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described. Additionally, it should be appreciated that items included in a list in the form of “at least one A, B, and C” can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C). Similarly, items listed in the form of “at least one of A, B, or C” can mean (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C).
[0013] The disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on a transitory or non-transitory machine-readable (e.g., computer-readable) storage medium, which may be read and executed by one or more processors. A machine-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a machine (e.g., a volatile or non-volatile memory, a media disc, or other media device).
[0014] In the drawings, some structural or method features may be shown in specific arrangements and/or orderings. However, it should be appreciated that such specific arrangements and/or orderings may not be required. Rather, in some embodiments, such features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of a structural or method feature in a particular figure is not meant to imply that such feature is required in all embodiments and, in some embodiments, may not be included or may be combined with other features.

[0015] Referring now to FIG. 7, a typical system 700 for network processing may require one or more dedicated network cards assigned to each virtual network function. In the illustrative system 700, a computing device 702 includes a processor 720 and multiple network interface controllers (NICs) 726. The processor 720 executes multiple virtual network functions (VNFs) 722. Each VNF 722 of the computing device 702 is assigned to one or more dedicated ports in a traditional NIC 726. Each VNF 722 may have direct access to the NIC 726 (or a part of the NIC 726 such as a PCI virtual function) using a hardware interface such as Single-Root I/O Virtualization (SR-IOV). For example, in the illustrative system, each VNF 722 accesses a dedicated NIC 726 using Intel® VT-d technology 724 provided by the processor 720. As shown, each illustrative NIC 726 includes two network ports, and each of those network ports is coupled to a corresponding port 742 of a network switch 704. Thus, in the illustrative system 700, to execute four VNFs 722 the computing device 702 occupies eight ports 742 of the switch 704.
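For illustration only, this switch-port arithmetic can be written as a small sketch; the figures used (four VNFs, two ports per dedicated NIC, and two shared accelerator ports for the accelerated case, per FIG. 6) are taken from the illustrative systems and are assumptions rather than requirements.

```python
# Illustrative ToR switch-port arithmetic only; the counts (2 ports per
# dedicated NIC, 2 shared accelerator ports) mirror the illustrative
# systems of FIG. 7 and FIG. 6 and are not fixed requirements.

def tor_ports_sriov(num_vnfs: int, ports_per_nic: int = 2) -> int:
    """Dedicated-NIC model: switch ports scale with the number of VNFs."""
    return num_vnfs * ports_per_nic

def tor_ports_accelerated(num_vnfs: int, accelerator_ports: int = 2) -> int:
    """Accelerated vSwitch model: VNFs share the accelerator's physical ports."""
    return accelerator_ports

for vnfs in (4, 8, 16):
    print(vnfs, "VNFs:", tor_ports_sriov(vnfs), "ports (SR-IOV) vs",
          tor_ports_accelerated(vnfs), "ports (accelerated vSwitch)")
# 4 VNFs: 8 ports (SR-IOV) vs 2 ports (accelerated vSwitch)
```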
[0016] Referring now to FIG. 1, a system 100 for accelerated networking includes multiple computing devices 102 in communication over a network 104. Each computing device
102 has a processor 120 and an accelerator 128, such as a field-programmable gate array
(FPGA) 128. The processor 120 and the accelerator 128 are coupled by a coherent interconnect
124 and a non-coherent interconnect 126. In use, as described below, a computing device 102 executes one or more virtual network functions (VNFs) or other virtual machines (VMs).
Network traffic associated with the VNFs is processed by a virtual switch (vSwitch) of the accelerator 128. The accelerator 128 includes one or more ports or other physical interfaces that are coupled to a switch 106 of the network 104. Each VNF does not require dedicated ports on the switch 106. Thus, the system 100 may execute high throughput, scalable network workloads with reduced top of rack (ToR) switch port consumption as compared to conventional systems that require traditional NICs and dedicated ports for each VNF.
Accordingly, each computing device 102 may require fewer NICs, which may reduce cost and power consumption. Additionally, reducing the number of required NICs may overcome server form factor limits on the number of physical NIC expansion cards, chassis space, and/or other physical resources of the computing device 102. Further, flexibility for users or tenants of the system 100 may be improved, because the user is not required to purchase and install a predetermined number of NICs in each server, and performance is not limited to the capacity that those NICs provide. Rather, performance for the system 100 may scale with the overall throughput capability of the particular server platform and network fabric. Additionally, and unexpectedly, tests have shown that the disclosed system 100 may achieve performance that is comparable to single-root I/O virtualization (SR-IOV) implementations, without using standard NICs and using fewer switch ports. Additionally, tests have shown that the system 100 may achieve better performance than software-based systems.
[0017] Referring now to FIG. 6, chart 600 illustrates test results that may be achieved by the system 100 as compared to typical systems. Bar 602 illustrates throughput achieved by a system using SR-IOV/VT-d PCI passthrough virtualization, similar to the system 700 of FIG. 7. Bar 604 illustrates throughput that may be achieved by a system 100 with an FPGA accelerator 128, as disclosed herein. Bar 606 illustrates throughput achieved by a system using the Intel Data Plane Development Kit (DPDK), which is a high-performance software packet processing framework. As shown, the SR-IOV system 700 achieves about 40 Gbps, the FPGA system 100 achieves about 36.4 Gbps, and the DPDK system achieves about 15 Gbps. The SR-IOV system 700 achieves about 2.67 times the throughput of the DPDK (software) system, and about 1.1 times the throughput of the FPGA system 100. The FPGA system 100 achieves about 2.4 times the throughput of the DPDK (software) system. Curve 608 illustrates switch ports used by each system. As shown, the SR-IOV system 700 uses four ports, the FPGA system 100 uses two ports, and the DPDK system uses two ports. Curve 610 illustrates processor cores used by each system. As shown, the SR-IOV system 700 uses zero cores, the FPGA system 100 uses zero cores, and the DPDK system uses six cores. Thus, as shown in the chart 600, the FPGA system 100 provides throughput performance comparable to typical SR-IOV systems 700 with reduced NIC port usage and without using additional processor cores.
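The ratios quoted above follow directly from the measured figures; a quick arithmetic check using the approximate numbers from chart 600:

```python
# Approximate throughput figures from chart 600 (Gbps).
sriov, fpga, dpdk = 40.0, 36.4, 15.0

print(f"SR-IOV vs DPDK: {sriov / dpdk:.2f}x")  # ~2.67x
print(f"SR-IOV vs FPGA: {sriov / fpga:.2f}x")  # ~1.10x
print(f"FPGA vs DPDK:   {fpga / dpdk:.2f}x")   # ~2.43x
```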
[0018] Referring back to FIG. 1, each computing device 102 may be embodied as any type of computation or computer device capable of performing the functions described herein, including, without limitation, a computer, a server, a workstation, a desktop computer, a laptop computer, a notebook computer, a tablet computer, a mobile computing device, a wearable computing device, a network appliance, a web appliance, a distributed computing system, a processor-based system, and/or a consumer electronic device. As shown in FIG. 1, the computing device 102 illustratively includes the processor 120, the accelerator 128, an input/output subsystem 130, a memory 132, a data storage device 134, a communication subsystem 136, and/or other components and devices commonly found in a server or similar computing device. Of course, the computing device 102 may include other or additional components, such as those commonly found in a server computer (e.g., various input/output devices), in other embodiments. Additionally, in some embodiments, one or more of the illustrative components may be incorporated in, or otherwise form a portion of, another component. For example, the memory 132, or portions thereof, may be incorporated in the processor 120 in some embodiments.

[0019] The processor 120 may be embodied as any type of processor capable of performing the functions described herein. Illustratively, the processor 120 is a multi-core processor 120 having two processor cores 122. Of course, in other embodiments the processor 120 may be embodied as a single or multi-core processor(s), digital signal processor, microcontroller, or other processor or processing/controlling circuit. Similarly, the memory 132 may be embodied as any type of volatile or non-volatile memory or data storage capable of performing the functions described herein. In operation, the memory 132 may store various data and software used during operation of the computing device 102 such as operating systems, applications, programs, libraries, and drivers. The memory 132 is communicatively coupled to the processor 120 via the I/O subsystem 130, which may be embodied as circuitry and/or components to facilitate input/output operations with the processor 120, the accelerator 128, the memory 132, and other components of the computing device 102. For example, the I/O subsystem 130 may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, sensor hubs, firmware devices, communication links (i.e., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.) and/or other components and subsystems to facilitate the input/output operations. In some embodiments, the I/O subsystem 130 may form a portion of a system-on-a-chip (SoC) and be incorporated, along with the processor 120, the memory 132, and other components of the computing device 102, on a single integrated circuit chip.
[0020] The data storage device 134 may be embodied as any type of device or devices configured for short-term or long-term storage of data such as, for example, memory devices and circuits, memory cards, hard disk drives, solid-state drives, non-volatile flash memory, or other data storage devices. The computing device 102 also includes the communication subsystem 136, which may be embodied as any communication circuit, device, or collection thereof, capable of enabling communications between the computing device 102 and other remote devices over the computer network 104. For example, the communication subsystem 136 may be embodied as or otherwise include a network interface controller (NIC) for sending and/or receiving network data with remote devices. The communication subsystem 136 may be configured to use any one or more communication technology (e.g., wired or wireless communications) and associated protocols (e.g., Ethernet, InfiniBand®, Bluetooth®, Wi-Fi®, WiMAX, 3G, 4G LTE, etc.) to effect such communication.
[0021] As shown in FIG. 1, the computing device 102 includes an accelerator 128. The accelerator 128 may be embodied as a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), an artificial intelligence
(AI) accelerator, a coprocessor, or other digital logic device capable of performing accelerated network functions. Illustratively, as described further below in connection with FIG. 2, the accelerator 128 is an FPGA included in a multi-chip package with the processor 120. The accelerator 128 may be coupled to the processor 120 via multiple high-speed connection interfaces including the coherent interconnect 124 and one or more non-coherent interconnects 126.
[0022] The coherent interconnect 124 may be embodied as a high-speed data interconnect capable of maintaining data coherency between a last-level cache of the processor 120, any cache or other local memory of the accelerator 128, and the memory 132. For example, the coherent interconnect 124 may be embodied as an in-die interconnect (IDI), Intel UltraPath Interconnect (UPI), QuickPath Interconnect (QPI), Intel Accelerator Link (IAL), or other coherent interconnect. The non-coherent interconnect 126 may be embodied as a high-speed data interconnect that does not provide data coherency, such as a peripheral bus (e.g., a PCI Express bus), a fabric interconnect such as Intel Omni-Path Architecture, or other non-coherent interconnect. Additionally or alternatively, it should be understood that in some embodiments, the coherent interconnect 124 and/or the non-coherent interconnect 126 may be merged to form an interconnect that is capable of serving both functions. In some embodiments, the computing device 102 may include multiple coherent interconnects 124, multiple non-coherent interconnects 126, and/or multiple merged interconnects.
[0023] The computing device 102 may further include one or more peripheral devices
138. The peripheral devices 138 may include any number of additional input/output devices, interface devices, and/or other peripheral devices. For example, in some embodiments, the peripheral devices 138 may include a touch screen, graphics circuitry, a graphical processing unit (GPU) and/or processor graphics, an audio device, a microphone, a camera, a keyboard, a mouse, a network interface, and/or other input/output devices, interface devices, and/or peripheral devices.
[0024] The computing devices 102 may be configured to transmit and receive data with each other and/or other devices of the system 100 over the network 104. The network 104 may be embodied as any number of various wired and/or wireless networks. For example, the network 104 may be embodied as, or otherwise include, a wired or wireless local area network (LAN), and/or a wired or wireless wide area network (WAN). As such, the network 104 may include any number of additional devices, such as additional computers, routers, and switches, to facilitate communications among the devices of the system 100. In the illustrative embodiment, the network 104 is embodied as a local Ethernet network. The network 104 includes an illustrative switch 106, which may be embodied as a top-of-rack (ToR) switch, a middle-of-rack (MoR) switch, or other switch. Of course, the network 104 may include multiple switches 106 and other network devices.
[0025] Referring now to FIG. 2, diagram 200 illustrates one potential embodiment of a computing device 102. As shown, the computing device 102 includes a multi-chip package (MCP) 202. The MCP 202 includes the processor 120 and the accelerator 128, as well as the coherent interconnect 124 and the non-coherent interconnect 126. Illustratively, the accelerator 128 is an FPGA, which may be embodied as an integrated circuit including programmable digital logic resources that may be configured after manufacture. The FPGA 128 may include, for example, a configurable array of logic blocks in communication over a configurable data interchange. As shown, the computing device 102 further includes the memory 132 and the communication subsystem 136. The FPGA 128 is coupled to the communication subsystem 136 and thus may send and/or receive network data. Additionally, although illustrated in FIG. 2 as discrete components separate from the MCP 202, it should be understood that in some embodiments the memory 132 and/or the communication subsystem 136 may also be incorporated in the MCP 202.
[0026] As shown, the FPGA 128 includes an FPGA interface unit (FIU) 204, which may be embodied as digital logic resources that are configured by a manufacturer, vendor, or other entity associated with the computing device 102. The FIU 204 implements the interface protocols and manageability for links between the processor 120 and the FPGA 128. In some embodiments, the FIU 204 may also provide platform capabilities, such as Intel Virtualization Technology for directed I/O (Intel VT-d), security, error monitoring, performance monitoring, power and thermal management, partial reconfiguration, etc. As shown, the FIU 204 further includes an UltraPath Interconnect (UPI) block 206 coupled to the coherent interconnect 124 and a PCI Express (PCIe) block 208 coupled to the non-coherent interconnect 126. The UPI block 206 and the PCIe block 208 may be embodied as digital logic configured to transport data between the FPGA 128 and the processor 120 over the physical interconnects 124, 126, respectively. The physical coherent UPI block 206 and the physical non-coherent block 208 and their associated interconnects may be multiplexed as a set of virtual channels (VCs) connected to a VC steering block.
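The following is a minimal C sketch of the kind of virtual-channel steering decision the FIU 204 might make between the coherent UPI link and the non-coherent PCIe link. The selection policy shown (cache-line-sized requests prefer the coherent link) and all identifiers are illustrative assumptions, not the FIU's actual logic.

/* Hypothetical VC steering between the coherent and non-coherent links. */
#include <stddef.h>
#include <stdio.h>

enum phys_link { LINK_UPI_COHERENT, LINK_PCIE_NONCOHERENT };

#define CACHE_LINE 64u

/* Assumed policy: small, cache-line granular requests benefit from
 * coherency; bulk transfers go over the non-coherent link. */
static enum phys_link steer_vc(size_t request_bytes)
{
    return (request_bytes <= CACHE_LINE) ? LINK_UPI_COHERENT
                                         : LINK_PCIE_NONCOHERENT;
}

int main(void)
{
    printf("64 B  -> %s\n", steer_vc(64) == LINK_UPI_COHERENT ? "UPI" : "PCIe");
    printf("4 KiB -> %s\n", steer_vc(4096) == LINK_UPI_COHERENT ? "UPI" : "PCIe");
    return 0;
}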
[0027] The FPGA 128 further includes one or more accelerated function units (AFUs) 210. Each AFU 210 may be embodied as digital logic configured to perform one or more accelerated networking functions. For example, each AFU 210 may be embodied as smart NIC logic, smart vSwitch logic, or other logic that performs one or more network workloads (e.g., user-designed custom data path logic such as forwarding, classification, packet steering, encapsulation, security, quality-of-service, etc.). Illustratively, each AFU 210 may be configured by a user of the computing device 102. Each AFU 210 may access data in the memory 132 using one or more virtual channels (VCs) that are backed by the coherent interconnect 124 and/or the non-coherent interconnect 126 using the FIU 204. Although illustrated in FIG. 2 as an FPGA 128 including multiple AFUs 210, it should be understood that in some embodiments the accelerator 128 may be embodied as an ASIC, coprocessor, or other accelerator 128 that also includes one or more AFUs 210 to provide accelerated networking functions. In those embodiments, the AFUs 210 may be fixed-function or otherwise not user configurable.
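As an illustration only, the C sketch below models how the accelerated networking functions assigned to the AFUs 210 could be described in host software. The structure and identifiers (afu_desc, AFU_FORWARDING, and so on) are hypothetical and do not correspond to any particular FPGA toolchain API.

/* Hypothetical host-side description of AFU workloads. */
#include <stdio.h>

enum afu_workload { AFU_FORWARDING, AFU_CLASSIFICATION, AFU_ENCAPSULATION, AFU_QOS };

static const char *workload_name[] = {
    "forwarding", "classification", "encapsulation", "quality-of-service"
};

struct afu_desc {
    unsigned          slot;         /* AFU slot in the FPGA */
    enum afu_workload workload;     /* accelerated function it implements */
    int               coherent_vc;  /* 1 = backed by the coherent interconnect */
};

int main(void)
{
    /* Two hypothetical AFUs: a vSwitch forwarder on a coherent virtual
     * channel and a tunnel encapsulation engine on a non-coherent one. */
    struct afu_desc afus[] = {
        { 0, AFU_FORWARDING,    1 },
        { 1, AFU_ENCAPSULATION, 0 },
    };

    for (unsigned i = 0; i < sizeof(afus) / sizeof(afus[0]); i++)
        printf("AFU slot %u: %s, coherent VC: %d\n",
               afus[i].slot, workload_name[afus[i].workload],
               afus[i].coherent_vc);
    return 0;
}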
[0028] Referring now to FIG. 3, in an illustrative embodiment, the computing device 102 establishes an environment 300 during operation. The illustrative environment 300 includes one or more virtual network functions (VNFs) 302, a virtual machine monitor (VMM) 304, a virtual I/O block 306, a vSwitch 308, and physical interfaces 310. As shown, the various components of the environment 300 may be embodied as hardware, firmware, software, or a combination thereof. As such, in some embodiments, one or more of the components of the environment 300 may be embodied as circuitry or a collection of electrical devices (e.g., VNF circuitry 302, VMM circuitry 304, virtual I/O block circuitry 306, vSwitch circuitry 308, and/or physical interface circuitry 310). It should be appreciated that, in such embodiments, one or more of the VNF circuitry 302, the VMM circuitry 304, the virtual I/O block circuitry 306, the vSwitch circuitry 308, and/or the physical interface circuitry 310 may form a portion of the processor 120, the accelerator 128, the I/O subsystem 130, and/or other components of the computing device 102. Additionally, in some embodiments, one or more of the illustrative components may form a portion of another component and/or one or more of the illustrative components may be independent of one another.
[0029] The VMM 304 may be embodied as any virtual machine monitor, hypervisor, or other component that allows virtualized workloads to be executed on the computing device 102. The VMM 304 may have complete control over the computing device 102, for example by executing in a non-virtualized host mode, such as ring level 0 and/or VMX-root mode. Each VNF 302 may be embodied as any guest virtual machine, guest operating system, or other guest software configured to perform a virtualized workload on the computing device 102. For example, each VNF 302 may be embodied as a virtual network function (VNF) or other network workload (e.g., user-designed custom data path logic such as forwarding, classification, packet steering, encapsulation, security, quality-of-service, etc.). The VMM 304 may enforce isolation between the VNFs 302 and otherwise enforce platform security. Thus, the computing device 102 may host guests executed by multiple users or other tenants. The VNFs 302 and the VMM 304 are executed by the processor 120. The VMM 304 may configure the accelerator 128 with the vSwitch 308.
[0030] The virtual I/O block 306 may be embodied as one or more I/O ports, queues, or other I/O interfaces that may be accessed by the VNFs 302. The virtual I/O block 306 may be coupled with, embodied as, or otherwise include one or more paravirtualized drivers, which may provide high performance I/O for the VNFs 302. Illustratively, the virtual I/O block 306 may be embodied as one or more virtio queues, drivers, and/or other associated components. The VMM 304 may be further configured to couple each VNF 302 to a paravirtualization interface provided by the virtual I/O block 306.
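A greatly simplified descriptor ring in the spirit of the virtio queues exposed by the virtual I/O block 306 is sketched below in C. Real virtio negotiates features and uses split available/used rings; this sketch only models the producer (VNF) and consumer (vSwitch) relationship and is not the virtio ABI.

/* Toy single-producer/single-consumer ring standing in for a virtio queue. */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 8u                 /* power of two for cheap wrap-around */

struct desc { const char *frame; uint16_t len; };

struct vq {
    struct desc ring[RING_SIZE];
    uint16_t head;                   /* next slot the VNF (producer) fills */
    uint16_t tail;                   /* next slot the vSwitch (consumer) drains */
};

static int vq_post(struct vq *q, const char *frame, uint16_t len)
{
    if ((uint16_t)(q->head - q->tail) == RING_SIZE)
        return -1;                                    /* queue full */
    q->ring[q->head % RING_SIZE] = (struct desc){ frame, len };
    q->head++;
    return 0;
}

static const struct desc *vq_poll(struct vq *q)
{
    if (q->tail == q->head)
        return NULL;                                  /* queue empty */
    return &q->ring[q->tail++ % RING_SIZE];
}

int main(void)
{
    struct vq q = { 0 };
    const struct desc *d;

    vq_post(&q, "frame 0 from VNF 302", 64);
    vq_post(&q, "frame 1 from VNF 302", 128);
    while ((d = vq_poll(&q)) != NULL)
        printf("vSwitch dequeued %u bytes: %s\n", (unsigned)d->len, d->frame);
    return 0;
}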
[0031] Each physical interface 310 may be embodied as an Ethernet PHY, MAC, or other physical interface. Each physical interface 310 is coupled to a port 312 of an external switch 106 (e.g., a ToR switch) by a network link. The network link may include one or more communications lanes, wires, backplane, optical links, communication channels, and/or other communication components.
[0032] The vSwitch 308 is configured to process network traffic associated with the VNFs 302. The network traffic may be accessed by the accelerator 128 and/or the processor 120 via the virtual I/O block 306. The network traffic may be processed within a coherency domain shared by the accelerator 128 and the processor 120. For example, the network traffic may be communicated between the processor 120 and the accelerator 128 via the coherent interconnect 124. The vSwitch 308 may forward network traffic from the VNFs 302 to the switch 106 and/or from the switch 106 to the VNFs 302 via a physical interface 310 and the corresponding port 312. The vSwitch 308 may also forward network traffic between multiple VNFs 302.
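The forwarding behavior described above can be summarized by the following C sketch, in which traffic destined for another local VNF 302 is switched internally and all other traffic egresses through the physical interface 310 toward the switch 106. The MAC addresses and the lookup table are assumptions made for illustration only.

/* Minimal destination lookup: local VNF port or the physical uplink. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PORT_PHY  (-1)     /* sentinel: send to the physical uplink */

struct fib_entry { uint8_t mac[6]; int vnf_port; };

/* Hypothetical FIB: two local VNFs; anything unknown goes to the uplink. */
static const struct fib_entry fib[] = {
    { { 0x02, 0, 0, 0, 0, 0x01 }, 0 },   /* VNF 0 */
    { { 0x02, 0, 0, 0, 0, 0x02 }, 1 },   /* VNF 1 */
};

static int lookup_dst(const uint8_t mac[6])
{
    for (unsigned i = 0; i < sizeof(fib) / sizeof(fib[0]); i++)
        if (memcmp(fib[i].mac, mac, 6) == 0)
            return fib[i].vnf_port;
    return PORT_PHY;
}

int main(void)
{
    const uint8_t to_vnf1[6]   = { 0x02, 0, 0, 0, 0, 0x02 };
    const uint8_t to_remote[6] = { 0x02, 0, 0, 0, 0, 0x99 };

    printf("dst VNF1   -> port %d\n", lookup_dst(to_vnf1));    /* 1 */
    printf("dst remote -> port %d\n", lookup_dst(to_remote));  /* -1: uplink */
    return 0;
}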
[0033] Referring now to FIG. 4, one potential embodiment of the accelerator 128 is shown. The illustrated accelerator 128 may include the virtual I/O block 306, the vSwitch 308, and/or the physical interface 310 as described above. Those components of the accelerator 128 may be embodied as, for example, an AFU 210 of the accelerator 128. As shown, the accelerator 128 includes a full packet processing pipeline. In particular, the illustrative accelerator 128 includes a gDMA block 402, a pipeline configuration block 404, an Open vSwitch (OVS) virtio handler block 406, an I/O configuration block 408, a retimer card 410, a 10 Gb MAC block 412, an ingress rate limiter 414, a tunneling block 416 with VxLAN de-encapsulation block 418 and Network Virtualization using Generic Routing Encapsulation (NVGRE) de-encapsulation block 420, an OpenFlow classifier 422, a forwarding information base (FIB) and action block 424 with exact match block 426, megaflow block 428, and OpenFlow action block 430, a packet infrastructure 432, a tunneling block 434 with VxLAN encapsulation block 436 and NVGRE encapsulation block 438, a crossbar switch 440, and an egress QoS/traffic shaping block 442.
[0034] In the illustrative embodiment, the physical interface 310 may be embodied as the MAC 412 and/or retimer 410. As described above, those components are coupled to a port 312 of an external switch 106 via a network link. Similarly, in the illustrative embodiment, the virtual I/O block 306 may be embodied as the gDMA 402 and/or the virtio handler 406. The vSwitch 308 may be embodied as the remaining components of the accelerator 128. For example, the illustrative accelerator 128 may receive incoming network traffic via the retimer 410 and MAC 412 and provide that data to the tunnels 416. Similarly, the accelerator may receive network traffic generated by the VNFs 302 via the gDMA 402, virtio handler 406, and ingress rate limiter 414 and provide that data to the tunnels 416. The accelerator 128 processes the network traffic using the tunnels 416, the packet infrastructure 432 including the OF classifier 422 and FIB/action tables 424, and the tunnels 434. After processing, the network traffic is provided to the crossbar switch 440 and the egress QoS/traffic shaping 442. Network traffic destined for the switch 106 is provided to the MAC 412 for egress. Network traffic destined for the VNFs 302 is provided to the virtio handler 406.
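The pipeline order of FIG. 4 can be modeled, very roughly, by the C sketch below: de-encapsulation, classification, FIB lookup and action, re-encapsulation, and egress QoS. Each stage here is a stub that only tags the packet; in the accelerator 128 these stages are implemented in FPGA logic rather than software, so this is only an ordering sketch.

/* Stage-by-stage model of the FIG. 4 packet processing pipeline. */
#include <stdio.h>

struct pkt { const char *src; int tunnel_stripped; int classified; int shaped; };

typedef void (*stage_fn)(struct pkt *);

static void de_encap(struct pkt *p)   { p->tunnel_stripped = 1; }  /* VxLAN/NVGRE */
static void classify(struct pkt *p)   { p->classified = 1; }       /* OpenFlow classifier */
static void fib_act(struct pkt *p)    { (void)p; }                 /* exact match / megaflow */
static void re_encap(struct pkt *p)   { p->tunnel_stripped = 0; }
static void egress_qos(struct pkt *p) { p->shaped = 1; }

int main(void)
{
    stage_fn pipeline[] = { de_encap, classify, fib_act, re_encap, egress_qos };
    struct pkt p = { "frame from MAC 412", 0, 0, 0 };

    for (unsigned i = 0; i < sizeof(pipeline) / sizeof(pipeline[0]); i++)
        pipeline[i](&p);

    printf("%s: classified=%d shaped=%d\n", p.src, p.classified, p.shaped);
    return 0;
}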
[0035] Referring now to FIG. 5, in use, the computing device 102 may execute a method 500 for accelerated network processing. It should be appreciated that, in some embodiments, the operations of the method 500 may be performed by one or more components of the environment 300 of the computing device 102 as shown in FIG. 3. The method 500 begins in block 502, in which the computing device 102 configures an AFU 210 of the accelerator 128 for vSwitch 308 operations. The computing device 102 may, for example, perform configuration or partial configuration of the FPGA 128 with bitstream or other code for the vSwitch 308 functions. The computing device 102 may also configure network routing rules, flow rules, actions, QoS, and other network configuration of the vSwitch 308.
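A hypothetical sketch of block 502 follows: loading a partial-reconfiguration bitstream for the vSwitch 308 AFU and then installing flow rules. The functions load_bitstream() and add_flow_rule(), the file name, and the rule syntax are placeholders for illustration and do not represent a real FPGA management or Open vSwitch API.

/* Placeholder configuration sequence for the vSwitch AFU. */
#include <stdio.h>

static int load_bitstream(const char *path)
{
    /* In a real system this would hand the bitstream to the FIU for
     * partial reconfiguration of one AFU slot. */
    printf("partial reconfiguration with %s\n", path);
    return 0;
}

static int add_flow_rule(const char *match, const char *action)
{
    printf("flow rule: match \"%s\" -> action \"%s\"\n", match, action);
    return 0;
}

int main(void)
{
    if (load_bitstream("vswitch_afu.gbs") != 0)   /* hypothetical file name */
        return 1;
    /* Illustrative rules only. */
    add_flow_rule("in_port=virtio0", "output:phy0");
    add_flow_rule("in_port=phy0,vlan=100", "output:virtio1");
    return 0;
}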
[0036] In block 504, the computing device 102 binds one or more ports or other physical interfaces 310 of the accelerator 128 to an external switch 106. For example, the computing device 102 may bind one or more MACs, PHYs, or other Ethernet ports of the accelerator 128 to corresponding port(s) of the external switch 106. The ports of the accelerator 128 may be embodied as fixed-function hardware ports or reconfigurable “soft” ports. In some embodiments, the ports of the accelerator 128 may be pre-configured or otherwise provided by a manufacturer or other entity associated with the accelerator 128.
[0037] In block 506, the computing device 102 configures a VNF 302 for network processing. The computing device 102 may, for example, load the VNF 302 or otherwise initialize the VNF 302. The VNF 302 may be provided by a tenant or other user of the computing device 102. In block 508, the computing device 102 binds the VNF 302 to a virtio queue of the accelerator 128 or other paravirtualized interface of the accelerator 128. The computing device 102 may, for example, configure the VNF 302 with one or more paravirtualized drivers, queues, buffers, or other interfaces to the accelerator 128.
[0038] In block 510, the computing device 102 determines whether additional VNFs 302 should be configured. If so, the method 500 loops back to block 506 to load additional VNFs 302. Each additional VNF 302 may be bound to one or more dedicated virtio queues or other interfaces of the accelerator 128. However, the accelerator 128 need not be bound to additional ports of the switch 106. Referring back to block 510, if no additional VNFs 302 remain to be configured, the method 500 advances to block 512.
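The C sketch below illustrates blocks 506-510: each VNF 302 is bound to its own dedicated virtio queue on the accelerator 128, while all of them share the single uplink port bound in block 504, so no additional switch 106 ports are consumed. The identifiers and the fixed VNF count are assumptions for illustration.

/* Binding table: one virtio queue per VNF, one shared physical uplink. */
#include <stdio.h>

#define MAX_VNFS 4

struct binding { int vnf_id; int virtio_queue; int uplink_port; };

int main(void)
{
    struct binding table[MAX_VNFS];
    const int uplink = 0;                  /* one port 312 on the ToR switch */

    for (int vnf = 0; vnf < MAX_VNFS; vnf++) {
        table[vnf].vnf_id = vnf;
        table[vnf].virtio_queue = vnf;     /* dedicated queue per VNF */
        table[vnf].uplink_port = uplink;   /* no additional switch ports */
    }

    for (int vnf = 0; vnf < MAX_VNFS; vnf++)
        printf("VNF %d -> virtio queue %d, uplink %d\n",
               table[vnf].vnf_id, table[vnf].virtio_queue,
               table[vnf].uplink_port);
    return 0;
}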
[0039] In block 512, the computing device 102 processes network workloads with the VNFs 302 and processes network traffic with the virtual switch 308 of the accelerator 128. Each of the VNFs 302 may generate and/or receive network traffic (e.g., packet frames). For example, each VNF 302 may read and/or write network packet data into buffers in system memory 132 corresponding to virtual I/O queues. The vSwitch 308 may perform full packet processing pipeline operations on that network traffic data. In some embodiments, in block 514 the computing device 102 processes the network traffic using the VNFs 302 and the vSwitch 308 in the same coherency domain. For example, the computing device 102 may transfer data (e.g., virtio queue data) between the processor 120 and the accelerator 128 via the coherent interconnect 124. The VNFs 302 and the vSwitch 308 may process network data concurrently, simultaneously, or otherwise, with the coherent interconnect 124 providing data coherency between the last-level cache of the processor 120, cache or other local memory of the accelerator 128, and the memory 132. In some embodiments, full packet frames may be transferred between the processor 120 and the accelerator 128, so that multiple switching actions can happen simultaneously. After processing the network data, the method 500 loops back to block 512 to continue processing network traffic with the VNFs 302 and the vSwitch 308. In some embodiments, the computing device 102 may dynamically load and/or unload VNFs 302 during execution of the method 500.
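As a software analogue of block 514, the C sketch below runs a producer thread (standing in for a VNF 302) and a consumer thread (standing in for the vSwitch 308) over the same buffers in shared, coherent memory, with no frame copies across an I/O boundary. In the described system the consumer is the AFU 210 on the accelerator 128; using two host threads here is an assumption made so the sketch can run anywhere.

/* Shared-coherency-domain model: producer and consumer share buffers. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define FRAMES 4

static char frames[FRAMES][64];
static atomic_int produced;          /* frames written by the "VNF" */

static void *vnf_producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < FRAMES; i++) {
        snprintf(frames[i], sizeof(frames[i]), "frame %d from VNF", i);
        atomic_fetch_add_explicit(&produced, 1, memory_order_release);
    }
    return NULL;
}

static void *vswitch_consumer(void *arg)
{
    (void)arg;
    int seen = 0;
    while (seen < FRAMES) {
        while (atomic_load_explicit(&produced, memory_order_acquire) <= seen)
            ;                        /* busy-wait; the shared view stays consistent */
        printf("vSwitch processed: %s\n", frames[seen]);
        seen++;
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, vnf_producer, NULL);
    pthread_create(&c, NULL, vswitch_consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}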
[0040] It should be appreciated that, in some embodiments, the method 500 may be embodied as various instructions stored on a computer-readable media, which may be executed by the processor 120, the accelerator 128, and/or other components of the computing device 102 to cause the computing device 102 to perform the method 500. The computer-readable media may be embodied as any type of media capable of being read by the computing device 102 including, but not limited to, the memory 132, the data storage device 134, firmware devices, other memory or data storage devices of the computing device 102, portable media readable by a peripheral device 138 of the computing device 102, and/or other media.
EXAMPLES
[0041] Illustrative examples of the technologies disclosed herein are provided below. An embodiment of the technologies may include any one or more, and any combination of, the examples described below.
[0042] Example 1 includes a computing device for accelerated network processing, the computing device comprising: an accelerator to couple a first network port of a virtual switch of the accelerator with a second network port of a network switch via a network link; and a processor to execute a plurality of virtual network functions in response to coupling of the first network port with the second network port; wherein the virtual switch is to process network traffic associated with the plurality of virtual network functions in response to execution of the plurality of virtual network functions.
[0043] Example 2 includes the subject matter of Example 1, and further comprising a virtual machine monitor to configure the accelerator with the virtual switch.
[0044] Example 3 includes the subject matter of any of Examples 1 and 2, and wherein the accelerator comprises a field-programmable gate array and wherein the virtual switch comprises an application function unit of the field-programmable gate array.
[0045] Example 4 includes the subject matter of any of Examples 1-3, and wherein to process the network traffic comprises to process the network traffic within a coherency domain shared by the accelerator and the processor.
[0046] Example 5 includes the subject matter of any of Examples 1-4, and further comprising: a coherent interconnect that couples the processor and the accelerator; wherein to process the network traffic comprises to communicate the network traffic between the processor and the accelerator via the coherent interconnect.
[0047] Example 6 includes the subject matter of any of Examples 1-5, and further comprising: a virtual machine monitor to couple each of the virtual network functions to a paravirtualization interface of the accelerator; wherein to process the network traffic comprises to process network traffic associated with the paravirtualization interface.
[0048] Example 7 includes the subject matter of any of Examples 1-6, and wherein to process the network traffic comprises to forward network traffic from the plurality of network functions to the network switch via the first network port and the second network port.
[0049] Example 8 includes the subject matter of any of Examples 1-7, and wherein to process the network traffic comprises to forward network traffic received from the network switch via the first network port and the second network port to the plurality of network functions.
[0050] Example 9 includes the subject matter of any of Examples 1-8, and wherein to process the network traffic comprises to forward network traffic between a first virtual network function and a second virtual network function.
[0051] Example 10 includes the subject matter of any of Examples 1-9, and wherein each of the virtual network functions comprises a virtual machine.
[0052] Example 11 includes the subject matter of any of Examples 1-10, and wherein the accelerator comprises an application-specific integrated circuit.
[0053] Example 12 includes the subject matter of any of Examples 1-11, and wherein the processor and the accelerator are included in a multi-chip package of the computing device.
[0054] Example 13 includes a method for accelerated network processing, the method comprising: coupling, by a computing device, a first network port of a virtual switch of an accelerator of the computing device with a second network port of a network switch via a network link; executing, by the computing device, a plurality of virtual network functions with a processor of the computing device in response to coupling the first network port with the second network port; and processing, by the computing device with the virtual switch of the accelerator, network traffic associated with the plurality of virtual network functions in response to executing the plurality of virtual network functions.
[0055] Example 14 includes the subject matter of Example 13, and further comprising configuring, by the computing device, the accelerator with the virtual switch.
[0056] Example 15 includes the subject matter of any of Examples 13 and 14, and wherein the accelerator comprises a field-programmable gate array and wherein the virtual switch comprises an application function unit of the field-programmable gate array.
[0057] Example 16 includes the subject matter of any of Examples 13-15, and wherein processing the network traffic comprises processing the network traffic within a coherency domain shared by the accelerator and the processor.
[0058] Example 17 includes the subject matter of any of Examples 13-16, and wherein processing the network traffic comprises communicating the network traffic between the processor and the accelerator via a coherent interconnect of the computing device.
[0059] Example 18 includes the subject matter of any of Examples 13-17, and further comprising: coupling, by the computing device, each of the virtual network functions to a paravirtualization interface of the accelerator; wherein processing the network traffic comprises processing network traffic associated with the paravirtualization interface.
[0060] Example 19 includes the subject matter of any of Examples 13-18, and wherein processing the network traffic comprises forwarding network traffic from the plurality of network functions to the network switch via the first network port and the second network port.
[0061] Example 20 includes the subject matter of any of Examples 13-19, and wherein processing the network traffic comprises forwarding network traffic received from the network switch via the first network port and the second network port to the plurality of network functions.
[0062] Example 21 includes the subject matter of any of Examples 13-20, and wherein processing the network traffic comprises forwarding network traffic between a first virtual network function and a second virtual network function.
[0063] Example 22 includes the subject matter of any of Examples 13-21, and wherein each of the virtual network functions comprises a virtual machine.
[0064] Example 23 includes the subject matter of any of Examples 13-22, and wherein the accelerator comprises an application-specific integrated circuit.
[0065] Example 24 includes the subject matter of any of Examples 13-23, and wherein the processor and the accelerator are included in a multi-chip package of the computing device.
[0066] Example 25 includes a computing device comprising: a processor; and a memory having stored therein a plurality of instructions that when executed by the processor cause the computing device to perform the method of any of Examples 13-24.
[0067] Example 26 includes one or more non-transitory, computer readable storage media comprising a plurality of instructions stored thereon that in response to being executed result in a computing device performing the method of any of Examples 13-24.
[0068] Example 27 includes a computing device comprising means for performing the method of any of Examples 13-24.

Claims

WHAT IS CLAIMED IS:
1. A computing device for accelerated network processing, the computing device comprising:
an accelerator to couple a first network port of a virtual switch of the accelerator with a second network port of a network switch via a network link; and
a processor to execute a plurality of virtual network functions in response to coupling of the first network port with the second network port;
wherein the virtual switch is to process network traffic associated with the plurality of virtual network functions in response to execution of the plurality of virtual network functions.
2. The computing device of claim 1, further comprising a virtual machine monitor to configure the accelerator with the virtual switch.
3. The computing device of claim 2, wherein the accelerator comprises a field-programmable gate array and wherein the virtual switch comprises an application function unit of the field-programmable gate array.
4. The computing device of claim 1, wherein to process the network traffic comprises to process the network traffic within a coherency domain shared by the accelerator and the processor.
5. The computing device of claim 4, further comprising:
a coherent interconnect that couples the processor and the accelerator;
wherein to process the network traffic comprises to communicate the network traffic between the processor and the accelerator via the coherent interconnect.
6. The computing device of claim 1, further comprising:
a virtual machine monitor to couple each of the virtual network functions to a paravirtualization interface of the accelerator;
wherein to process the network traffic comprises to process network traffic associated with the paravirtualization interface.
7. The computing device of claim 1, wherein to process the network traffic comprises to forward network traffic from the plurality of network functions to the network switch via the first network port and the second network port.
8. The computing device of claim 1, wherein to process the network traffic comprises to forward network traffic received from the network switch via the first network port and the second network port to the plurality of network functions.
9. The computing device of claim 1, wherein to process the network traffic comprises to forward network traffic between a first virtual network function and a second virtual network function.
10. The computing device of claim 1, wherein each of the virtual network functions comprises a virtual machine.
11. The computing device of claim 1, wherein the accelerator comprises an application-specific integrated circuit.
12. The computing device of claim 1, wherein the processor and the accelerator are included in a multi-chip package of the computing device.
13. A method for accelerated network processing, the method comprising:
coupling, by a computing device, a first network port of a virtual switch of an accelerator of the computing device with a second network port of a network switch via a network link;
executing, by the computing device, a plurality of virtual network functions with a processor of the computing device in response to coupling the first network port with the second network port; and
processing, by the computing device with the virtual switch of the accelerator, network traffic associated with the plurality of virtual network functions in response to executing the plurality of virtual network functions.
14. The method of claim 13, further comprising configuring, by the computing device, the accelerator with the virtual switch.
15. The method of claim 14, wherein the accelerator comprises a field-programmable gate array and wherein the virtual switch comprises an application function unit of the field-programmable gate array.
16. The method of claim 13, wherein processing the network traffic comprises processing the network traffic within a coherency domain shared by the accelerator and the processor.
17. The method of claim 16, wherein processing the network traffic comprises communicating the network traffic between the processor and the accelerator via a coherent interconnect of the computing device.
18. The method of claim 13, further comprising:
coupling, by the computing device, each of the virtual network functions to a paravirtualization interface of the accelerator;
wherein processing the network traffic comprises processing network traffic associated with the paravirtualization interface.
19. The method of claim 13, wherein processing the network traffic comprises forwarding network traffic from the plurality of network functions to the network switch via the first network port and the second network port.
20. The method of claim 13, wherein processing the network traffic comprises forwarding network traffic received from the network switch via the first network port and the second network port to the plurality of network functions.
21. The method of claim 13, wherein processing the network traffic comprises forwarding network traffic between a first virtual network function and a second virtual network function.
22. The method of claim 13, wherein each of the virtual network functions comprises a virtual machine.
23. A computing device comprising:
a processor; and
a memory having stored therein a plurality of instructions that when executed by the processor cause the computing device to perform the method of any of claims 13-22.
24. One or more non-transitory, computer readable storage media comprising a plurality of instructions stored thereon that in response to being executed result in a computing device performing the method of any of claims 13-22.
25. A computing device comprising means for performing the method of any of claims 13-22.
PCT/US2019/019377 2018-02-25 2019-02-25 Technologies for nic port reduction with accelerated switching WO2019165355A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
DE112019000965.6T DE112019000965T5 (en) 2018-02-25 2019-02-25 TECHNOLOGIES TO REDUCE NIC CONNECTIONS WITH ACCELERATED CIRCUIT
CN201980006768.0A CN111492628A (en) 2018-02-25 2019-02-25 Techniques for NIC port reduction with accelerated switching

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862634874P 2018-02-25 2018-02-25
US62/634,874 2018-02-25

Publications (1)

Publication Number Publication Date
WO2019165355A1 true WO2019165355A1 (en) 2019-08-29

Family

ID=67687342

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/019377 WO2019165355A1 (en) 2018-02-25 2019-02-25 Technologies for nic port reduction with accelerated switching

Country Status (3)

Country Link
CN (1) CN111492628A (en)
DE (1) DE112019000965T5 (en)
WO (1) WO2019165355A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114915598B (en) * 2021-02-08 2023-10-20 腾讯科技(深圳)有限公司 Network acceleration method and device of application program and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130145072A1 (en) * 2004-07-22 2013-06-06 Xsigo Systems, Inc. High availability and I/O aggregation for server environments
EP2722767A1 (en) * 2012-10-16 2014-04-23 Solarflare Communications Inc Encapsulated accelerator
US20140215463A1 (en) * 2013-01-31 2014-07-31 Broadcom Corporation Systems and methods for handling virtual machine packets
US20170019351A1 (en) * 2014-03-31 2017-01-19 Juniper Networks, Inc. Network interface card having embedded virtual router
US20160232019A1 (en) * 2015-02-09 2016-08-11 Broadcom Corporation Network Interface Controller with Integrated Network Flow Processing

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3846398A1 (en) * 2019-12-30 2021-07-07 Avago Technologies International Sales Pte. Limited Hyperscalar packet processing
US11184278B2 (en) 2019-12-30 2021-11-23 Avago Technologies International Sales Pte. Limited Hyperscalar packet processing
US11558289B2 (en) 2019-12-30 2023-01-17 Avago Technologies International Sales Pte. Limited Hyperscalar packet processing
WO2021143135A1 (en) * 2020-01-13 2021-07-22 苏州浪潮智能科技有限公司 Far-end data migration device and method based on fpga cloud platform
US11868297B2 (en) 2020-01-13 2024-01-09 Inspur Suzhou Intelligent Technology Co., Ltd. Far-end data migration device and method based on FPGA cloud platform
KR20210132348A (en) * 2020-04-27 2021-11-04 한국전자통신연구원 Computing resource disaggregated collaboration system of interconnected an optical line and, resource disaggregated collaboration method
KR102607421B1 (en) 2020-04-27 2023-11-29 한국전자통신연구원 Computing resource disaggregated collaboration system of interconnected an optical line and, resource disaggregated collaboration method

Also Published As

Publication number Publication date
CN111492628A (en) 2020-08-04
DE112019000965T5 (en) 2021-04-15

Similar Documents

Publication Publication Date Title
US20220197685A1 (en) Technologies for application-specific network acceleration with unified coherency domain
US10997106B1 (en) Inter-smartNIC virtual-link for control and datapath connectivity
US11194753B2 (en) Platform interface layer and protocol for accelerators
US8776090B2 (en) Method and system for network abstraction and virtualization for a single operating system (OS)
WO2019165355A1 (en) Technologies for nic port reduction with accelerated switching
US20180109471A1 (en) Generalized packet processing offload in a datacenter
US20230115114A1 (en) Hardware assisted virtual switch
US9678912B2 (en) Pass-through converged network adaptor (CNA) using existing ethernet switching device
US11372787B2 (en) Unified address space for multiple links
WO2016209502A1 (en) Netflow collection and export offload using network silicon
US11586575B2 (en) System decoder for training accelerators
KR20150030738A (en) Systems and methods for input/output virtualization
US11303638B2 (en) Atomic update of access control list rules
WO2014031430A1 (en) Systems and methods for sharing devices in a virtualization environment
US20230185732A1 (en) Transparent encryption
US11321179B1 (en) Powering-down or rebooting a device in a system fabric
CA3167334C (en) Zero packet loss upgrade of an io device
US20190042432A1 (en) Reducing cache line collisions
US20190042434A1 (en) Dynamic prefetcher tuning
WO2019173937A1 (en) Improved memory-mapped input/output (mmio) region access
US11138072B2 (en) Protected runtime mode
US20240028381A1 (en) Virtual i/o device management
Wang et al. High Performance Network Virtualization Architecture on FPGA SmartNIC
Wang et al. Design and implementation of a cloud computing-oriented virtual 10-Gigabit NIC
Deri et al. Towards wire-speed network monitoring using Virtual Machines

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19757928

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19757928

Country of ref document: EP

Kind code of ref document: A1