EP3563535B1 - Transmission of messages by acceleration components configured to accelerate a service - Google Patents

Transmission of messages by acceleration components configured to accelerate a service

Info

Publication number
EP3563535B1
Authority
EP
European Patent Office
Prior art keywords
acceleration
components
point
acceleration components
component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP17833038.7A
Other languages
German (de)
French (fr)
Other versions
EP3563535A1 (en)
EP3563535B8 (en)
Inventor
Adrian M. CAULFIELD
Eric S. CHUNG
Michael PAPAMICHAEL
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC
Publication of EP3563535A1
Publication of EP3563535B1
Application granted
Publication of EP3563535B8
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/16 Multipoint routing
    • H04L45/66 Layer 2 routing, e.g. in Ethernet based MAN's
    • H04L49/00 Packet switching elements
    • H04L49/65 Re-configuration of fast packet switches
    • H04L49/70 Virtual switches
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Definitions

  • US2007/0053356A1 describes a system for scheduling multirate multicast packets through an interconnection network having a plurality of input ports, a plurality of output ports, and a plurality of input queues comprising multirate multicast packets with rate weights; each input port is operated in a nonblocking manner by scheduling, in correspondence with the packet rate weights, at most as many packets as the number of input queues from each input port to each output port.
  • the scheduling is performed so that each multicast packet is fan-out split through not more than two interconnection networks and not more than two switching times.
  • the system is operated at 100% throughput in a work-conserving, fair, and yet deterministic manner, thereby never congesting the output ports.
  • the system performs arbitration in only one iteration, with mathematical minimum speedup in the interconnection network.
  • the system operates with absolutely no packet reordering issues, no internal buffering of packets in the interconnection network, and hence in a truly cut-through and distributed manner.
  • the present disclosure relates to a system comprising a software plane including a plurality of host components configured to execute instructions corresponding to at least one service and an acceleration plane including a plurality of acceleration components configurable to accelerate the at least one service.
  • the system may further include a network configured to interconnect the software plane and the acceleration plane, the network may include a first top-of-rack (TOR) switch associated with a first subset of the plurality of acceleration components, a second TOR switch associated with a second subset of the plurality of acceleration components, and a third TOR switch associated with a third subset of the plurality of acceleration components, where any of the first subset of the plurality of acceleration components is configurable to transmit a point-to-point message to any of the second subset of the plurality of acceleration components and to transmit the point-to-point message to any of the third subset of the plurality of acceleration components, where any of the second subset of the plurality of acceleration components is configurable to broadcast the point-to-point message to all of the second subset of the plurality of acceleration components, and where any of the third subset of the plurality of acceleration components is configurable to broadcast the point-to-point message to all of the third subset of the plurality of acceleration components.
  • the present disclosure relates to a method for allowing a first acceleration component among a first plurality of acceleration components, associated with a first top-of-rack (TOR) switch, to transmit messages to other acceleration components in an acceleration plane configurable to provide service acceleration for at least one service.
  • the method may include the first acceleration component transmitting a point-to-point message to a second acceleration component, associated with a second TOR switch different from the first TOR switch, and to a third acceleration component, associated with a third TOR switch different from the first TOR switch and the second TOR switch.
  • the method may further include the second acceleration component broadcasting the point-to-point message to all of a second plurality of acceleration components associated with the second TOR switch.
  • the method may further include the third acceleration component broadcasting the point-to-point message to all of a third plurality of acceleration components associated with the third TOR switch.
  • the present disclosure relates to an acceleration component for use among a first plurality of acceleration components, associated with a first top-of-rack (TOR) switch, to transmit messages to other acceleration components in an acceleration plane configurable to provide service acceleration for at least one service.
  • the acceleration component may include a transport component configured to transmit a first point-to-point message to a second acceleration component, associated with a second TOR switch different from the first TOR switch, and to a third acceleration component, associated with a third TOR switch different from the first TOR switch and the second TOR switch.
  • the transport component may further be configured to broadcast a second point-to-point message to all of a second plurality of acceleration components associated with the second TOR switch and to all of a third plurality of acceleration components associated with the third TOR switch.
  • An acceleration component includes, but is not limited to, a hardware component configurable (or configured) to perform a function corresponding to a service being offered by, for example, a data center more efficiently than software running on a general-purpose central processing unit (CPU).
  • Acceleration components may include Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), Application Specific Integrated Circuits (ASICs), Erasable and/or Complex programmable logic devices (PLDs), Programmable Array Logic (PAL) devices, Generic Array Logic (GAL) devices, and massively parallel processor array (MPPA) devices.
  • FPGAs Field Programmable Gate Arrays
  • GPUs Graphics Processing Units
  • ASICs Application Specific Integrated Circuits
  • PLDs Erasable and/or Complex programmable logic devices
  • PAL Programmable Array Logic
  • GAL Generic Array Logic
  • MPPA massively parallel processor array
  • An image file may be used to configure or re-configure acceleration components such as FPGAs.
  • Information included in an image file can be used to program hardware components of an acceleration component (e.g., logic blocks and reconfigurable interconnects of an FPGA) to implement desired functionality.
  • Desired functionality can be implemented to support any service that can be offered via a combination of computing, networking, and storage resources such as via a data center or other infrastructure for delivering a service.
  • Cloud computing may refer to a model for enabling on-demand network access to a shared pool of configurable computing resources.
  • cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources.
  • the shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
  • a cloud computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth.
  • a cloud computing model can also expose various service models, such as, for example, Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”).
  • SaaS Software as a Service
  • PaaS Platform as a Service
  • IaaS Infrastructure as a Service
  • a cloud computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth.
  • a data center deployment may include a hardware acceleration plane and a software plane.
  • the hardware acceleration plane can include a plurality of networked acceleration components (e.g., FPGAs).
  • the software plane can include a plurality of networked software implemented host components (e.g., central processing units (CPUs)).
  • a network infrastructure can be shared between the hardware acceleration plane and the software plane.
  • software-implemented host components are locally linked to corresponding acceleration components. Acceleration components may communicate with each other via a network protocol.
  • any such communication mechanism may be required to meet certain performance requirements, including reliability.
  • the present disclosure provides for a lightweight transport layer for meeting such requirements.
  • the acceleration components may communicate with each other via a Lightweight Transport Layer (LTL).
  • LTL Lightweight Transport Layer
  • FIG. 1 shows architecture 100 that may include a software plane 104 and an acceleration plane 106 in accordance with one example.
  • the software plane 104 may include a collection of software-driven host components (each denoted by the symbol "S") while the acceleration plane may include a collection of acceleration components (each denoted by the symbol "A").
  • each host component may correspond to a server computer that executes machine-readable instructions using one or more central processing units (CPUs).
  • CPUs central processing units
  • these instructions may correspond to a service, such as a text/image/video search service, a translation service, or any other service that may be configured to provide a user of a device a useful result.
  • Each CPU may execute the instructions corresponding to the various components (e.g., software modules or libraries) of the service.
  • Each acceleration component may include hardware logic for implementing functions, such as, for example, portions of services offered by a data center.
  • Acceleration plane 106 may be constructed using a heterogeneous or a homogeneous collection of acceleration components, including different types of acceleration components and/or the same type of acceleration components with different capabilities.
  • acceleration plane 106 may include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), other types of programmable hardware logic devices and so on.
  • Acceleration plane 106 may provide a reconfigurable fabric of acceleration components.
  • a host component may generally be any compute component that may perform operations by using each of its CPU hardware threads to execute machine-readable instructions.
  • An acceleration component may perform operations using several parallel logic elements to perform computational tasks.
  • an FPGA may include several gate arrays that may be configured to perform certain computational tasks in parallel.
  • an acceleration component can perform some operations in less time compared to a software-driven host component.
  • the "acceleration" reflects its potential for accelerating the functions that are performed by the host components.
  • architecture 100 may correspond to a data center environment that includes a large number of servers.
  • the servers may correspond to the host components in software plane 104.
  • architecture 100 may correspond to an enterprise system.
  • architecture 100 may correspond to a user device or appliance which uses at least one host component that has access to two or more acceleration components. Indeed, depending upon the requirements of a service, other implementations of architecture 100 are also possible.
  • Network 120 may couple host components in software plane 104 to the other host components and couple acceleration components in acceleration plane 106 to other acceleration components.
  • host components can use network 120 to interact with one another and acceleration components can use network 120 to interact with one another.
  • Interaction among host components in software plane 104 may be independent of the interaction among acceleration components in acceleration plane 106.
  • two or more acceleration components may communicate in a transparent manner relative to host components in software plane 104, outside the direction of the host components, and without the host components being "aware" of a particular interaction even taking place in acceleration plane 106.
  • Architecture 100 may use any of a variety of different protocols to facilitate communication among acceleration components over network 120 and can use any of a variety of different protocols to facilitate communication between host components over network 120.
  • architecture 100 can use Ethernet protocol to transmit Internet Protocol (IP) packets over network 120.
  • IP Internet Protocol
  • each local host component in a server is given a single physical IP address.
  • the local acceleration component in the same server may adopt the same IP address.
  • the server can determine whether an incoming packet is destined for the local host component or destined for the local acceleration component in different ways. For example, packets that are destined for the local acceleration component can be formulated as UDP packets having a specific port; host-defined packets, on the other hand, may not be formulated in this way.
  • packets belonging to acceleration plane 106 can be distinguished from packets belonging to software plane 104 based on the value of a status flag in each of the packets.
  • architecture 100 can be viewed as two logical networks (software plane 104 and acceleration plane 106) that may share the same physical network communication links. Packets associated with the two logical networks may be distinguished from each other by their respective traffic classes.
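For illustration only, the UDP-port-based distinction described above can be sketched in C++ as follows. The port value, type names, and function are assumptions introduced here for the example (the actual port is a deployment-specific setting, cf. the Dst_port parameter discussed later); this is not the patent's implementation.

    #include <cstdint>

    constexpr uint16_t kAccelerationUdpPort = 51000;  // hypothetical example value

    enum class Plane { Software, Acceleration };

    struct UdpHeader {
        uint16_t src_port;
        uint16_t dst_port;
    };

    // Packets destined for the local acceleration component are formulated as UDP
    // packets carrying a specific destination port; everything else is treated as
    // host (software-plane) traffic.
    Plane classify(const UdpHeader& udp) {
        return (udp.dst_port == kAccelerationUdpPort) ? Plane::Acceleration
                                                      : Plane::Software;
    }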
  • each host component in the architecture 100 is coupled to at least one acceleration component in acceleration plane 106 through a local link.
  • a host component and acceleration component can be arranged together and maintained as a single serviceable unit (e.g., a server) within architecture 100.
  • the host component of a server can be referred to as the "local" host component to distinguish it from other host components that are associated with other servers.
  • acceleration component(s) of a server can be referred to as the "local" acceleration component(s) to distinguish them from other acceleration components that are associated with other servers.
  • host component 108 may be coupled to acceleration component 110 through local link 112 (e.g., a Peripheral Component Interconnect Express (PCIe) link).
  • host component 108 may be a local host component from the perspective of acceleration component 110 and acceleration component 110 may be a local acceleration component from the perspective of host component 108.
  • the local linking of host component 108 and acceleration component 110 can form part of a server.
  • host components in software plane 104 can be locally coupled to acceleration components in acceleration plane 106 through many individual links collectively represented as a localA-to-localS coupling 114.
  • a host component can interact directly with any locally linked acceleration components.
  • a host component can initiate communication to a locally linked acceleration component to cause further communication among multiple acceleration components. For example, a host component can issue a request for a service (or portion thereof) where functionality for the service (or portion thereof) is composed across a group of one or more acceleration components in acceleration plane 106.
  • a host component can also interact indirectly with other acceleration components in acceleration plane 106 to which the host component is not locally linked.
  • host component 108 can indirectly communicate with acceleration component 116 via acceleration component 110.
  • acceleration component 110 communicates with acceleration component 116 via a link 118 of a network (e.g., network 120).
  • Acceleration components in acceleration plane 106 may advantageously be used to accelerate larger scale services robustly in a data center. Substantial portions of complex datacenter services can be mapped to acceleration components (e.g., FPGAs) by using low latency interconnects for computations spanning multiple acceleration components. Acceleration components can also be reconfigured as appropriate to provide different service functionality at different times.
  • Although FIG. 1 shows a certain number of components of architecture 100 arranged in a certain manner, there could be more or fewer components arranged differently. In addition, various components of architecture 100 may be implemented using other technologies as well.
  • FIG. 2 shows a diagram of a system 200 for transmission or retransmission of messages by acceleration components configured to accelerate a service in accordance with one example.
  • system 200 may be implemented as a rack of servers in a data center.
  • Servers 204, 206, and 208 can be included in a rack in the data center.
  • Each of servers 204, 206, and 208 can be coupled to top-of-rack (TOR) switch 210.
  • TOR top-of-rack
  • Other racks, although not shown, may have a similar configuration.
  • Server 204 may further include host component 212 including CPUs 214, 216, etc. Host component 212 along with host components from servers 206 and 208 can be included in software plane 104.
  • Server 204 may also include acceleration component 218. Acceleration component 218 along with acceleration components from servers 206 and 208 can be included in acceleration plane 106.
  • Acceleration component 218 may be directly coupled to a host component 212 via local link 220 (e.g., a PCIe link). Thus, acceleration component 218 can view host component 212 as a local host component. Acceleration component 218 and host component 212 may also be indirectly coupled by way of network interface controller 222 (e.g., used to communicate across network infrastructure 120). In this example, server 204 can load images representing service functionality onto acceleration component 218.
  • Acceleration component 218 may also be coupled to TOR switch 210. Hence, in system 200, acceleration component 218 may represent the path through which host component 212 interacts with other components in the data center (including other host components and other acceleration components). System 200 allows acceleration component 218 to perform processing on packets that are received from (and/or sent to) TOR switch 210 (e.g., by performing encryption, compression, etc.), without burdening the CPU-based operations performed by host component 212.
  • Although FIG. 2 shows a certain number of components of system 200 arranged in a certain manner, there could be more or fewer components arranged differently. In addition, various components of system 200 may be implemented using other technologies as well.
  • FIG. 3 shows a diagram of a system 300 for transmission or retransmission of messages by acceleration components configured to accelerate a service in accordance with one example.
  • IP routing may be used for transmitting or receiving messages among TOR switches, including TOR Switch 1 302, TOR Switch 2 304, and TOR Switch N 306.
  • Each server or server group may have a single "physical" IP address that may be provided by the network administrator.
  • Server Group 1 320, Server Group 2 322, and Server Group N 324 may each include servers, where each of them may have a "physical" IP address.
  • An acceleration component may use its server's physical IP address as its own address.
  • UDP packets may be used.
  • An acceleration component may transmit a message to a selected set of acceleration components associated with different TOR switches using Layer 3 functionality corresponding to the seven-layer open-systems interconnection (OSI) model. Layer 3 functionality may be similar to that provided by the network layer of the OSI model.
  • OSI open-systems interconnection
  • an acceleration component may transmit a point-to-point message to each of the other relevant acceleration components associated with respective TOR switches. Those acceleration components may then use a Layer 2 Ethernet broadcast packet to send the data to all of the acceleration components associated with the TOR switch.
  • Layer 2 functionality may be similar to that provided by the data-link layer of the OSI model.
  • Layer 2 functionality may include media access control, flow control, and error checking. In one example, this step will not require any broadcasting support from a network interconnecting the acceleration plane and the software plane. This may advantageously alleviate the need for multicasting functionality provided by the routers or other network infrastructure. This, in turn, may reduce the complexity of deploying and managing acceleration components.
  • because the broadcast is handled locally under each TOR switch, the higher levels of the network (e.g., the network including routers and other TOR switches) need not handle the broadcast traffic, and the acceleration components that share a TOR switch may advantageously have a higher bandwidth available to them for any transmission of messages from one acceleration component to another.
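For illustration only, the two-phase distribution described above may be sketched in C++ as follows. Message, TorGroup, and the two send helpers are hypothetical stand-ins for the LTL point-to-point transmission (Layer 3) and the Layer 2 Ethernet broadcast; they are not the patent's implementation.

    #include <cstdint>
    #include <string>
    #include <vector>

    struct Message { std::vector<uint8_t> payload; };

    struct TorGroup {
        std::string representative_ip;     // one acceleration component behind the remote TOR switch
        std::vector<std::string> members;  // every acceleration component behind that TOR switch
    };

    void sendPointToPoint(const std::string& /*dst_ip*/, const Message& /*m*/) {
        // placeholder for an LTL Layer 3 (IP/UDP) point-to-point send
    }

    void broadcastWithinRack(const Message& /*m*/) {
        // placeholder for a Layer 2 Ethernet broadcast under the local TOR switch
    }

    // Phase 1 (originating component): one point-to-point message per remote TOR switch.
    void distribute(const std::vector<TorGroup>& remote_tors, const Message& m) {
        for (const auto& tor : remote_tors)
            sendPointToPoint(tor.representative_ip, m);
    }

    // Phase 2 (each receiving representative): rebroadcast locally, so the routers
    // above the TOR switches never need to provide multicast support.
    void onPointToPointReceived(const Message& m) {
        broadcastWithinRack(m);
    }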
  • FIG. 4 shows a diagram of an acceleration component 400 in accordance with one example.
  • Acceleration component 400 can be included in acceleration plane 106.
  • Components included in acceleration component 400 can be implemented on hardware resources (e.g., logic blocks and programmable interconnects) of acceleration component 400.
  • Acceleration component 400 may include application logic 406, soft shell 404 associated with a first set of resources and shell 402 associated with a second set of resources.
  • the resources associated with shell 402 may correspond to lower-level interface-related components that may generally remain the same across many different application scenarios.
  • the resources associated with soft shell 404 can remain the same across at least some different application scenarios.
  • the application logic 406 may be further conceptualized as including an application domain (e.g., a "role").
  • the application domain or role can represent a portion of functionality included in a composed service spread out over a plurality of acceleration components. Roles at each acceleration component in a group of acceleration components may be linked together to create a group that provides the service acceleration for the application domain.
  • the application domain hosts application logic 406 that performs service specific tasks (such as a portion of functionality for ranking documents, encrypting data, compressing data, facilitating computer vision, facilitating speech translation, machine learning, etc.).
  • Resources associated with soft shell 404 are generally less subject to change compared to the application resources, and the resources associated with shell 402 are less subject to change compared to the resources associated with soft shell 404 (although it is possible to change (reconfigure) any component of acceleration component 400).
  • application logic 406 interacts with the shell resources and soft shell resources in a manner analogous to the way a software-implemented application interacts with its underlying operating system resources. From an application development standpoint, the use of common shell resources and soft shell resources frees a developer from having to recreate these common components for each service.
  • shell resources may include bridge 408 for coupling acceleration component 400 to the network interface controller (via an NIC interface 410) and a local top-of-rack switch (via a TOR interface 412).
  • Bridge 408 also includes a data path that allows traffic from the NIC or TOR to flow into acceleration component 400, and traffic from the acceleration component 400 to flow out to the NIC or TOR.
  • bridge 408 may be composed of various FIFOs (414, 416) which buffer received packets, and various selectors and arbitration logic which route packets to their desired destinations.
  • a bypass control component 418, when activated, can control bridge 408 so that packets are transmitted between the NIC and TOR without further processing by the acceleration component 400.
  • Memory controller 420 governs interaction between the acceleration component 400 and local memory 422 (such as DRAM memory).
  • the memory controller 420 may perform error correction as part of its services.
  • Host interface 424 may provide functionality that enables acceleration component 400 to interact with a local host component (not shown).
  • the host interface 424 may use Peripheral Component Interconnect Express (PCIe), in conjunction with direct memory access (DMA), to exchange information with the local host component.
  • PCIe Peripheral Component Interconnect Express
  • DMA direct memory access
  • the outer shell may also include various other features 426, such as clock signal generators, status LEDs, error correction functionality, and so on.
  • Elastic router 428 may be used for routing messages between various internal components of the acceleration component 400, and between the acceleration component and external entities (e.g., via a transport component 430). Each such endpoint may be associated with a respective port.
  • elastic router 428 is coupled to memory controller 420, host interface 424, application logic 406, and transport component 430.
  • Transport component 430 may formulate packets for transmission to remote entities (such as other acceleration components), and receive packets from the remote entities (such as other acceleration components).
  • a 3-port switch 432, when activated, takes over the function of the bridge 408 by routing packets between the NIC and TOR, and between the NIC or TOR and a local port associated with the acceleration component 400.
  • Diagnostic recorder 434 may store information regarding operations performed by the router 428, transport component 430, and 3-port switch 432 in a circular buffer.
  • the information may include data about a packet's origin and destination IP addresses, host-specific data, or timestamps.
  • the log may be stored as part of a telemetry system (not shown) such that a technician may study the log to diagnose causes of failure or sub-optimal performance in the acceleration component 400.
  • a plurality of acceleration components like acceleration component 400 can be included in acceleration plane 106. Acceleration components can use different network topologies (instead of using network 120 for communication) to communicate with one another. In one aspect, acceleration components are connected directly to one another, such as, for example, in a two-dimensional torus. Although FIG. 4 shows a certain number of components of acceleration component 400 arranged in a certain manner, there could be more or fewer components arranged differently. In addition, various components of acceleration component 400 may be implemented using other technologies as well.
  • FIG. 5 shows a diagram of a 3-port switch 500 in accordance with one example (solid lines represent data paths, and dashed lines represent control signals).
  • 3-port switch 500 may provide features to prevent packets for acceleration components from being sent on to the host system. If the data network supports several lossless classes of traffic, 3-port switch 500 can be configured to provide sufficient support to buffer and pause incoming lossless flows to allow it to insert its own traffic into the network. To support that, 3-port switch 500 can be configured to distinguish lossless traffic classes (e.g., Remote Direct Memory Access (RDMA)) from lossy (e.g., TCP/IP) classes of flows. A field in a packet header can be used to identify which traffic class the packet belongs to.
  • Configuration memory 530 may be used to store any configuration files or data structures corresponding to 3-port switch 500.
  • 3-port switch 500 may have a first port 502 (host-side) to connect to a first MAC and a second port 504 (network-side) to connect to a second MAC.
  • a third local port may provide internal service to a transport component (e.g., transport component 430).
  • 3-port switch 500 may generally operate as a network switch, with some limitations. Specifically, 3-port switch 500 may be configured to pass packets received on the local port (e.g., Lightweight Transport Layer (LTL) packets) only to the second port 504 (not the first port 502). Similarly, 3-port switch 500 may be designed to not deliver packets from the first port 502 to the local port.
  • LTL Lightweight Transport Layer
  • 3-port switch 500 may have two packet buffers: one for the receiving (Rx) first port 502 and one for the receiving second port 504.
  • the packet buffers may be split into several regions. Each region may correspond to a packet traffic class. As packets arrive and are extracted from their frames (e.g., Ethernet frames), they may be classified by packet classifiers (e.g., packet classifier 520 and packet classifier 522) into one of the available packet classes (lossy, lossless, etc.) and written into a corresponding packet buffer. If no buffer space is available for an inbound packet, then the packet may be dropped.
  • an arbiter (e.g., arbiter 512 or arbiter 514) may select from among the available packets and may transmit the packet.
  • a priority flow control (PFC) insertion block (e.g., PFC insert 526 or PFC insert 528) may allow 3-port switch 500 to insert PFC frames between flow packets at the transmit half of either of the ports 502, 504.
  • 3-port switch 500 can handle a lossless traffic class as follows. All packets arriving on the receiving half of the first port 502 and on the receiving half of the second port 504 should eventually be transmitted on the corresponding transmit (Tx) halves of the ports. Packets may be store-and-forward routed. Priority flow control (PFC) can be implemented to avoid packet loss.
  • PFC Priority flow control
  • 3-port switch 500 may generate PFC messages and send them on the transmit parts of the first and second ports 502 and 504.
  • PFC messages are sent when a packet buffer fills up. When a buffer is full or about to be full, a PFC message is sent to the link partner requesting that traffic of that class be paused. PFC messages can also be received and acted on.
  • 3-port switch 500 may suspend sending packets on the transmit part of the port that received the control frame. Packets may be buffered internally until the buffers are full, at which point a PFC frame will be generated to the link partner.
  • 3-port switch 500 can handle a lossy traffic class as follows. Lossy traffic (everything not classified as lossless) may be forwarded on a best-effort basis. 3-port switch 500 may be free to drop packets if congestion is encountered.
  • Signs of congestion can be detected in packets or flows before the packets traverse a host and its network stack. For example, if a congestion marker is detected in a packet on its way to the host, the transport component can quickly stop or start the flow, increase, or decrease available bandwidth, throttle other flows/connections, etc., before effects of congestion start to manifest at the host.
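For illustration only, the classification, buffering, and PFC behavior described above for the 3-port switch can be sketched in C++ as follows. The buffer sizes, pause threshold, and helper names are assumptions introduced for this example, not values from the patent.

    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <vector>

    enum class TrafficClass { Lossless, Lossy };   // e.g., RDMA vs. TCP/IP flows

    struct Packet {
        TrafficClass cls;                          // derived from a field in the packet header
        std::vector<uint8_t> bytes;
    };

    struct ClassBuffer {
        std::deque<Packet> q;
        size_t capacity = 64;                      // illustrative region size (packets)
        size_t pause_threshold = 56;               // send PFC before the region is completely full
        bool paused_upstream = false;
    };

    void sendPfcPause(TrafficClass /*cls*/) {
        // placeholder: emit a PFC frame asking the link partner to pause this class
    }

    // Returns false if the packet had to be dropped (acceptable only for lossy traffic).
    bool enqueue(ClassBuffer& buf, Packet p) {
        if (buf.q.size() >= buf.capacity) {
            // Lossy traffic is dropped on congestion; lossless traffic should never
            // reach this point because PFC pauses the sender first.
            return false;
        }
        const TrafficClass cls = p.cls;
        buf.q.push_back(std::move(p));
        if (cls == TrafficClass::Lossless && !buf.paused_upstream &&
            buf.q.size() >= buf.pause_threshold) {
            sendPfcPause(cls);
            buf.paused_upstream = true;
        }
        return true;
    }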
  • FIG. 6 shows a transport component 600 (an example corresponding to transport component 430 of FIG. 4 ) coupled to a 3-port switch 602 (an example corresponding to 3-port switch 432 of FIG. 4 ) and an elastic router 604 in accordance with one example.
  • Transport component 600 may be configured to act as an autonomous node on a network.
  • transport component 600 may be configured within an environment or a shell within which arbitrary processes or execution units can be instantiated.
  • the use of transport component 600 may be advantageous because of the proximity between the application logic and the network, and the removal of host-based burdens such as navigating a complex network stack, interrupt handling, and resource sharing.
  • applications or services using acceleration components with transport components, such as transport component 600, may be able to communicate with lower latencies and higher throughputs.
  • Transport component 600 may itself be an agent that generates and consumes network traffic for its own purposes.
  • Transport component 600 may be used to implement functionality associated with the mechanism or protocol for exchanging data, including transmitting or retransmitting messages.
  • transport component 600 may include transmit logic 610, receive logic 612, soft shell 614, connection management logic 616, configuration memory 618, transmit buffer 620, and receive buffer 622. These elements may operate to provide efficient and reliable communication among transport components that may be included as part of the acceleration components.
  • transport component 600 may be used to implement the functionality associated with the Lightweight Transport Layer (LTL). Consistent with this example of the LTL, transport component 600 may expose two main interfaces for the LTL: one for communication with 3-port switch 602 (e.g., a local network interface that may then connect to a network switch, such as a TOR switch) and the other for communication with elastic router 604 (e.g., an elastic router interface).
  • the local network interface (local_*) may contain a NetworkStream, Ready, and Valid for both Rx and Tx directions.
  • the elastic router interface (router_*) may expose a FIFO-like interface supporting multiple virtual channels and a credit-based flow control scheme.
  • Transport component 600 may be configured via a configuration data structure (struct) for runtime controllable parameters, and may output a status data structure (struct) for status monitoring by a host or other soft-shell logic.
  • Table 1 shows an example of the LTL top-level module interface.
    module LTL_Base (
        input  core_clk,
        input  core_reset,
        input  LTLConfiguration cfg,
        output LTLStatus status,
        output NetworkStream local_tx_out,
        output logic local_tx_empty_out,
        input  local_tx_rden_in,
        input  NetworkStream local_rx_in,
        input  local_rx_wren_in,
        output logic local_rx_full_out,
        input  RouterInterface router_in,
        input  router_valid_in,
        output RouterInterface router_out,
        output router_valid_out,
        output RouterCredit router_credit_out,
        input  router_credit_ack_in,
        input  RouterCredit router_credit_in,
        input  router_credit_ack_out,
        input  LTLRegAccess register_wrdata_in,
        input  register_ ...   // remainder of the port list truncated in the source
  • Table 2 shows example static parameters that may be set for an LTL instance at compile time. The values for these parameters are merely examples and additional or fewer parameters may be specified.
  • MAX_VIRTUAL_CHANNELS = 8
  • ER_PHITS_PER_FLIT = 4
  • MAX_ER_CREDITS = 256
  • EXTRA_SFQ_ENTRIES = 32
  • Runtime-configurable parameters (referenced below as cfg.* values and as Table 3) may include the following:
  • Src_port: The UDP source port used in all LTL messages.
  • Dst_port: The UDP destination port for all LTL messages.
  • DSCP: The DSCP value set in the IPv4 header of LTL messages; it controls the Traffic Class (TC) in which LTL packets are routed in the datacenter.
  • Throttle_credits_per_scrub: Number of cycles by which to reduce the per-flow inter-packet gap on each scrub of the connection table. This may effectively provide a measure of bandwidth to return to each flow per time-period. This may be used as part of congestion management.
  • Timeout_Period: Number of time-period counts to wait before timing out an unacknowledged packet and resending it.
  • Disable_timeouts: Disable timeout retries. When set to 1, flows may never "give up"; in other words, unacknowledged packets will be resent continually.
  • Disable_timeout_drops: Disable timeout drops that happen after 128 timeout retries.
  • Throttle_min: Minimum value of the throttling IPG.
  • Throttle_max: Maximum value of the throttling IPG.
  • Throttle_credit_multiple: Amount by which the throttling IPG is multiplied on timeouts, NACKs, and congestion events. This multiplier may also be used for decreasing/increasing the per-flow inter-packet gap when exponential backoff/comeback is used (see, for example, throttle_linear_backoff and throttle_exponential_comeback).
  • Xoff_period: Controls how long of a pause to insert before attempting to send subsequent messages when a remote receiver is issuing XOFF NACKs indicating that it is currently receiving traffic from multiple senders (e.g., has VC locking enabled).
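For illustration only, the runtime-configurable parameters listed above can be restated as a plain C++ structure. The struct name mirrors the LTLConfiguration input of the module interface, but the field widths and defaults are assumptions rather than the actual RTL definition.

    #include <cstdint>

    struct LTLConfiguration {
        uint16_t src_port;                    // UDP source port used in all LTL messages
        uint16_t dst_port;                    // UDP destination port for all LTL messages
        uint8_t  dscp;                        // DSCP value in the IPv4 header (selects traffic class)
        uint32_t throttle_credits_per_scrub;  // cycles removed from each flow's IPG per scrub
        uint32_t throttle_scrub_delay;        // minimum spacing between IPG increases for a flow
        uint32_t timeout_period;              // time-period counts before an unACKed packet is resent
        bool     disable_timeouts;            // if set, unacknowledged packets are resent forever
        bool     disable_timeout_drops;       // disable the drop that follows 128 timeout retries
        uint32_t throttle_min;                // minimum throttling IPG
        uint32_t throttle_max;                // maximum throttling IPG
        uint32_t throttle_credit_multiple;    // multiplier applied to the IPG on timeouts/NACKs/congestion
        uint32_t xoff_period;                 // pause length after an XOFF NACK from a busy receiver
    };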
  • all messages may be encapsulated within IPv4/UDP frames.
  • Table 5 below shows an example packet format for encapsulating messages in such frames.
  • the Group column shows the various groups of fields in the packet structure.
  • the Description column shows the fields corresponding to each group in the packet structure.
  • the Size column shows the size in bits of each field.
  • the Value column provides a value for the field and, as needed, provides example description of the relevant field.
  • Connection management logic 616 may provide a register interface to establish connections between transport components. Connection management logic 616, along with software (e.g., a soft shell), may set up the connections before data can be transmitted or received. In one example, there are two connection tables that may control the state of connections: the Send Connection Table (SCT) and the Receive Connection Table (RCT). Each of these tables may be stored as part of configuration memory 618 or some other memory associated with transport component 600. Each entry in the SCT, a Send Connection Table Entry (SCTE), may store the current sequence number of a packet and other connection state used to build packets, such as the destination MAC address. Requests arriving from elastic router 604 may be matched to an SCTE by comparing the destination IP address and the virtual channel fields provided by elastic router 604.
  • SCT Send Connection Table
  • RCT Receive Connection Table
  • the tuple {IP, VC} may be a unique key (in database terms) in the table. It may be possible to have two entries in the table with the same VC, for example, {IP: 10.0.0.1, VC: 0} and {IP: 10.0.0.2, VC: 0}. It may also be possible to have two entries with the same IP address and different VCs: {IP: 10.0.0.1, VC: 0} and {IP: 10.0.0.1, VC: 1}. However, two entries with the same {IP, VC} pair may not be allowed. The number of entries that LTL supports may be configured at compile time.
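For illustration only, the SCT lookup keyed by the {destination IP, virtual channel} pair can be sketched in C++ as follows. The entry layout, the use of std::map, and the lookup helper are assumptions; in hardware this is a fixed-size table whose depth is set at compile time.

    #include <cstdint>
    #include <map>
    #include <utility>

    struct SendConnectionEntry {
        uint32_t dst_ip;
        uint8_t  virtual_channel;
        uint32_t next_sequence_number = 1;   // the first packet on a connection uses 1
        uint64_t dst_mac;                    // destination MAC used when building packets
        uint16_t remote_rcti;                // receiver's RCT index, carried in the packet header
    };

    // Key: {IP, VC}. Two entries may share a VC, or share an IP, but not both.
    using SendConnectionTable =
        std::map<std::pair<uint32_t, uint8_t>, SendConnectionEntry>;

    // Returns nullptr if no SCT entry matches the {destination IP, VC} pair.
    SendConnectionEntry* lookup(SendConnectionTable& sct, uint32_t dst_ip, uint8_t vc) {
        auto it = sct.find({dst_ip, vc});
        return (it == sct.end()) ? nullptr : &it->second;
    }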
  • Elastic router 604 may move data in Flits, which may be 128B in size (32B x 4 cycles). Messages may be composed of multiple flits, demarcated by start and last flags. In one example, once elastic router 604 selects a flow to send from an input port to an output port for a given virtual channel, the entire message must be delivered before another message will start to arrive on the same virtual channel. Connection management logic 616 may need to packetize messages from elastic router 604 into pieces sized to the network's maximum transport unit (MTU). This may be done by buffering data on each virtual channel until one of the following conditions is met: (1) the last flag is seen in a flit, or (2) an MTU's worth of data (or an appropriately reduced size to fit headers and alignment requirements) has been buffered.
  • MTU maximum transport unit
  • the MTU for an LTL payload may be 1408 bytes.
  • transport component 600, via transmit logic 610, may attempt to send that packet.
  • Packet destinations may be determined through a combination of which virtual channel the message arrives on at transport component 600 input (from elastic router 604) and a message header that may arrive during the first cycle of the messages from elastic router 604. These two values may be used to index into the Send Connection Table, which may provide the destination IP address and sequence numbers for the connection.
  • each packet transmitted on a given connection should have a sequence number one greater than the previous packet for that connection. The only exception may be for retransmits, which may see a dropped or unacknowledged packet retransmitted with the same sequence number as it was originally sent with.
  • the first packet sent on a connection may have its Sequence Number set to 1. So, as an example, for a collection of flits arriving on various virtual channels (VCs) into transport component 600 from elastic router 604, data may be buffered using buffers (e.g., receive buffer 622) until the end of a message or an MTU's worth of data has been received, and then a packet may be output. As a further example, if 1500B is sent from elastic router 604 to at least one LTL instance associated with transport component 600 as a single message (e.g., multiple flits on the same VC with zero or one LAST flags), more than one packet may be generated, because the payload exceeds the 1408-byte MTU. In this example, the LTL instance may send messages as soon as it has buffered the data.
  • Transport component 600 may not have advance knowledge of how much data a message will contain.
  • an instance of LTL associated with transport component 600 may deliver arriving flits that match a given SCT entry, in-order, even in the face of drops and timeouts. Flits that match different SCT entries may have no ordering guarantees.
  • transport component 600 will output one credit for each virtual channel, and then one credit for each shared buffer. Credits will be returned after each flit, except for when a flit finishes an MTU buffer. This may happen if a last flag is received or when a flit contains the MTUth byte of a message. Credits consumed in this manner may be held by transport component 600 until the packet is acknowledged.
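For illustration only, the per-VC packetization rule described above (emit a packet when a LAST flag arrives or when an MTU's worth of payload has accumulated, taking the next sequence number each time) can be sketched in C++ as follows. The Flit type and the emit() callback are assumptions for this example.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    constexpr size_t kLtlPayloadMtu = 1408;   // example LTL payload MTU from the text

    struct Flit {
        std::vector<uint8_t> data;   // up to 128 bytes (32B x 4 cycles) from the elastic router
        bool last = false;           // LAST flag closing the message
    };

    struct VcPacketizer {
        std::vector<uint8_t> buffer;
        uint32_t next_seq = 1;       // the first packet on a connection uses sequence number 1

        // emit(payload, seq) stands in for handing a finished packet to the transmit logic.
        void push(const Flit& flit,
                  const std::function<void(const std::vector<uint8_t>&, uint32_t)>& emit) {
            for (uint8_t b : flit.data) {
                buffer.push_back(b);
                if (buffer.size() == kLtlPayloadMtu)      // condition (2): MTU's worth of data
                    flush(emit);
            }
            if (flit.last && !buffer.empty())             // condition (1): LAST flag seen
                flush(emit);
        }

        void flush(const std::function<void(const std::vector<uint8_t>&, uint32_t)>& emit) {
            emit(buffer, next_seq++);                     // e.g., a 1500B message yields 1408B + 92B packets
            buffer.clear();
        }
    };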
  • packets arriving from the network are matched to an RCT entry (RCTE) through a field in the packet header.
  • The RCTE stores the last sequence number and the virtual channel (VC) on which to output packets from transport component 600 to elastic router 604.
  • Multiple entries in the RCT can point to the same output virtual channel.
  • the number of entries that LTL supports may be configured at compile time.
  • transport component 600 may determine which entry in the Receive Connection Table (RCT) the packet pairs with. If no matching RCT entry exists, the packet may be dropped.
  • Transport component 600 may check that the sequence number matches the expected value from the RCT entry.
  • If the sequence number is greater than the value the RCT entry expects, the packet may be dropped. If the sequence number is less than the RCT entry expects, an acknowledgement (ACK) may be generated and the packet may be dropped. If it matches, the transport component may read the virtual channel field of the RCT entry. If the number of available elastic router (ER) credits for that virtual channel is sufficient to cover the packet size, transport component 600 may accept the packet. If there are insufficient credits, transport component 600 may drop the packet. Once the packet is accepted, an acknowledgement (ACK) may be generated and the RCT entry sequence number may be incremented. Elastic router 604 may use the packet header to determine the final endpoint that the message is destined for.
  • The transport component may need sufficient credits to be able to transfer a whole packet's worth of data into elastic router 604 to make forward progress.
  • transport component 600 may require elastic router 604 to provide dedicated credits for each VC to handle at least one MTU of data for each VC. In this example, no shared credits may be assumed.
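For illustration only, the receive-path decisions described above (RCT lookup, sequence-number check, and elastic-router credit check) can be sketched in C++ as follows. The data types and the RxAction outcome are assumptions introduced for the example.

    #include <cstdint>
    #include <unordered_map>

    struct ReceiveConnectionEntry {
        uint32_t expected_seq = 1;
        uint8_t  output_vc;          // elastic-router VC on which to deliver the payload
    };

    struct RxPacket {
        uint16_t rcti;               // index into the RCT, taken from the packet header
        uint32_t seq;
        uint32_t payload_flits;      // payload size expressed in ER flits
    };

    enum class RxAction { Accept, DropSilently, DropAndAck };

    RxAction onReceive(std::unordered_map<uint16_t, ReceiveConnectionEntry>& rct,
                       uint32_t er_credits[], const RxPacket& pkt) {
        auto it = rct.find(pkt.rcti);
        if (it == rct.end()) return RxAction::DropSilently;          // no matching RCT entry

        ReceiveConnectionEntry& e = it->second;
        if (pkt.seq > e.expected_seq) return RxAction::DropSilently; // ahead of expectation: drop, no ACK
        if (pkt.seq < e.expected_seq) return RxAction::DropAndAck;   // duplicate: drop, re-ACK highest seen

        if (er_credits[e.output_vc] < pkt.payload_flits)             // not enough room toward the ER
            return RxAction::DropSilently;                           // (a NACK may follow once credits return)

        er_credits[e.output_vc] -= pkt.payload_flits;
        ++e.expected_seq;                                            // accept: ACK and advance the sequence
        return RxAction::Accept;
    }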
  • SCT/RCT entries can be written by software.
  • software may keep a mirror of the connection setup.
  • the user may write to the register_wrdata_in port, which may be hooked to registers in the soft shell or environment corresponding to the application logic.
  • Table 6, below, is an example of the format of a data structure that can be used for updating entries in the SCT or the RCT.
  • The fields of this data structure include scte_not_rcte, the sCTI and rCTI index values, IPAddr, MacAddr, and VirtualChannel (the table rows are not fully reproduced here).
  • MacAddr may be set to the MAC address of a host on the same LAN segment as the acceleration component or the MAC address of the router for the remote hosts.
  • VirtualChannel may be set by looking it up from the flit that arrives from elastic router 604. To write to an RCT entry, one may set scte_not_rcte to 0, set rCTI value to the value of the index of the RCT that is being written to, and then set the other fields of the data structure in Table 6 appropriately.
  • rCTI may be set to the sending acceleration component's RCT entry.
  • IPAddr may be set to the sending acceleration component's IP address. MacAddr may be ignored for the purposes of writing to the RCT.
  • VirtualChannel may be set to the channel on which the message will be sent to elastic router 604.
  • As an example, to set up a connection between a node A (e.g., transport component A, 10.0.0.1) and a node B (e.g., transport component B, 10.0.0.2), the following entries may be created:
  • transport component A may create SCTE {sCTI: 1, rCTI: 4, IP: 10.0.0.2, VC: 1, Mac: 01-02-03-04-05-06}
  • transport component B may create RCTE {rCTI: 4, sCTI: 1, IP: 10.0.0.1, VC: 2}.
  • the packet header will have the rCTI field set to 4 (the rCTI value read from the SCT).
  • Transport component B will access its RCT entry 4, and learn that the message should be output on VC 2. It will also generate an ACK back to transport component A.
  • the sCTI field will have the value 1 (populated from the sCTI value read from the RCT).
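For illustration only, the connection setup walked through above can be expressed with a condensed C++ form of the register-write record described earlier. The struct layout, field widths, and the writeRegister() helper (standing in for software driving the register_wrdata_in port) are assumptions for this example.

    #include <cstdint>

    struct LTLRegAccess {
        bool     scte_not_rcte;    // 1 = write an SCT entry, 0 = write an RCT entry
        uint16_t scti;
        uint16_t rcti;
        uint32_t ip_addr;
        uint64_t mac_addr;
        uint8_t  virtual_channel;
    };

    void writeRegister(const LTLRegAccess& /*entry*/) {
        // placeholder for the actual register write into the soft shell
    }

    void setupExampleConnection() {
        // On transport component A (10.0.0.1):
        // SCTE {sCTI: 1, rCTI: 4, IP: 10.0.0.2, VC: 1, Mac: 01-02-03-04-05-06}
        writeRegister(LTLRegAccess{true, 1, 4, 0x0A000002u, 0x010203040506ULL, 1});

        // On transport component B (10.0.0.2):
        // RCTE {rCTI: 4, sCTI: 1, IP: 10.0.0.1, VC: 2}; MacAddr is ignored for RCT writes
        writeRegister(LTLRegAccess{false, 1, 4, 0x0A000001u, 0, 2});
    }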
  • An instance of LTL associated with transport component 600 may buffer all sent packets until it receives an acknowledgement (ACK) from the receiving acceleration component. If an ACK for a connection doesn't arrive within a configurable timeout period, the packet may be retransmitted. In this example, all unacknowledged packets, starting with the oldest, will be retransmitted. A drop of a packet belonging to a given SCT may not alter the behavior of any other connections; i.e., packets for other connections may not be retransmitted. Because the LTL instance may require a reliable communication channel and packets can occasionally go missing on the network, in one example, a timeout based retry mechanism may be used.
  • a packet may be retransmitted.
  • the timeout period may be set via a configuration parameter. Once a timeout occurs, transport component 600 may adjust the congestion inter-packet gap for that flow.
  • Transport component 600 may also provide congestion control. If an LTL instance transmits data to a receiver incapable of absorbing traffic at full line rate, the congestion control functionality may allow it to gracefully reduce the frequency of packets being sent to the destination node.
  • Each LTL connection may have an associated inter-packet gap (IPG) state that controls the minimum number of cycles between the transmission of packets in a flow.
  • the IPG may be set to 1, effectively allowing full use of any available bandwidth.
  • the delay may be multiplied by the cfg.throttle_credit_multiple parameter (see Table 3) or increased by the cfg.throttle_credits_per_scrub parameter (see Table 3), depending on whether linear or exponential backoff is selected.
  • Each ACK received may reduce the IPG by the cfg.throttle_credits_per_scrub parameter (see Table 3) or divide it by the cfg.throttle_credit_multiple parameter (see Table 3), depending on whether linear or exponential comeback is selected.
  • An LTL instance may not increase a flow's IPG more than once every predetermined time period; for example, not more than every 2 microseconds (in this example, this may be controlled by the cfg.throttle_scrub_delay parameter (see Table 3)).
  • transport component 600 may attempt retransmission 128 times. If, after 128 retries, the packet is still not acknowledged, the packet may be discarded and the buffer freed. Unless the disable_timeouts configuration bit is set, transport component 600 may also clear the SCTE for this connection to prevent further messages and packets from being transmitted. In this example, at this point, no data can be exchanged between the link partners since their sequence numbers will be out of sync. The connection would need to be re-established.
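For illustration only, the per-flow congestion control and retry behavior described above (widen the inter-packet gap on timeouts/NACKs/congestion, narrow it on ACKs, give up after 128 retries unless drops are disabled) can be sketched in C++ as follows. The default values and helper names are assumptions; the real behavior is governed by the cfg.* parameters.

    #include <algorithm>
    #include <cstdint>

    struct ThrottleConfig {                     // subset of the cfg.* parameters above
        uint32_t throttle_min = 1;
        uint32_t throttle_max = 1u << 20;
        uint32_t throttle_credit_multiple = 2;
        uint32_t throttle_credits_per_scrub = 16;
        bool     exponential_backoff = true;    // vs. linear backoff
        bool     exponential_comeback = false;  // vs. linear comeback
        bool     disable_timeout_drops = false;
    };

    struct FlowState {
        uint32_t ipg = 1;        // inter-packet gap in cycles; 1 = full use of available bandwidth
        uint32_t retries = 0;    // consecutive timeouts for the oldest unacknowledged packet
    };

    // Timeouts, NACKs, and congestion events widen the flow's inter-packet gap.
    void onCongestionEvent(FlowState& f, const ThrottleConfig& c) {
        uint64_t next = c.exponential_backoff
                            ? static_cast<uint64_t>(f.ipg) * c.throttle_credit_multiple
                            : static_cast<uint64_t>(f.ipg) + c.throttle_credits_per_scrub;
        f.ipg = static_cast<uint32_t>(std::min<uint64_t>(next, c.throttle_max));
    }

    // Each ACK narrows the gap again (linear decrement or exponential division).
    void onAck(FlowState& f, const ThrottleConfig& c) {
        uint32_t next = c.exponential_comeback
                            ? f.ipg / c.throttle_credit_multiple
                            : (f.ipg > c.throttle_credits_per_scrub
                                   ? f.ipg - c.throttle_credits_per_scrub
                                   : c.throttle_min);
        f.ipg = std::max(next, c.throttle_min);
        f.retries = 0;
    }

    // Returns true if the packet should be retransmitted, false if it is discarded
    // (after 128 retries, unless timeout drops are disabled).
    bool onTimeout(FlowState& f, const ThrottleConfig& c) {
        onCongestionEvent(f, c);
        ++f.retries;
        return c.disable_timeout_drops || f.retries < 128;
    }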
  • ACKs may include a sequence number that tells the sender the last packet that was successfully received and the SCTI the sender should credit the ACK to (this value may be stored in the ACK-generator's RCT).
  • the following rules may be used for generating ACKs: (1) if the RX Sequence Number matches the expected Sequence Number (in the RCT), an ACK is generated with the received sequence number; (2) if the RX Sequence Number is less than the expected Sequence Number, the packet is dropped, but an ACK with the highest received Sequence Number is generated (this may cover the case where a packet is sent twice (perhaps due to a timeout) but then received correctly); and (3) if the RX Sequence Number is greater than the expected Sequence Number, the packet is dropped and no ACK is generated.
  • An instance of an LTL associated with transport component 600 may generate NACKs under certain conditions. These may be packets flagged with both the ACK and NACK flag bits set. A NACK may be a request for the sender to retransmit a particular packet and all subsequent packets.
  • Two conditions may require transport component 600 to generate a NACK: (1) if a packet is dropped due to insufficient elastic router credits to accept the whole packet, transport component 600 may send a NACK once there are sufficient credits; or (2) if a packet is dropped because another sender currently holds the lock on a destination virtual channel, transport component 600 may send a NACK once the VC lock is released.
  • the receiver may maintain a side data structure per VC (VCLockQueue) that may keep track of which senders had their packets dropped because another message was being received on a specific VC.
  • This side data structure may be used to coordinate multiple senders through explicit retransmit requests (NACKs).
  • Once an instance of LTL associated with transport component 600 starts receiving a message on a specific VC, that VC is locked to that single sender until all packets of that message have been received. If another sender tries to send a packet on the same VC while it is locked or while there are not enough ER credits available, the packet will get dropped and the sender will be placed on the VCLockQueue. Once the lock is released or there are enough ER credits, the LTL instance will pop the VCLockQueue and send a retransmit request (NACK) to the next sender that was placed in the VCLockQueue.
  • NACK retransmit request
  • After being popped from the VCLockQueue, a sender may be given the highest priority for the next 200000 cycles (~1.15 ms). Packets from the other senders on the same VC will be dropped during these 200000 cycles. This may ensure that all senders that had packets dropped will eventually get a chance to send their message.
  • the receiver may place the sender (that had its packet dropped) on the VCLockQueue and send a NACK that also includes the XOFF flag, indicating that the sender should not try retransmitting for some time (dictated by the cfg.xoff_period parameter). If the receiver was out of ER credits the NACK may also include the Congestion flag.
  • a sender When a sender receives a NACK with the XOFF flag it may delay the next packet per the back-off period (e.g., the xoff_period). If the NACK does not include the congestion flag (i.e., the drop was not due to insufficient credits but due to VC locking), then the sender may make a note that VC locking is active for that flow.
  • Senders with VC locking enabled may need to make sure to slow down after finishing every message, because they know that packets of subsequent messages will get dropped since they are competing with other senders that will be receiving the VC lock next. However, in this example, senders will need to make sure to send the first packet of a subsequent message before slowing down (even though they know it will be dropped) to make sure that they get placed in the VCLockQueue.
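For illustration only, the receiver-side VC locking and VCLockQueue behavior described above can be modeled in C++ as follows. The Nack/VcState structures and the two handlers are assumptions introduced for the example; the 200000-cycle priority window comes from the text.

    #include <cstdint>
    #include <deque>
    #include <optional>

    struct Nack { uint32_t dst_ip; bool xoff; bool congestion; };

    struct VcState {
        std::optional<uint32_t> lock_owner_ip;     // sender whose message is currently being received
        std::optional<uint32_t> priority_owner_ip; // sender recently popped from the VCLockQueue
        uint64_t priority_until_cycle = 0;         // priority lasts ~200000 cycles (~1.15 ms)
        std::deque<uint32_t> vc_lock_queue;        // senders whose packets were dropped on this VC
    };

    // First packet of a message from sender_ip: accept (and lock the VC) or drop and NACK.
    std::optional<Nack> onFirstPacket(VcState& vc, uint32_t sender_ip,
                                      bool enough_er_credits, uint64_t now) {
        const bool locked_by_other = vc.lock_owner_ip && *vc.lock_owner_ip != sender_ip;
        const bool other_has_priority = vc.priority_owner_ip &&
                                        *vc.priority_owner_ip != sender_ip &&
                                        now < vc.priority_until_cycle;
        if (locked_by_other || other_has_priority || !enough_er_credits) {
            vc.vc_lock_queue.push_back(sender_ip);              // remember who to NACK later
            // XOFF tells the sender to back off for cfg.xoff_period; the Congestion
            // flag is added when the drop was due to missing ER credits.
            return Nack{sender_ip, /*xoff=*/true, /*congestion=*/!enough_er_credits};
        }
        vc.lock_owner_ip = sender_ip;                           // lock the VC to this sender
        return std::nullopt;                                    // accept; ACK path not shown
    }

    // Last packet of the locked message received: release the lock and, if anyone
    // is waiting, pop the queue and send a retransmit request (NACK) with priority.
    std::optional<Nack> onMessageComplete(VcState& vc, uint64_t now) {
        vc.lock_owner_ip.reset();
        if (vc.vc_lock_queue.empty()) return std::nullopt;
        uint32_t next = vc.vc_lock_queue.front();
        vc.vc_lock_queue.pop_front();
        vc.priority_owner_ip = next;
        vc.priority_until_cycle = now + 200000;                 // ~1.15 ms priority window
        return Nack{next, /*xoff=*/false, /*congestion=*/false};
    }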
  • FIG. 7 shows a flow chart 700 for a method for processing messages using transport components to provide a service in accordance with one example.
  • the application logic (e.g., application logic 608 of FIG. 6 ) may be divided up and mapped into multiple acceleration components' roles.
  • the application logic may be conceptualized as including an application domain (e.g., a "role").
  • the application domain or role can represent a portion of functionality included in a composed service spread out over a plurality of acceleration components. Roles at each acceleration component in a group of acceleration components may be linked together to create a group that provides the service acceleration for the application domain.
  • Each application domain may host application logic to perform service specific tasks (such as a portion of functionality for ranking documents, encrypting data, compressing data, facilitating computer vision, facilitating speech translation, machine learning, etc.).
  • Step 702 may include receiving a message from a host to perform a task corresponding to a service.
  • Step 704 may include forwarding the message to an acceleration component at the head of a multi-stage pipeline of acceleration components, where the acceleration components may be associated with a first switch.
  • When a request for a function associated with a service (for example, a ranking request) arrives from a host in the form of a PCI Express message, it may be forwarded to the head of a multi-stage pipeline of acceleration components.
  • the acceleration component that received the message may determine whether the message is for the acceleration component at the head. If the message is for the acceleration component at the head of a pipeline stage, then the acceleration component at the head of the pipeline stage may process the message at the acceleration component. As part of this step, an elastic router (e.g., elastic router 604 of FIG. 6 ) may forward the message directly to the role.
  • the LTL packet format (e.g., as shown in Table 6) may include a broadcast flag (e.g., Bit 3 under the flags header as shown in Table 6) and a retransmission flag (e.g., Bit 2 under the flags header as shown in Table 6). The broadcast flag may signal to an acceleration component that the message is intended for multiple acceleration components.
  • the retransmission flag may indicate to an acceleration component that retransmission of the message is requested.
  • the LTL packet format may include headers that list the IP addresses for the specific destination(s).
  • The acceleration component (e.g., using transport component 600) may transmit the point-to-point message to the other acceleration components. In one example, the point-to-point message is transmitted using a Layer 3 functionality.
  • When transport component 600 receives a packet (e.g., a part of the point-to-point message) with the broadcast flag set, it may process the destination list; if it contains more than just its own IP address, it will add the packet to its own transmit buffers, set up a bit-field to track receipt by each of the destinations and then start transmitting to each receiver one by one (without the broadcast or retransmit fields). Destination acceleration components may acknowledge (using ACK) each packet upon successful receipt and the sender will mark the appropriate bit in the bit-field. Once a packet is acknowledged by all destinations, the send buffer can be released.
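  • The following Python fragment is a simplified, software-level sketch of the fan-out behavior described in the preceding paragraph: the packet is held in a transmit buffer, a bit-field records which destinations have acknowledged, and the buffer may be released once every bit is set. The class name and method names are illustrative only and are not taken from the hardware implementation.
    # Illustrative sketch (not the hardware logic) of per-destination ACK tracking.
    class BroadcastSendState:
        def __init__(self, packet, destinations, my_ip):
            # Keep only remote destinations; the sender's own address needs no transmission.
            self.destinations = [ip for ip in destinations if ip != my_ip]
            self.packet = packet          # held in the transmit buffer until fully acknowledged
            self.acked = 0                # bit-field, one bit per destination
        def start(self, unicast_send):
            # Transmit to each receiver one by one, without the broadcast/retransmit flags.
            for ip in self.destinations:
                unicast_send(ip, self.packet)
        def on_ack(self, from_ip):
            # Mark the bit for the acknowledging destination; return True when the
            # transmit buffer can be released (all destinations have acknowledged).
            self.acked |= 1 << self.destinations.index(from_ip)
            return self.acked == (1 << len(self.destinations)) - 1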
  • each of the set of acceleration components may, using data-link layer functionality, broadcast that message to any other acceleration components associated with the respective TOR switch.
  • An example of data-link layer functionality may be Layer 2 Ethernet broadcast packets.
  • When transport component 600 receives a packet (e.g., a part of the point-to-point message) with the broadcast flag set, it may process the destination list; if it contains more than just its own IP address, it will add the packet to its own transmit buffers, set up a bit-field to track receipt by each of the destinations and then send a Layer 2 Ethernet broadcast (with the broadcast flag and the destination list) to all of the acceleration components that share a common TOR switch.
  • Destination acceleration components may acknowledge (using ACK) each packet upon successful receipt and the sender will mark the appropriate bit in the bit-field. Once a packet is acknowledged by all destinations, the send buffer can be released.
  • Although FIG. 7 shows a certain number of steps listed in a certain order, there could be fewer or more steps and such steps could be performed in a different order.
  • the acceleration components may be grouped together as part of a graph.
  • the grouped acceleration components need not be physically proximate to each other; instead, they could be associated with different parts of the data center and still be grouped together by linking them as part of an acceleration plane.
  • the graph may have a certain network topology depending upon which of the acceleration components associated with which of the TOR switches are coupled together to accelerate a service.
  • The network topology may be dynamically created based on configuration information received from a service manager for the service. The service manager may be higher-level software associated with the service.
  • the network topology may be dynamically adjusted based on at least one performance metric associated with the network (e.g., network 120) interconnecting the acceleration plane and a software plane including host components configured to execute instructions corresponding to the at least one service.
  • The service manager may use a telemetry service to monitor network performance.
  • The network performance metric may be selected substantially in real time, based at least on the requirements of the at least one service.
  • the at least one network performance metric may comprise latency, bandwidth, or any other performance metric specified by a service manager or application logic corresponding to the at least one service.
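  • Purely as a hedged illustration of the paragraphs above, the Python sketch below shows how a service manager might use a telemetry feed to decide, substantially in real time, whether the topology should be adjusted. The telemetry interface, the requirement keys, and the reconfigure() callback are all hypothetical names chosen for the sketch.
    # Hypothetical monitoring step; names and thresholds are illustrative assumptions.
    def adjust_topology_if_needed(telemetry, requirements, reconfigure):
        sample = telemetry.latest()  # e.g., {"latency_us": 40, "bandwidth_gbps": 32}
        if "latency_us" in requirements and sample["latency_us"] > requirements["latency_us"]:
            reconfigure(metric="latency")      # e.g., regroup components to reduce hops
        elif "bandwidth_gbps" in requirements and sample["bandwidth_gbps"] < requirements["bandwidth_gbps"]:
            reconfigure(metric="bandwidth")    # e.g., change the broadcast tree to spread load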
  • the acceleration components may broadcast messages using a tree-based transmission process including point-to-point links for acceleration components connected via Layer 3 and Layer 2 Ethernet broadcasts for the acceleration components that share a TOR switch.
  • the tree may be two-level or may have more levels depending upon bandwidth limitations imposed by the network interconnecting the acceleration components.
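  • As a minimal sketch of the tree-based transmission process described above, the Python fragment below groups destination acceleration components by their TOR switch: components behind the sender's own TOR switch are reached with a single Layer 2 broadcast, while one representative behind every other TOR switch receives a Layer 3 point-to-point copy and then broadcasts locally. The function name and the tor_of mapping are assumptions made for the sketch; the document does not specify how TOR membership is looked up.
    # Hypothetical two-level broadcast-tree construction grouped by TOR switch.
    from collections import defaultdict
    def build_broadcast_tree(destination_ips, tor_of, sender_tor):
        groups = defaultdict(list)
        for ip in destination_ips:
            groups[tor_of[ip]].append(ip)
        plan = {"layer2_local": groups.pop(sender_tor, []),  # sender's own TOR broadcast
                "layer3_targets": {}}
        for tor, ips in groups.items():
            head = ips[0]                       # receives the Layer 3 point-to-point copy
            plan["layer3_targets"][head] = ips  # full per-TOR destination list carried in the packet
        return plan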
  • FIG. 8 shows a flow chart for a method for transmitting messages in accordance with one example.
  • a first acceleration component associated with a first TOR switch may receive a message from a host.
  • An acceleration component may include a transport component to handle messaging (e.g., transport component 430 of FIG. 4, which is further described with respect to transport component 600 of FIG. 6).
  • the first acceleration component may transmit the message to a second acceleration component associated with a second TOR switch, different from the first TOR switch.
  • A network layer functionality (e.g., Layer 3 functionality) may be used for this transmission.
  • the first acceleration component may transmit the message to a third acceleration component associated with a third TOR switch, different from the first TOR switch and the second TOR switch.
  • Transport component 600 may process the destination list; if it contains more than just its own IP address, it will add the packet to its own transmit buffers, set up a bit-field to track receipt by each of the destinations and then start transmitting to each receiver one by one (without the broadcast or retransmit fields).
  • Destination acceleration components may acknowledge (using ACK) each packet upon successful receipt and the sender will mark the appropriate bit in the bit-field. Once a packet is acknowledged by all destinations, the send buffer can be released.
  • the second acceleration component may broadcast the message to the other acceleration components associated with the second TOR switch.
  • Transport component 600 may process the destination list; if it contains more than just its own IP address, it will add the packet to its own transmit buffers, set up a bit-field to track receipt by each of the destinations and then send a Layer 2 Ethernet broadcast (with the broadcast flag and the destination list) to all of the acceleration components that share a common TOR switch.
  • Destination acceleration components may acknowledge (using ACK) each packet upon successful receipt and the sender will mark the appropriate bit in the bit-field. Once a packet is acknowledged by all destinations, the send buffer can be released.
  • the third acceleration component may broadcast the message to the other acceleration components associated with the third TOR switch.
  • Transport component 600 may process the destination list; if it contains more than just its own IP address, it will add the packet to its own transmit buffers, set up a bit-field to track receipt by each of the destinations and then send a Layer 2 Ethernet broadcast (with the broadcast flag and the destination list) to all of the acceleration components that share a common TOR switch.
  • Destination acceleration components may acknowledge (using ACK) each packet upon successful receipt and the sender will mark the appropriate bit in the bit-field. Once a packet is acknowledged by all destinations, the send buffer can be released.
  • Although FIG. 8 shows a certain number of steps listed in a certain order, there could be fewer or more steps and such steps could be performed in a different order.
  • a system comprising a software plane including a plurality of host components configured to execute instructions corresponding to at least one service and an acceleration plane including a plurality of acceleration components configurable to accelerate the at least one service.
  • The system may further include a network configured to interconnect the software plane and the acceleration plane, the network may include a first top-of-rack (TOR) switch associated with a first subset of the plurality of acceleration components, a second TOR switch associated with a second subset of the plurality of acceleration components, and a third TOR switch associated with a third subset of the plurality of acceleration components, where any of the first subset of the plurality of acceleration components is configurable to transmit a point-to-point message to any of the second subset of the plurality of acceleration components and to transmit the point-to-point message to any of the third subset of the plurality of acceleration components, and where any of the second subset of the plurality of acceleration components is configurable to broadcast the point-to-point message to all of the second subset of the plurality of acceleration components, and where any of the third subset of the plurality of acceleration components is configurable to broadcast the point-to-point message to all of the third subset of the plurality of acceleration components.
  • the point-to-point message may be transmitted using a Layer 3 functionality.
  • the point-to-point message may be broadcasted using a Layer 2 Ethernet broadcast functionality.
  • the point-to-point message may be broadcasted without relying upon any broadcast support from higher layers of the network than a Layer 2 of the network or any multicast support from the higher layers of the network than the Layer 2 of the network.
  • Any of the first subset of the plurality of acceleration components may be configured to dynamically create a network topology comprising any of the second subset of the plurality of the acceleration components and the third subset of the plurality of the acceleration components.
  • Any of the first subset of the plurality of acceleration components may further be configured to dynamically adjust the network topology based on at least one performance metric associated with the network configured to interconnect the software plane and the acceleration plane.
  • The at least one network performance metric may be selected substantially in real-time by any of the first subset of the plurality of acceleration components based at least on requirements of the at least one service.
  • the at least one network performance metric may comprise latency, bandwidth, or any other performance metric specified by an application logic corresponding to the at least one service.
  • the present disclosure relates to a method for allowing a first acceleration component among a first plurality of acceleration components, associated with a first top-of-rack (TOR) switch, to transmit messages to other acceleration components in an acceleration plane configurable to provide service acceleration for at least one service.
  • The method may include the first acceleration component transmitting a point-to-point message to a second acceleration component, associated with a second TOR switch different from the first TOR switch, and to a third acceleration component, associated with a third TOR switch different from the first TOR switch and the second TOR switch.
  • the method may further include the second acceleration component broadcasting the point-to-point message to all of a second plurality of acceleration components associated with the second TOR switch.
  • the method may further include the third acceleration component broadcasting the point-to-point message to all of a third plurality of acceleration components associated with the third TOR switch.
  • the first acceleration component may transmit the point-to-point message using a Layer 3 functionality.
  • Each of the second acceleration component and the third acceleration component may broadcast the point-to-point message using a Layer 2 Ethernet broadcast functionality.
  • the method may further include dynamically adjusting the network topology based on at least one performance metric associated with a network interconnecting the acceleration plane and a software plane including a plurality of host components configured to execute instructions corresponding to the at least one service.
  • The at least one network performance metric may be selected substantially in real-time by any of the first subset of the plurality of acceleration components based at least on requirements of the at least one service.
  • the at least one network performance metric may comprise latency, bandwidth, or any other performance metric specified by an application logic corresponding to the at least one service.
  • the present disclosure relates to an acceleration component for use among a first plurality of acceleration components, associated with a first top-of-rack (TOR) switch, to transmit messages to other acceleration components in an acceleration plane configurable to provide service acceleration for at least one service.
  • The acceleration component may include a transport component configured to transmit a first point-to-point message to a second acceleration component, associated with a second TOR switch different from the first TOR switch, and to a third acceleration component, associated with a third TOR switch different from the first TOR switch and the second TOR switch.
  • the transport component may further be configured to broadcast a second point-to-point message to all of a second plurality of acceleration components associated with the second TOR switch and to all of a third plurality of acceleration components associated with the third TOR switch.
  • the point-to-point message may be transmitted using a Layer 3 functionality.
  • the point-to-point message may be broadcasted using a Layer 2 Ethernet broadcast functionality.
  • the acceleration component may further be configured to broadcast the second point-to-point message in a network interconnecting acceleration components in the acceleration plane without relying upon any support for broadcasting or multicasting from higher layers of the network than a Layer 2 of the network.
  • any arrangement of components to achieve the same functionality is effectively "associated" such that the desired functionality is achieved.
  • any two components herein combined to achieve a particular functionality can be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or inter-medial components.
  • any two components so associated can also be viewed as being “operably connected,” or “coupled,” to each other to achieve the desired functionality.
  • non-transitory media refers to any media storing data and/or instructions that cause a machine to operate in a specific manner.
  • exemplary non-transitory media include non-volatile media and/or volatile media.
  • Non-volatile media include, for example, a hard disk, a solid state drive, a magnetic disk or tape, an optical disk or tape, a flash memory, an EPROM, NVRAM, PRAM, or other such media, or networked versions of such media.
  • Volatile media include, for example, dynamic memory such as DRAM, SRAM, a cache, or other such media.
  • Non-transitory media is distinct from, but can be used in conjunction with, transmission media.
  • Transmission media is used for transferring data and/or instructions to or from a machine.
  • Exemplary transmission media include coaxial cables, fiber-optic cables, copper wires, and wireless media, such as radio waves.

Description

    BACKGROUND
  • Increasingly, users access applications offered via computing, networking, and storage resources located in a data center. These applications run in a distributed computing environment, which is sometimes referred to as the cloud computing environment. Computer servers in a data center are interconnected via a network and thus the applications running on the computer servers can communicate with each other via the network. In large data centers the communication of messages among the computer servers can include broadcasting or multicasting a message from a computer server to several other computer servers. Broadcasting or multicasting in such data centers can use a significant portion of the bandwidth available for the applications. That, in turn, can degrade the performance of these applications as they experience lower throughput and higher latency.
  • Thus, there is a need for methods and systems that alleviate at least some of these issues. US2007/0053356A1 describes a system for scheduling multirate multicast packets through an interconnection network having a plurality of input ports, a plurality of output ports, and a plurality of input queues, comprising multirate multicast packets with rate weight, at each input port is operated in nonblocking manner by scheduling corresponding to the packet rate weight, at most as many packets equal to the number of input queues from each input port to each output port. The scheduling is performed so that each multicast packet is fan-out split through not more than two interconnection networks and not more than two switching times. The system is operated at 100% throughput, work conserving, fair, and yet deterministically thereby never congesting the output ports. The system performs arbitration in only one iteration, with mathematical minimum speedup in the interconnection network. The system operates with absolutely no packet reordering issues, no internal buffering of packets in the interconnection network, and hence in a truly cut-through and distributed manner.
  • SUMMARY
  • According to aspects of the present invention there is provided a system and a method as defined in the accompanying claims.
  • In one example, the present disclosure relates to a system comprising a software plane including a plurality of host components configured to execute instructions corresponding to at least one service and an acceleration plane including a plurality of acceleration components configurable to accelerate the at least one service. The system may further include a network configured to interconnect the software plane and the acceleration plane, the network may include a first top-of-rack (TOR) switch associated with a first subset of the plurality of acceleration components, a second TOR switch associated with a second subset of the plurality of acceleration components, and a third TOR switch associated with a third subset of the plurality of acceleration components, where any of the first subset of the plurality of acceleration components is configurable to transmit a point-to-point message to any of the second subset of the plurality of acceleration components and to transmit the point-to-point message to any of the third subset of the plurality of acceleration components, and where any of the second subset of the plurality of acceleration components is configurable to broadcast the point-to-point message to all of the second subset of the plurality of acceleration components, and where any of the third subset of the plurality of acceleration components is configurable to broadcast the point-to-point message to all of the third subset of the plurality of acceleration components.
  • In another example, the present disclosure relates to a method for allowing a first acceleration component among a first plurality of acceleration components, associated with a first top-of-rack (TOR) switch, to transmit messages to other acceleration components in an acceleration plane configurable to provide service acceleration for at least one service. The method may include the first acceleration component transmitting a point-to-point message to a second acceleration component, associated with a second TOR switch different from the first TOR switch, and to a third acceleration component, associated with a third TOR switch different from the first TOR switch and the second TOR switch. The method may further include the second acceleration component broadcasting the point-to-point message to all of a second plurality of acceleration components associated with the second TOR switch. The method may further include the third acceleration component broadcasting the point-to-point message to all of a third plurality of acceleration components associated with the third TOR switch.
  • In yet another example, the present disclosure relates to an acceleration component for use among a first plurality of acceleration components, associated with a first top-of-rack (TOR) switch, to transmit messages to other acceleration components in an acceleration plane configurable to provide service acceleration for at least one service. The acceleration component may include a transport component configured to transmit a first point-to-point message to a second acceleration component, associated with a second TOR switch different from the first TOR switch, and to a third acceleration component, associated with a third TOR switch different from the first TOR switch and the second TOR switch. The transport component may further be configured to broadcast a second point-to-point message to all of a second plurality of acceleration components associated with the second TOR switch and to all of a third plurality of acceleration components associated with the third TOR switch.
  • This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The present disclosure is illustrated by way of example and is not limited by the accompanying figures, in which like references indicate similar elements. Elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale.
    • FIG. 1 is a diagram of an architecture that may include a software plane and an acceleration plane in accordance with one example;
    • FIG. 2 shows a diagram of a system for transmission of messages by acceleration components configured to accelerate a service in accordance with one example;
    • FIG. 3 shows a diagram of a system for transmission of messages by acceleration components configured to accelerate a service in accordance with one example;
    • FIG. 4 shows a diagram of an acceleration component in accordance with one example;
    • FIG. 5 shows a diagram of a 3-port switch in accordance with one example;
    • FIG. 6 shows a diagram of a system for transmission of messages by an acceleration component configured to accelerate a service in accordance with one example;
    • FIG. 7 shows a flow chart of a method for transmission of messages by an acceleration component configured to accelerate a service in accordance with one example; and
    • FIG. 8 shows a flow chart of another method for transmission of messages by acceleration components configured to accelerate a service in accordance with one example.
    DETAILED DESCRIPTION
  • Examples described in this disclosure relate to methods and systems that provide for transmission of messages among acceleration components configurable to accelerate a service. An acceleration component includes, but is not limited to, a hardware component configurable (or configured) to perform a function corresponding to a service being offered by, for example, a data center more efficiently than software running on a general-purpose central processing unit (CPU). Acceleration components may include Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), Application Specific Integrated Circuits (ASICs), Erasable and/or Complex programmable logic devices (PLDs), Programmable Array Logic (PAL) devices, Generic Array Logic (GAL) devices, and massively parallel processor array (MPPA) devices. An image file may be used to configure or re-configure acceleration components such as FPGAs. Information included in an image file can be used to program hardware components of an acceleration component (e.g., logic blocks and reconfigurable interconnects of an FPGA) to implement desired functionality. Desired functionality can be implemented to support any service that can be offered via a combination of computing, networking, and storage resources such as via a data center or other infrastructure for delivering a service.
  • The described aspects can also be implemented in cloud computing environments. Cloud computing may refer to a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly. A cloud computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud computing model can also expose various service models, such as, for example, Software as a Service ("SaaS"), Platform as a Service ("PaaS"), and Infrastructure as a Service ("IaaS"). A cloud computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth.
  • A data center deployment may include a hardware acceleration plane and a software plane. The hardware acceleration plane can include a plurality of networked acceleration components (e.g., FPGAs). The software plane can include a plurality of networked software implemented host components (e.g., central processing units (CPUs)). A network infrastructure can be shared between the hardware acceleration plane and the software plane. In some environments, software-implemented host components are locally linked to corresponding acceleration components. Acceleration components may communicate with each other via a network protocol. To provide reliable service to a user of the service being offered via a data center, any communication mechanisms may be required to meet certain performance requirements, including reliability. In certain examples, the present disclosure provides for a lightweight transport layer for meeting such requirements. In one example, the acceleration components may communicate with each other via a Lightweight Transport Layer (LTL).
  • FIG. 1 shows architecture 100 that may include a software plane 104 and an acceleration plane 106 in accordance with one example. The software plane 104 may include a collection of software-driven host components (each denoted by the symbol "S") while the acceleration plane may include a collection of acceleration components (each denoted by the symbol "A"). In this example, each host component may correspond to a server computer that executes machine-readable instructions using one or more central processing units (CPUs). In one example, these instructions may correspond to a service, such as a text/image/video search service, a translation service, or any other service that may be configured to provide a user of a device a useful result. Each CPU may execute the instructions corresponding to the various components (e.g., software modules or libraries) of the service. Each acceleration component may include hardware logic for implementing functions, such as, for example, portions of services offered by a data center.
  • Acceleration plane 106 may be constructed using a heterogeneous or a homogeneous collection of acceleration components, including different types of acceleration components and/or the same type of acceleration components with different capabilities. For example, acceleration plane 106 may include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), other types of programmable hardware logic devices and so on. Acceleration plane 106 may provide a reconfigurable fabric of acceleration components.
  • A host component may generally be any compute component that may perform operations by using each of its CPU hardware threads to execute machine-readable instructions. An acceleration component may perform operations using several parallel logic elements to perform computational tasks. As an example, an FPGA may include several gate arrays that may be configured to perform certain computational tasks in parallel. Thus, an acceleration component can perform some operations in less time compared to a software-driven host component. In the context of the architecture 100, the "acceleration" reflects its potential for accelerating the functions that are performed by the host components.
  • In one example, architecture 100 may correspond to a data center environment that includes a large number of servers. The servers may correspond to the host components in software plane 104. In another example, architecture 100 may correspond to an enterprise system. In a further example, architecture 100 may correspond to a user device or appliance which uses at least one host component that has access to two or more acceleration components. Indeed, depending upon the requirements of a service other implementations for architecture 100 are also possible.
  • Network 120 may couple host components in software plane 104 to the other host components and couple acceleration components in acceleration plane 106 to other acceleration components. In this example, host components can use network 120 to interact with one another and acceleration components can use network 120 to interact with one another. Interaction among host components in software plane 104 may be independent of the interaction among acceleration components in acceleration plane 106. In this example, two or more acceleration components may communicate in a transparent manner relative to host components in software plane 104, outside the direction of the host components, and without the host components being "aware" of a particular interaction even taking place in acceleration plane 106.
  • Architecture 100 may use any of a variety of different protocols to facilitate communication among acceleration components over network 120 and can use any of a variety of different protocols to facilitate communication between host components over network 120. For example, architecture 100 can use Ethernet protocol to transmit Internet Protocol (IP) packets over network 120. In one implementation, each local host component in a server is given a single physical IP address. The local acceleration component in the same server may adopt the same IP address. The server can determine whether an incoming packet is destined for the local host component or destined for the local acceleration component in different ways. For example, packets that are destined for the local acceleration component can be formulated as UDP packets having a specific port; host-defined packets, on the other hand, may not be formulated in this way. In another example, packets belonging to acceleration plane 106 can be distinguished from packets belonging to software plane 104 based on the value of a status flag in each of the packets. In one example, architecture 100 can be viewed as two logical networks (software plane 104 and acceleration plane 106) that may share the same physical network communication links. Packets associated with the two logical networks may be distinguished from each other by their respective traffic classes.
  • In another aspect, each host component in the architecture 100 is coupled to at least one acceleration component in acceleration plane 106 through a local link. For example, a host component and acceleration component can be arranged together and maintained as a single serviceable unit (e.g., a server) within architecture 100. In this arrangement, the host component can be referred to as the "local" host component to distinguish it from other host components that are associated with other servers. Similarly, acceleration component(s) of a server can be referred to as the "local" acceleration component(s) to distinguish them from other acceleration components that are associated with other servers.
  • As depicted in architecture 100, host component 108 may be coupled to acceleration component 110 through local link 112 (e.g., a Peripheral Component Interconnect Express (PCIe) link). Thus, host component 108 may be a local host component from the perspective of acceleration component 110 and acceleration component 110 may be a local acceleration component from the perspective of host component 108. The local linking of host component 108 and acceleration component 110 can form part of a server. More generally, host components in software plane 104 can be locally coupled to acceleration components in acceleration plane 106 through many individual links collectively represented as a local A-to-local S coupling 114. In this example, a host component can interact directly with any locally linked acceleration components. A host component can initiate communication to a locally linked acceleration component to cause further communication among multiple acceleration components. For example, a host component can issue a request for a service (or portion thereof) where functionality for the service (or portion thereof) is composed across a group of one or more acceleration components in acceleration plane 106. A host component can also interact indirectly with other acceleration components in acceleration plane 106 to which the host component is not locally linked. For example, host component 108 can indirectly communicate with acceleration component 116 via acceleration component 110. In this example, acceleration component 110 communicates with acceleration component 116 via a link 118 of a network (e.g., network 120).
  • Acceleration components in acceleration plane 106 may advantageously be used to accelerate larger scale services robustly in a data center. Substantial portions of complex datacenter services can be mapped to acceleration components (e.g., FPGAs) by using low latency interconnects for computations spanning multiple acceleration components. Acceleration components can also be reconfigured as appropriate to provide different service functionality at different times. Although FIG. 1 shows a certain number of components of architecture 100 arranged in a certain manner, there could be more or fewer number of components arranged differently. In addition, various components of architecture 100 may be implemented using other technologies as well.
  • FIG. 2 shows a diagram of a system 200 for transmission or retransmission of messages by acceleration components configured to accelerate a service in accordance with one example. In one example, system 200 may be implemented as a rack of servers in a data center. Servers 204, 206, and 208 can be included in a rack in the data center. Each of servers 204, 206, and 208 can be coupled to top-of-rack (TOR) switch 210. Other racks, although not shown, may have a similar configuration. Server 204 may further include host component 212 including CPUs 214, 216, etc. Host component 212 along with host components from servers 206 and 208 can be included in software plane 104. Server 204 may also include acceleration component 218. Acceleration component 218 along with acceleration components from servers 206 and 208 can be included in acceleration plane 106.
  • Acceleration component 218 may be directly coupled to a host component 212 via local link 220 (e.g., a PCIe link). Thus, acceleration component 218 can view host component 212 as a local host component. Acceleration component 218 and host component 212 may also be indirectly coupled by way of network interface controller 222 (e.g., used to communicate across network infrastructure 120). In this example, server 204 can load images representing service functionality onto acceleration component 218.
  • Acceleration component 218 may also be coupled to TOR switch 210. Hence, in system 200, acceleration component 218 may represent the path through which host component 212 interacts with other components in the data center (including other host components and other acceleration components). System 200 allows acceleration components 218 to perform processing on packets that are received from (and/or sent to) TOR switch 210 (e.g., by performing encryption, compression, etc.), without burdening the CPU-based operations performed by host component 212. Although FIG. 2 shows a certain number of components of system 200 arranged in a certain manner, there could be more or fewer number of components arranged differently. In addition, various components of system 200 may be implemented using other technologies as well.
  • FIG. 3 shows a diagram of a system 300 for transmission or retransmission of messages by acceleration components configured to accelerate a service in accordance with one example. In this example, IP routing may be used for transmitting or receiving messages among TOR switches, including TOR Switch 1 302, TOR Switch 2 304, and TOR Switch N 306. Each server or server group may have a single "physical" IP address that may be provided by the network administrator. Thus, in this example, Server Group 1 320, Server Group 2 322, and Server Group N 324 may each include servers, where each of them may have a "physical" IP address. An acceleration component may use its server's physical IP address as its own address. To distinguish IP packets destined for the host from packets destined for an acceleration component, UDP packets, with a specific port to designate the acceleration component as the destination, may be used. An acceleration component may transmit a message to a selected set of acceleration components associated with different TOR switches using Layer 3 functionality corresponding to the seven-layer open-systems interconnection (OSI) model. Layer 3 functionality may be similar to that provided by the network layer of the OSI model. In this example, an acceleration component may transmit a point-to-point message to each of the other relevant acceleration components associated with respective TOR switches. Those acceleration components may then use a Layer 2 Ethernet broadcast packet to send the data to all of the acceleration components associated with the TOR switch. Layer 2 functionality may be similar to that provided by the data-link layer of the OSI model. Layer 2 functionality may include media access control, flow control, and error checking. In one example, this step will not require any broadcasting support from a network interconnecting the acceleration plane and the software plane. This may advantageously alleviate the need for multicasting functionality provided by the routers or other network infrastructure. This, in turn, may reduce the complexity of deploying and managing acceleration components. In addition, in general, the higher levels of the network (e.g., the network including routers and other TOR switches) may be oversubscribed, which, in turn, may lower the bandwidth available to acceleration components communicating over those higher levels of the network. In contrast, in this example, the acceleration components that share a TOR switch may advantageously have a higher bandwidth available to them for any transmission of messages from one acceleration component to another.
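  • The port-based separation of host traffic from acceleration traffic described above can be illustrated with a small sketch. The port number below is a placeholder chosen for the example and is not a value specified in this disclosure.
    # Illustrative classification of an incoming packet on a server that shares one
    # physical IP address between its host and its local acceleration component.
    ACCEL_UDP_PORT = 51000  # hypothetical UDP port designating the acceleration component
    def is_for_acceleration_component(ip_protocol, udp_dst_port):
        # Acceleration-bound traffic is UDP (IP protocol 17) sent to the designated port;
        # everything else is treated as host traffic.
        return ip_protocol == 17 and udp_dst_port == ACCEL_UDP_PORT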
  • FIG. 4 shows a diagram of an acceleration component 400 in accordance with one example. Acceleration component 400 can be included in acceleration plane 106. Components included in acceleration component 400 can be implemented on hardware resources (e.g., logic blocks and programmable interconnects) of acceleration component 400.
  • Acceleration component 400 may include application logic 406, soft shell 404 associated with a first set of resources and shell 402 associated with a second set of resources. The resources associated with shell 402 may correspond to lower-level interface-related components that may generally remain the same across many different application scenarios. The resources associated with soft shell 404 can remain the same across at least some different application scenarios. The application logic 406 may be further conceptualized as including an application domain (e.g., a "role"). The application domain or role can represent a portion of functionality included in a composed service spread out over a plurality of acceleration components. Roles at each acceleration component in a group of acceleration components may be linked together to create a group that provides the service acceleration for the application domain.
  • The application domain hosts application logic 406 that performs service specific tasks (such as a portion of functionality for ranking documents, encrypting data, compressing data, facilitating computer vision, facilitating speech translation, machine learning, etc.). Resources associated with soft shell 404 are generally less subject to change compared to the application resources, and the resources associated with shell 402 are less subject to change compared to the resources associated with soft shell 404 (although it is possible to change (reconfigure) any component of acceleration component 400).
  • In operation, in this example, application logic 406 interacts with the shell resources and soft shell resources in a manner analogous to the way a software-implemented application interacts with its underlying operating system resources. From an application development standpoint, the use of common shell resources and soft shell resources frees a developer from having to recreate these common components for each service.
  • Referring first to the shell 402, shell resources may include bridge 408 for coupling acceleration component 400 to the network interface controller (via an NIC interface 410) and a local top-of-rack switch (via a TOR interface 412). Bridge 408 also includes a data path that allows traffic from the NIC or TOR to flow into acceleration component 400, and traffic from the acceleration component 400 to flow out to the NIC or TOR. Internally, bridge 408 may be composed of various FIFOs (414, 416) which buffer received packets, and various selectors and arbitration logic which route packets to their desired destinations. A bypass control component 418, when activated, can control bridge 408 so that packets are transmitted between the NIC and TOR without further processing by the acceleration component 400.
  • Memory controller 420 governs interaction between the acceleration component 400 and local memory 422 (such as DRAM memory). The memory controller 420 may perform error correction as part of its services.
  • Host interface 424 may provide functionality that enables acceleration component 400 to interact with a local host component (not shown). In one implementation, the host interface 424 may use Peripheral Component Interconnect Express (PCIe), in conjunction with direct memory access (DMA), to exchange information with the local host component. The outer shell may also include various other features 426, such as clock signal generators, status LEDs, error correction functionality, and so on.
  • Elastic router 428 may be used for routing messages between various internal components of the acceleration component 400, and between the acceleration component and external entities (e.g., via a transport component 430). Each such endpoint may be associated with a respective port. For example, elastic router 428 is coupled to memory controller 420, host interface 424, application logic 406, and transport component 430.
  • Transport component 430 may formulate packets for transmission to remote entities (such as other acceleration components), and receive packets from the remote entities (such as other acceleration components). In this example, a 3-port switch 432, when activated, takes over the function of the bridge 408 by routing packets between the NIC and TOR, and between the NIC or TOR and a local port associated with the acceleration component 400.
  • Diagnostic recorder 434 may store information regarding operations performed by the router 428, transport component 430, and 3-port switch 432 in a circular buffer. For example, the information may include data about a packet's origin and destination IP addresses, host-specific data, or timestamps. The log may be stored as part of a telemetry system (not shown) such that a technician may study the log to diagnose causes of failure or sub-optimal performance in the acceleration component 400.
  • A plurality of acceleration components like acceleration component 400 can be included in acceleration plane 106. Acceleration components can use different network topologies (instead of using network 120 for communication) to communicate with one another. In one aspect, acceleration components are connected directly to one another, such as, for example, in a two-dimensional torus. Although FIG. 4 shows a certain number of components of acceleration component 400 arranged in a certain manner, there could be more or fewer number of components arranged differently. In addition, various components of acceleration component 400 may be implemented using other technologies as well.
  • FIG. 5 shows a diagram of a 3-port switch 500 in accordance with one example (solid lines represent data paths, and dashed lines represent control signals). 3-port switch 500 may provide features to prevent packets for acceleration components from being sent on to the host system. If the data network supports several lossless classes of traffic, 3-port switch 500 can be configured to provide sufficient support to buffer and pause incoming lossless flows to allow it to insert its own traffic into the network. To support that, 3-port switch 500 can be configured to distinguish lossless traffic classes (e.g., Remote Direct Memory Access (RDMA)) from lossy (e.g., TCP/IP) classes of flows. A field in a packet header can be used to identify which traffic class the packet belongs to. Configuration memory 530 may be used to store any configuration files or data structures corresponding to 3-port switch 500.
  • 3-port switch 500 may have a first port 502 (host-side) to connect to a first MAC and a second port 504 (network-side) to connect to a second MAC. A third local port may provide internal service to a transport component (e.g., transport component 430). 3-port switch 500 may generally operate as a network switch, with some limitations. Specifically, 3-port switch 500 may be configured to pass packets received on the local port (e.g., Lightweight Transport Layer (LTL) packets) only to the second port 504 (not the first port 502). Similarly, 3-port switch 500 may be designed to not deliver packets from the first port 502 to the local port.
  • 3-port switch 500 may have two packet buffers: one for the receiving (Rx) first port 502 and one for the receiving second port 504. The packet buffers may be split into several regions. Each region may correspond to a packet traffic class. As packets arrive and are extracted from their frames (e.g., Ethernet frames), they may be classified by packet classifiers (e.g., packet classifier 520 and packet classifier 522) into one of the available packet classes (lossy, lossless, etc.) and written into a corresponding packet buffer. If no buffer space is available for an inbound packet, then the packet may be dropped. Once a packet is stored and ready to transmit, an arbiter (e.g., arbiter 512 or arbiter 514) may select from among the available packets and may transmit the packet. A priority flow control (PFC) insertion block (e.g. PFC insert 526 or PFC insert 528) may allow 3-port switch 500 to insert PFC frames between flow packets at the transmit half of either of the ports 502, 504.
  • 3-port switch 500 can handle a lossless traffic class as follows. All packets arriving on the receiving half of the first port 502 and on the receiving half of the second port 504 should eventually be transmitted on the corresponding transmit (Tx) halves of the ports. Packets may be store-and-forward routed. Priority flow control (PFC) can be implemented to avoid packet loss. For lossless traffic classes, 3-port switch 500 may generate PFC messages and send them on the transmit parts of the first and second ports 502 and 504. In one embodiment, PFC messages are sent when a packet buffer fills up. When a buffer is full or about to be full, a PFC message is sent to the link partner requesting that traffic of that class be paused. PFC messages can also be received and acted on. If a PFC control frame is received for a lossless traffic class on the receive part of either the first or second port (502 or 504), 3-port switch 500 may suspend sending packets on the transmit part of the port that received the control frame. Packets may buffer internally until the buffers are full, at which point a PFC frame will be generated to the link partner.
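  • A rough, software-level sketch of the lossless-class behavior described above follows: when a per-class receive buffer nears capacity, a PFC pause frame is sent to the link partner; when a PFC pause is received, transmission for that class is suspended. The class, the threshold, and the callbacks are assumptions for illustration and do not reflect the actual 3-port switch implementation.
    # Hypothetical per-traffic-class state for PFC-based lossless handling.
    class LosslessClassState:
        def __init__(self, capacity, pause_threshold):
            self.capacity = capacity
            self.pause_threshold = pause_threshold
            self.occupancy = 0
            self.tx_paused = False            # set when the link partner sent a PFC pause
        def on_packet_buffered(self, size, send_pfc_pause):
            self.occupancy += size
            if self.occupancy >= self.pause_threshold:
                send_pfc_pause()              # ask the link partner to pause this traffic class
        def on_pfc_frame(self, pause):
            self.tx_paused = pause            # pause or resume transmit for this class
        def may_transmit(self):
            return not self.tx_paused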
  • 3-port switch 500 can handle a lossy traffic class as follows. Lossy traffic (everything not classified as lossless) may be forwarded on a best-effort basis. 3-port switch 500 may be free to drop packets if congestion is encountered.
  • Signs of congestion can be detected in packets or flows before the packets traverse a host and its network stack. For example, if a congestion marker is detected in a packet on its way to the host, the transport component can quickly stop or start the flow, increase, or decrease available bandwidth, throttle other flows/connections, etc., before effects of congestion start to manifest at the host.
  • FIG. 6 shows a transport component 600 (an example corresponding to transport component 430 of FIG. 4) coupled to a 3-port switch 602 (an example corresponding to 3-port switch 432 of FIG. 4) and an elastic router 604 in accordance with one example. Transport component 600 may be configured to act as an autonomous node on a network. In one embodiment, transport component 600 may be configured within an environment or a shell within which arbitrary processes or execution units can be instantiated. The use of transport component 600 may be advantageous because of the proximity between the application logic and the network, and the removal of host-based burdens such as navigating a complex network stack, interrupt handling, and resource sharing. Thus, applications or services using acceleration components with transport components, such as transport component 600, may be able to communicate with lower latencies and higher throughputs. Transport component 600 may itself be an agent that generates and consumes network traffic for its own purposes.
  • Transport component 600 may be used to implement functionality associated with the mechanism or protocol for exchanging data, including transmitting or retransmitting messages. In this example, transport component 600 may include transmit logic 610, receive logic 612, soft shell 614, connection management logic 616, configuration memory 618, transmit buffer 620, and receive buffer 622. These elements may operate to provide efficient and reliable communication among transport components that may be included as part of the acceleration components.
  • In one example, transport component 600 may be used to implement the functionality associated with the Lightweight Transport Layer (LTL). Consistent with this example of the LTL, transport component 600 may expose two main interfaces for the LTL: one for communication with 3-port switch 602 (e.g., a local network interface that may then connect to a network switch, such as a TOR switch) and the other for communication with elastic router 604 (e.g., an elastic router interface). In this example, the local network interface (local_*) may contain a NetworkStream, Ready, and Valid for both Rx and Tx directions. In this example, the elastic router interface (router_*) may expose a FIFO-like interface supporting multiple virtual channels and a credit-based flow control scheme. Transport component 600 may be configured via a configuration data structure (struct) for runtime controllable parameters, and may output a status data structure (struct) for status monitoring by a host or other soft-shell logic. Table 1 below shows an example of the LTL top-level module interface.
    module LTLBase
    (
      input core_clk,
      input core_reset,
      input LTLConfiguration cfg,
      output LTLStatus status,
      output NetworkStream local_tx_out,
      output logic local_tx_empty_out,
      input local_tx_rden_in,
      input NetworkStream local_rx_in,
      input local_rx_wren_in,
      output logic local_rx_full_out,
      input RouterInterface router_in,
      input router_valid_in,
      output RouterInterface router_out,
      output router_valid_out,
      output RouterCredit router_credit_out,
      input router_credit_ack_in,
      input RouterCredit router_credit_in,
      output router_credit_ack_out,
      input LTLRegAccess register_wrdata_in,
      input register_write_in,
      output logic LTL_event_valid_out,
      output LTLEventQueueEntry LTL_event_data_out
    );
  • Table 2 below shows example static parameters that may be set for an LTL instance at compile time. The values for these parameters are merely examples and additional or fewer parameters may be specified.
    Parameter Name Configured Value
    MAX_VIRTUAL_CHANNELS 8
    ER_PHITS_PER_FLIT 4
    MAX_ER_CREDITS 256
    EXTRA_SFQ_ENTRIES 32
  • Thus, as noted above in Table 3, this will configure MAX_VIRTUAL_CHANNELS + EXTRA_SFQ_ENTRIES MTU-sized buffers for the LTL instance (for the example values in Table 2, 8 + 32 = 40 buffers). Elastic router credits (ER_CREDITS) may be issued with a guarantee of at least 1 credit for each virtual channel (VC), and a dynamically calculated number of extra credits. Transport component 600 may expose a configuration input port which sets a number of run-time values. This configuration port may be defined as part of the LTLConfiguration struct data structure. The fields for an example data structure are enumerated in the following table (Table 4):
    Field Name Example description
    Src_IP IPv4 Source Address.
    Src_MAC Ethernet MAC address used as the source of all LTL generated messages.
    Src_port The UDP source port used in all LTL messages.
    Dst_port The UDP destination port for all LTL messages.
    DSCP The DSCP value set in IPv4 header of LTL messages - controls which Traffic Class (TC) LTL packets are routed in the datacenter.
    Throttle_credits_per_scrub Number of cycles to reduce the per-flow inter-packet gap by on each scrub of the connection table. This may effectively provide a measure of bandwidth to return to each flow per time-period. This may be used as part of congestion management.
    Throttle_scrub_delay Cycles to delay starting the next credit scrubbing process.
    Timeout_Period Number of time-period counts to wait before timing out an unacknowledged packet and resending it.
    Disable_timeouts When set to 1, flows may never "give up"; in other words, unacknowledged packets will be resent continually.
    Throttle_min Minimum value of throttling IPG.
    Throttle_max Maximum value of throttling IPG.
    Throttle_credit_multiple Amount by which throttling IPG is multiplied on Timeouts, NACKs, and congestion events. This multiplier may also be used for decreasing/increasing the per-flow inter-packet gap when exponential backoff/comeback is used (see, for example, throttle_linear_backoff and throttle_exponential_comeback).
    Disable_timeouts Disable timeout retries.
    Disable_timeout_drops Disable timeout drops that happen after 128 timeout retries.
    Xoff_period Controls how long of a pause to insert before attempting to send subsequent messages when a remote receiver is receiving XOFF NACKs indicating that it is currently receiving traffic from multiple senders (e.g., has VC locking enabled).
    Credit_congest_threshold When delivering traffic to the ER, if a receiver has fewer than credit_congest_threshold credits, sends a congestion ACK so the sender slows down.
    throttle_slow_start_ipg Delays sending of a subsequent message when a remote receiver has indicated that it is receiving traffic from multiple senders (e.g., has VC locking enabled).
    throttle_linear_backoff Enables linear comeback (i.e., linear increase of inter-packet gap) instead of multiplicative/exponential.
    ltl_event_mask_enable Controls which messages to filter when posting LTL events to the LTL event queue.
    mid_message_timeout Controls how long a receiver should wait before draining half-received messages (e.g., when a sender fails mid-message).
  • The functionality corresponding to the fields, shown in Table 4, may be combined, or further separated. Certain fields could also be in a memory indexed by an address or a descriptor field in the LTLConfiguration struct data structure. Similarly, a special instruction may provide information related to any one of the fields in Table 4 or it may combine the information from such fields. Other changes could be made to the LTLConfiguration struct data structure and format without departing from the scope of this disclosure.
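  • For illustration only, the following Python sketch models a subset of the run-time fields from Table 4 as a plain software data structure. The field names follow Table 4, but the default values shown are arbitrary placeholders, and the real configuration port is a hardware register interface on the acceleration component rather than software.
      # Illustrative software model of a few LTLConfiguration fields (Table 4).
      # Field names follow Table 4; the defaults below are placeholders, not
      # values prescribed by the specification.
      from dataclasses import dataclass

      @dataclass
      class LTLConfiguration:
          src_ip: str = "10.0.0.1"            # IPv4 source address (placeholder)
          src_mac: str = "01-02-03-04-05-06"  # source MAC for LTL-generated messages
          src_port: int = 0                   # UDP source port (placeholder)
          dst_port: int = 0                   # UDP destination port (placeholder)
          dscp: int = 0                       # DSCP value set in the IPv4 header
          timeout_period: int = 0             # time-period counts before a retransmit
          disable_timeouts: bool = False      # if True, unacknowledged packets are resent continually
          throttle_min: int = 1               # minimum inter-packet gap (IPG)
          throttle_max: int = 1 << 20         # maximum IPG (placeholder bound)
          throttle_credit_multiple: int = 2   # multiplier for exponential backoff/comeback
          throttle_credits_per_scrub: int = 1 # linear IPG adjustment per scrub
          xoff_period: int = 0                # pause after an XOFF NACK (placeholder)

      cfg = LTLConfiguration(dscp=46, timeout_period=1000)
      print(cfg.dscp, cfg.throttle_credit_multiple)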
  • As part of LTL, in one example, all messages may be encapsulated within IPv4/UDP frames. Table 5 below shows an example packet format for encapsulating messages in such frames. The Group column shows the various groups of fields in the packet structure. The Description column shows the fields corresponding to each group in the packet structure. The Size column shows the size in bits of each field. The Value column provides a value for the field and, as needed, provides example description of the relevant field.
    Group: Ethernet Header
      destination MAC - SendConnections[sCTI].DstMac
      source MAC - Cfg.src_mac
    Group: IPv4
      Version - 0x4
      IHL - 0x5
      DSCP - Cfg.DSCP
      ECN - 0b01
      Total Length - Entire packet length in bytes
      Identification - 0x0000
      Flags - 0b000
      Fragment Offset - 0
      TTL - 0xFF
      Protocol - 0x11 (UDP)
      Header Checksum - IPv4 Checksum
      Source IP Address - Cfg.SrcIP
      Destination IP Address - SendConnections[sCTI].DstIP
    Group: UDP Header
      Source Port - Cfg.SrcPort
      Destination Port - Cfg.DestPort
      Length - Length of UDP header and data
    Group: LTL
      Flags - Bit 7: Last; Bit 6: ACK; Bit 5: Congestion; Bit 4: NACK; Bit 3: Broadcast; Bit 2: Retransmit; Bits 1-0: 0 (Reserved)
      CTI - Stores the connection table index the receiving node should look up (Receive CTI for non-ACKs, and Send CTI for ACKs)
      Sequence Number - The sequence number of this packet
      Length (bytes) - Length of the data payload in bytes
  • The functionality corresponding to the fields, shown in Table 5, may be combined, or further separated. Certain fields could also be in a memory indexed by an address or a descriptor field in the packet. Similarly, a special instruction may provide information related to any one of the fields in Table 5 or it may combine the information from such fields. Other changes could be made to the packet structure and format without departing from the scope of this disclosure.
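  • As an illustration of the packet layout in Table 5, the following Python sketch assembles an example frame. The Ethernet, IPv4, and UDP portions follow the standard layouts with the fixed values listed in Table 5; the widths of the LTL header fields (flags, CTI, sequence number, length), the EtherType of 0x0800, and the zero UDP checksum used below are not given in the table as reproduced here and are assumptions made purely for illustration.
      # Illustrative packet builder following Table 5. The IPv4/UDP layouts are the
      # standard ones; the 1-byte flags / 2-byte CTI / 4-byte sequence number /
      # 2-byte length widths for the LTL header are assumptions for illustration only.
      import struct

      def ipv4_checksum(header: bytes) -> int:
          total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
          total = (total & 0xFFFF) + (total >> 16)
          total = (total & 0xFFFF) + (total >> 16)
          return ~total & 0xFFFF

      def build_ltl_packet(cfg, dst_ip, dst_mac, rcti, seq, payload, flags=0x80):
          ltl = struct.pack("!BHIH", flags, rcti, seq, len(payload)) + payload  # assumed layout
          udp_len = 8 + len(ltl)
          udp = struct.pack("!HHHH", cfg["src_port"], cfg["dst_port"], udp_len, 0)  # checksum 0 (assumed)
          total_len = 20 + udp_len
          ver_ihl = (4 << 4) | 5                       # Version 0x4, IHL 0x5
          tos = (cfg["dscp"] << 2) | 0b01              # DSCP from cfg, ECN 0b01
          ip_wo_csum = struct.pack("!BBHHHBBH4s4s", ver_ihl, tos, total_len,
                                   0x0000, 0x0000,     # Identification, Flags + Fragment Offset
                                   0xFF, 0x11, 0,      # TTL 0xFF, Protocol 0x11 (UDP), checksum 0
                                   bytes(map(int, cfg["src_ip"].split("."))),
                                   bytes(map(int, dst_ip.split("."))))
          csum = ipv4_checksum(ip_wo_csum)
          ip = ip_wo_csum[:10] + struct.pack("!H", csum) + ip_wo_csum[12:]
          eth = (bytes.fromhex(dst_mac.replace("-", "")) +
                 bytes.fromhex(cfg["src_mac"].replace("-", "")) +
                 b"\x08\x00")                          # EtherType 0x0800 (IPv4) is an assumption
          return eth + ip + udp + ltl

      cfg = {"src_ip": "10.0.0.1", "src_mac": "01-02-03-04-05-06",
             "src_port": 51000, "dst_port": 51000, "dscp": 46}
      pkt = build_ltl_packet(cfg, "10.0.0.2", "0A-0B-0C-0D-0E-0F", rcti=4, seq=1, payload=b"hello")
      print(len(pkt))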
  • Connection management logic 616 may provide a register interface to establish connections between transport components. Connection management logic 616 along with software (e.g., a soft shell) may set up the connections before data can be transmitted or received. In one example, there are two connection tables that may control the state of connections, the Send Connection Table (SCT) and the Receive Connection Table (RCT). Each of these tables may be stored as part of configuration memory 618 or some other memory associated with transport component 600. Each entry in the SCT, a Send Connection Table Entry (SCTE), may store the current sequence number of a packet and other connection state used to build packets, such as the destination MAC address. Requests arriving from elastic router 604 may be matched to an SCTE by comparing the destination IP address and the virtual channel fields provided by elastic router 604. At most one connection may target a destination IP address and a VC pair. Thus, the tuple {IP, VC} may be a unique key (in database terms) in the table. It may be possible to have two entries in the table with the same VC - for example, {IP: 10.0.0.1, VC: 0} and {IP: 10.0.0.2, VC: 0}. It may also be possible to have two entries with the same IP address and different VCs: {IP: 10.0.0.1, VC: 0} and {IP: 10.0.0.1, VC: 1}. However, two entries with the same {IP, VC} pair may not be allowed. The number of entries that LTL supports may be configured at compile time.
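  • The following minimal Python sketch models the SCT keyed by the {destination IP, VC} tuple, enforcing the uniqueness constraint described above; the entry contents and table size are illustrative assumptions.
      # Illustrative model of the Send Connection Table (SCT): at most one entry
      # may target a given {destination IP, VC} pair, so that tuple is the lookup key.
      class SendConnectionTable:
          def __init__(self, max_entries=64):          # entry count is a compile-time choice
              self.max_entries = max_entries
              self.entries = {}                        # (dst_ip, vc) -> SCTE state

          def add(self, dst_ip, vc, dst_mac, rcti):
              key = (dst_ip, vc)
              if key in self.entries:
                  raise ValueError("an SCTE already targets this {IP, VC} pair")
              if len(self.entries) >= self.max_entries:
                  raise ValueError("SCT is full")
              self.entries[key] = {"seq": 1, "dst_mac": dst_mac, "rcti": rcti}

          def lookup(self, dst_ip, vc):
              return self.entries.get((dst_ip, vc))

      sct = SendConnectionTable()
      sct.add("10.0.0.1", 0, "01-02-03-04-05-06", rcti=4)
      sct.add("10.0.0.2", 0, "01-02-03-04-05-07", rcti=5)   # same VC, different IP: allowed
      sct.add("10.0.0.1", 1, "01-02-03-04-05-06", rcti=6)   # same IP, different VC: allowed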
  • Elastic router 604 may move data in Flits, which may be 128B in size (32B x 4 cycles). Messages may be composed of multiple flits, demarcated by start and last flags. In one example, once elastic router 604 selects a flow to send from an input port to an output port for a given virtual channel, the entire message must be delivered before another message will start to arrive on the same virtual channel. Connection management logic 616 may need to packetize messages from elastic router 604 into the network's maximum transport unit (MTU) sized pieces. This may be done by buffering data on each virtual channel until one of the following conditions is met: (1) the last flag is seen in a flit or (2) an MTU's worth of data (or an appropriately reduced size to fit headers and alignment requirements) has been buffered. In this implementation, the MTU for an LTL payload may be 1408 bytes. Once one of the requirements is met, transport component 600, via transmit logic 610, may attempt to send that packet. Packet destinations may be determined through a combination of which virtual channel the message arrives on at transport component 600 input (from elastic router 604) and a message header that may arrive during the first cycle of the message from elastic router 604. These two values may be used to index into the Send Connection Table, which may provide the destination IP address and sequence numbers for the connection. In this example, each packet transmitted on a given connection should have a sequence number one greater than the previous packet for that connection. The only exception may be for retransmits, which may see a dropped or unacknowledged packet retransmitted with the same sequence number as it was originally sent with. The first packet sent on a connection may have Sequence Number set to 1. So, as an example, for a collection of flits arriving on various virtual channels (VCs) into transport component 600 from elastic router 604, data may be buffered using buffers (e.g., receive buffer 622) until the end of a message or an MTU's worth of data has been received and then a packet may be output. So, as an example, if 1500B is sent from elastic router 604 to at least one LTL instance associated with transport component 600 as a single message (e.g., multiple flits on the same VC with zero or one LAST flags), at least one packet may be generated. In this example, the LTL instance may send messages as soon as it has buffered the data - i.e., it will NOT wait for an ACK of the first message before sending the next. There may be no maximum message size. The LTL instance may just keep chunking up a message into MTU-sized packets and transmit them as soon as an MTU's worth of data is ready. Similarly, in this example, there is no "message length" field in the packets anywhere - only a payload size for each packet. Transport component 600 may not have advance knowledge of how much data a message will contain. Preferably, an instance of LTL associated with transport component 600 may deliver arriving flits that match a given SCT entry, in-order, even in the face of drops and timeouts. Flits that match different SCT entries may have no ordering guarantees.
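  • A minimal software model of the packetization behavior described above is sketched below, assuming a 1408-byte LTL payload MTU and per-connection sequence numbers starting at 1; the flit framing details are abstracted away.
      # Illustrative per-VC packetization: buffer flit payloads until a LAST flag is
      # seen or an MTU's worth (1408 bytes here) of data has accumulated, then emit a
      # packet. Sequence numbering is per connection, starting at 1.
      LTL_MTU = 1408

      class Packetizer:
          def __init__(self):
              self.buffers = {}      # vc -> bytearray
              self.next_seq = {}     # connection key -> next sequence number

          def push_flit(self, vc, conn_key, data: bytes, last: bool):
              buf = self.buffers.setdefault(vc, bytearray())
              buf.extend(data)
              packets = []
              while len(buf) >= LTL_MTU:
                  packets.append(self._emit(conn_key, bytes(buf[:LTL_MTU])))
                  del buf[:LTL_MTU]
              if last and buf:
                  packets.append(self._emit(conn_key, bytes(buf)))
                  buf.clear()
              return packets

          def _emit(self, conn_key, payload):
              seq = self.next_seq.get(conn_key, 1)
              self.next_seq[conn_key] = seq + 1
              return {"seq": seq, "len": len(payload), "payload": payload}

      p = Packetizer()
      # A 1500-byte message on VC 0 yields a 1408-byte packet and a 92-byte packet.
      out = p.push_flit(0, ("10.0.0.2", 0), b"x" * 1500, last=True)
      print([pkt["len"] for pkt in out], [pkt["seq"] for pkt in out])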
  • In this example, transport component 600 will output one credit for each virtual channel, and then one credit for each shared buffer. Credits will be returned after each flit, except for when a flit finishes an MTU buffer. This may happen if a last flag is received or when a flit contains the MTUth byte of a message. Credits consumed in this manner may be held by transport component 600 until the packet is acknowledged.
  • In terms of the reception of the packets by an instance of LTL associated with transport component 600, in one example, packets arriving from the network are matched to an RCT entry (RCTE) through a field in the packet header. Each RCTE stores the last sequence number and which virtual channel (VC) to output packets from transport component 600 to elastic router 604 on. Multiple entries in the RCT can point to the same output virtual channel. The number of entries that LTL supports may be configured at compile time. When packets arrive on the local port from the Network Switch, transport component 600 may determine which entry in the Receive Connection Table (RCT) the packet pairs with. If no matching RCT entry exists, the packet may be dropped. Transport component 600 may check that the sequence number matches the expected value from the RCT entry. If the sequence number is greater than the RCT entry's expected value, the packet may be dropped. If the sequence number is less than the RCT entry expects, an acknowledgement (ACK) may be generated and the packet may be dropped. If it matches, transport component 600 may grab the virtual channel field of the RCT entry. If the number of available elastic router (ER) credits for that virtual channel is sufficient to cover the packet size, transport component 600 may accept the packet. If there are insufficient credits, transport component 600 may drop the packet. Once the packet is accepted, an acknowledgement (ACK) may be generated and the RCT entry sequence number may be incremented. Elastic router 604 may use the packet header to determine the final endpoint that the message is destined for. Transport component 600 may need sufficient credits to be able to transfer a whole packet's worth of data into elastic router 604 to make forward progress. To help ensure that all VCs can make progress, transport component 600 may require elastic router 604 to provide dedicated credits for each VC to handle at least one MTU of data for each VC. In this example, no shared credits may be assumed.
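  • The receive-side decisions described above may be summarized by the following Python sketch; the mapping of one elastic router credit to one 128-byte flit is an assumption used only to make the credit check concrete.
      # Illustrative receive-side handling: match the packet to an RCT entry, compare
      # sequence numbers, and check elastic-router credits before accepting.
      def handle_rx(rct, er_credits, pkt):
          entry = rct.get(pkt["rcti"])
          if entry is None:
              return "drop"                              # no matching RCT entry
          expected = entry["expected_seq"]
          if pkt["seq"] > expected:
              return "drop"                              # out of order: drop, no ACK
          if pkt["seq"] < expected:
              return "ack"                               # duplicate: drop payload, re-ACK
          vc = entry["vc"]
          if er_credits.get(vc, 0) * 128 < pkt["len"]:   # assumes one credit covers one 128B flit
              return "drop"                              # insufficient ER credits
          entry["expected_seq"] += 1                     # accept: forward on VC, bump sequence
          return "accept_and_ack"

      rct = {4: {"expected_seq": 1, "vc": 2}}
      print(handle_rx(rct, {2: 16}, {"rcti": 4, "seq": 1, "len": 1408}))  # accept_and_ack
      print(handle_rx(rct, {2: 16}, {"rcti": 4, "seq": 1, "len": 1408}))  # ack (duplicate)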
  • SCT/RCT entries can be written by software. In one example, software may keep a mirror of the connection setup. To update an SCT or an RCT entry, the user may write to the register_wrdata in port, which may be hooked to registers in the soft shell or environment corresponding to the application logic. Table 6, below, is an example of the format of a data structure that can be used for updating entries in the SCT or the RCT.
    Figure imgb0001 and Figure imgb0002 (Table 6, reproduced as images in the original publication: the format of the data structure used to update entries in the SCT or the RCT)
  • To write to an SCT entry, one may set scte_not_rcte to 1, set sCTI value to the value of the index for the SCT that is being written to, and then set the other fields of the data structure in Table 6 appropriately. With respect to timing, the value of register_write_in may be toggled high for at least one cycle. rCTI may be set to the remote acceleration component's RCT entry (in this example, rCTI is included in the UDP packets sent to that acceleration component and this is how the correct connection on the other end is looked up). IPAddr may be set to the destination acceleration component's IP address. MacAddr may be set to the MAC address of a host on the same LAN segment as the acceleration component or the MAC address of the router for the remote hosts. VirtualChannel may be set by looking it up from the flit that arrives from elastic router 604. To write to an RCT entry, one may set scte_not_rcte to 0, set rCTI value to the value of the index of the RCT that is being written to, and then set the other fields of the data structure in Table 6 appropriately. sCTI may be set to the sending acceleration component's SCT entry. IPAddr may be set to the sending acceleration component's IP address. MacAddr may be ignored for the purposes of writing to the RCT. VirtualChannel may be set to the channel on which the message will be sent to elastic router 604.
  • As an example, to establish a one-way link from a node A (e.g., transport component A (10.0.0.1)) to node B (e.g., transport component B (10.0.0.2)), one could: (1) on transport component A create SCTE {sCTI: 1, rCTI: 4, IP: 10.0.0.2, VC: 1, Mac:01-02-03-04-05-06}; and (2) on transport component B create RCTE {rCTI: 4, sCTI: 1, IP: 10.0.0.1, VC: 2}. In this example, this would take messages that arrive from an elastic router on transport component A with DestIP==10.0.0.2 and VC==1 and send them to transport component B in a packet. The packet header will have the rCTI field set to 4 (the rCTI value read from the SCT). Transport component B will access its RCT entry 4, and learn that the message should be output on VC 2. It will also generate an ACK back to transport component A. In this packet, the sCTI field will have the value 1 (populated from the sCTI value read from the RCT).
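  • The one-way link example above can be restated as the following Python snippet; the dictionaries simply mirror the SCTE created on transport component A and the RCTE created on transport component B, and the hardware register write sequence itself is not modeled.
      # Illustrative restatement of the one-way link example: an SCTE on transport
      # component A (10.0.0.1) and the matching RCTE on transport component B (10.0.0.2).
      node_a_sct = {1: {"rcti": 4, "dst_ip": "10.0.0.2", "vc": 1,
                        "dst_mac": "01-02-03-04-05-06", "seq": 1}}
      node_b_rct = {4: {"scti": 1, "src_ip": "10.0.0.1", "output_vc": 2, "expected_seq": 1}}

      # A message arriving at A's LTL with DestIP == 10.0.0.2 and VC == 1 matches
      # SCTE 1; the packet carries rCTI = 4, so B looks up RCT entry 4, outputs the
      # message on VC 2, and generates an ACK carrying sCTI = 1 back to A.
      scte = node_a_sct[1]
      rcte = node_b_rct[scte["rcti"]]
      print(rcte["output_vc"], rcte["scti"])   # 2 1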
  • An instance of LTL associated with transport component 600 may buffer all sent packets until it receives an acknowledgement (ACK) from the receiving acceleration component. If an ACK for a connection doesn't arrive within a configurable timeout period, the packet may be retransmitted. In this example, all unacknowledged packets starting with the oldest will be retransmitted. A drop of a packet belonging to a given SCT entry may not alter the behavior of any other connections - i.e., packets for other connections may not be retransmitted. Because the LTL instance may require a reliable communication channel and packets can occasionally go missing on the network, in one example, a timeout-based retry mechanism may be used.
  • If a packet does not receive an acknowledgement within a certain time-period, it may be retransmitted. The timeout period may be set via a configuration parameter. Once a timeout occurs, transport component 600 may adjust the congestion inter-packet gap for that flow.
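  • A simplified software model of this retransmission scheme is sketched below; the timeout value and the use of wall-clock time are placeholders for the cycle-based timeout_period configuration.
      # Illustrative retransmit bookkeeping: buffer each sent packet until it is
      # ACKed; once the oldest packet for a connection times out, resend all
      # unacknowledged packets for that connection, oldest first, without
      # disturbing other connections.
      import collections, time

      class RetransmitBuffer:
          def __init__(self, timeout_s=0.002):
              self.timeout_s = timeout_s
              self.unacked = collections.defaultdict(collections.OrderedDict)  # conn -> seq -> (t, pkt)

          def on_send(self, conn, seq, pkt):
              self.unacked[conn][seq] = (time.monotonic(), pkt)

          def on_ack(self, conn, acked_seq):
              for seq in [s for s in self.unacked[conn] if s <= acked_seq]:
                  del self.unacked[conn][seq]          # free buffers up to the ACKed sequence

          def due_for_retransmit(self, conn):
              now = time.monotonic()
              pending = self.unacked[conn]
              if pending:
                  oldest_seq = next(iter(pending))
                  if now - pending[oldest_seq][0] > self.timeout_s:
                      return [pkt for (_, pkt) in pending.values()]  # oldest first
              return []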
  • Transport component 600 may also provide congestion control. If an LTL instance transmits data to a receiver incapable of absorbing traffic at full line rate, the congestion control functionality may allow it to gracefully reduce the frequency of packets being sent to the destination node. Each LTL connection may have an associated inter-packet gap (IPG) state that controls the minimum number of cycles between the transmission of packets in a flow. At the creation of a new connection, the IPG may be set to 1, effectively allowing full use of any available bandwidth. If a timeout, ECN notification, or NACK occurs on a flow, the delay may be multiplied by the cfg.throttle_credit_multiple parameter (see Table 4) or increased by the cfg.throttle_credits_per_scrub parameter (see Table 4; depending on whether linear or exponential backoff is selected). Each ACK received may reduce the IPG by the cfg.throttle_credits_per_scrub parameter (see Table 4) or divide it by the cfg.throttle_credit_multiple parameter (see Table 4; depending on whether linear or exponential comeback is selected). An LTL instance may not increase a flow's IPG more than once every predetermined time period; for example, not more than every 2 microseconds (in this example, this may be controlled by the cfg.throttle_scrub_delay parameter (see Table 4)).
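  • The inter-packet gap adjustment can be illustrated with the following Python sketch, which applies either linear or exponential backoff and comeback bounded by throttle_min and throttle_max; the configuration values shown are arbitrary examples, and the scrub-delay pacing of IPG increases is omitted.
      # Illustrative per-flow inter-packet gap (IPG) adjustment. On a timeout, ECN
      # mark, or NACK the IPG grows (linearly or exponentially, per configuration);
      # each ACK shrinks it again, bounded by throttle_min and throttle_max.
      def on_congestion_event(ipg, cfg):
          if cfg["throttle_linear_backoff"]:
              ipg += cfg["throttle_credits_per_scrub"]
          else:
              ipg *= cfg["throttle_credit_multiple"]
          return min(ipg, cfg["throttle_max"])

      def on_ack(ipg, cfg):
          if cfg["throttle_exponential_comeback"]:
              ipg //= cfg["throttle_credit_multiple"]
          else:
              ipg -= cfg["throttle_credits_per_scrub"]
          return max(ipg, cfg["throttle_min"])

      cfg = {"throttle_min": 1, "throttle_max": 1 << 16,
             "throttle_credit_multiple": 2, "throttle_credits_per_scrub": 4,
             "throttle_linear_backoff": False, "throttle_exponential_comeback": True}
      ipg = 1
      for _ in range(3):
          ipg = on_congestion_event(ipg, cfg)   # 2, 4, 8
      ipg = on_ack(ipg, cfg)                    # back to 4
      print(ipg)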
  • Consistent with one example of the LTL, transport component 600 may attempt retransmission 128 times. If, after 128 retries, the packet is still not acknowledged, the packet may be discarded and the buffer freed. Unless the disable_timeouts configuration bit is set, transport component 600 may also clear the SCTE for this connection to prevent further messages and packets from being transmitted. In this example, at this point, no data can be exchanged between the link partners since their sequence numbers will be out of sync. The connection would need to be re-established.
  • When an LTL instance associated with transport component 600 successfully receives a packet, it will generate an acknowledgement (for example, an empty payload packet with the ACK flag bit set). Acknowledgements (ACKs) may include a sequence number that tells the sender the last packet that was successfully received and the SCTI the sender should credit the ACK to (this value may be stored in the ACK-generator's RCT). Per one example of the LTL, the following rules may be used for generating acks: (1) if the RX Sequence Number matches the expected Sequence Number (in RCT), an ACK is generated with the received sequence number; (2) if the RX Sequence Number is less than the expected Sequence Number, the packet is dropped, but an ACK with the highest received Sequence Number is generated (this may cover the case where a packet is sent twice (perhaps due to a timeout) but then received correctly); and (3) if the RX Sequence Number is greater than the expected Sequence Number, the packet is dropped and no ACK is generated.
  • An instance of an LTL associated with transport component 600 may generate NACKs under certain conditions. These may be packets flagged with both the ACK and NACK flag bits set. A NACK may be a request for the sender to retransmit a particular packet and all subsequent packets.
  • In one example, under two conditions, transport component 600 may require the generation of a NACK: (1) if a packet is dropped due to insufficient elastic router credits to accept the whole packet, transport component 600 may send a NACK once there are sufficient credits; or (2) if a packet is dropped because another sender currently holds the lock on a destination virtual channel, transport component 600 may send a NACK once the VC lock is released.
  • When an LTL endpoint is receiving traffic from multiple senders, the receiver may maintain a side data structure per VC (VCLockQueue) that may keep track of which senders had their packets dropped because another message was being received on a specific VC. This side data structure may be used to coordinate multiple senders through explicit retransmit requests (NACKs).
  • In one example, once an instance of LTL associated with transport component 600 starts receiving a message on a specific VC, that VC is locked to that single sender until all packets of that message have been received. If another sender tries to send a packet on the same VC while it is locked or while there are not enough ER credits available, the packet will get dropped and the sender will be placed on the VCLockQueue. Once the lock is released or there are enough ER credits, the LTL instance will pop the VCLockQueue and send a retransmit request (NACK) to the next sender that was placed in the VCLockQueue. After being popped from the VCLockQueue, a sender may be given the highest priority for the next 200000 cycles (~1.15 ms). Packets from the other senders on the same VC will be dropped during these 200000 cycles. This may ensure that all senders that had packets dropped will eventually get a chance to send their message.
  • When a receiver drops a sender's packet because another sender has the VC lock, the receiver may place the sender (that had its packet dropped) on the VCLockQueue and send a NACK that also includes the XOFF flag, indicating that the sender should not try retransmitting for some time (dictated by the cfg.xoff_period parameter). If the receiver was out of ER credits the NACK may also include the Congestion flag.
  • When a sender receives a NACK with the XOFF flag it may delay the next packet per the back-off period (e.g., the xoff_period). If the NACK does not include the congestion flag (i.e., the drop was not due to insufficient credits but due to VC locking), then the sender may make a note that VC locking is active for that flow. When a flow has VC locking enabled senders may need to make sure to slow down after finishing every message, because they know that packets of subsequent messages will get dropped since they are competing with other senders that will be receiving the VC lock next. However, in this example, senders will need to make sure to send the first packet of a subsequent message before slowing down (even though they know it will be dropped) to make sure that they get placed in the VCLockQueue.
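  • The VC locking behavior described in the preceding paragraphs can be modeled, in simplified form, as follows; the priority window is represented as an abstract cycle count rather than real hardware timing, and the XOFF/Congestion flag details are reduced to return values.
      # Illustrative receiver-side VC locking: while a message is being received on a
      # VC, packets from other senders on that VC are dropped and those senders are
      # queued; when the lock is released the next queued sender gets a retransmit
      # request (NACK) and a priority window (200000 cycles in the description).
      import collections

      class VcLock:
          PRIORITY_CYCLES = 200000

          def __init__(self):
              self.owner = None                 # sender currently holding the lock
              self.queue = collections.deque()  # senders whose packets were dropped
              self.priority = None              # (sender, cycles_remaining)

          def on_first_packet(self, sender):
              if self.owner is None and (self.priority is None or self.priority[0] == sender):
                  self.owner = sender
                  return "accept"
              if sender not in self.queue:
                  self.queue.append(sender)
              return "drop_and_nack_xoff"       # NACK with XOFF: back off for xoff_period

          def on_message_complete(self):
              self.owner = None
              if self.queue:
                  nxt = self.queue.popleft()
                  self.priority = (nxt, self.PRIORITY_CYCLES)
                  return ("send_retransmit_nack_to", nxt)
              return None

      lock = VcLock()
      print(lock.on_first_packet("A"))          # accept
      print(lock.on_first_packet("B"))          # drop_and_nack_xoff (B queued)
      print(lock.on_message_complete())         # ('send_retransmit_nack_to', 'B')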
  • FIG. 7 shows a flow chart 700 for a method for processing messages using transport components to provide a service in accordance with one example. In one example, the application logic (e.g., application logic 608 of FIG. 6) corresponding to the service, such as a search results ranking service, may be divided up and mapped into multiple acceleration components' roles. As described earlier, the application logic may be conceptualized as including an application domain (e.g., a "role"). The application domain or role can represent a portion of functionality included in a composed service spread out over a plurality of acceleration components. Roles at each acceleration component in a group of acceleration components may be linked together to create a group that provides the service acceleration for the application domain. Each application domain may host application logic to perform service-specific tasks (such as a portion of functionality for ranking documents, encrypting data, compressing data, facilitating computer vision, facilitating speech translation, machine learning, etc.). Step 702 may include receiving a message from a host to perform a task corresponding to a service. Step 704 may include forwarding the message to an acceleration component at the head of a multi-stage pipeline of acceleration components, where the acceleration components may be associated with a first switch. Thus, when a request for a function associated with a service, for example, a ranking request arrives from a host in the form of a PCI express message, it may be forwarded to the head of a multi-stage pipeline of acceleration components. In step 706, the acceleration component that received the message may determine whether the message is for the acceleration component at the head. If the message is for the acceleration component at the head of the pipeline, then that acceleration component may process the message. As part of this step, an elastic router (e.g., elastic router 604 of FIG. 6) may forward the message directly to the role. In one example, the LTL packet format (e.g., as shown in Table 5) may include a broadcast flag (e.g., Bit 3 under the flags header as shown in Table 5) and a retransmission flag (e.g., Bit 2 under the flags header as shown in Table 5). The broadcast flag may signal to an acceleration component that the message is intended for multiple acceleration components. The retransmission flag may indicate to an acceleration component that retransmission of the message is requested. In both cases, the LTL packet format may include headers that list the IP addresses for the specific destination(s). Thus, when an acceleration component receives a message with the broadcast flag set, in step 712, the acceleration component (e.g., using transport component 600) may transmit the message as a point-to-point message to a selected set of acceleration components, where each of them is associated with a different top-of-rack (TOR) switch. In one example, the point-to-point message is transmitted using a Layer 3 functionality.
In this example, when transport component 600 receives a packet (e.g., a part of the point-to-point message) with the broadcast flag set, it may process the destination list; if it contains more than just its own IP address, it will add the packet to its own transmit buffers, set up a bit-field to track receipt by each of the destinations and then start transmitting to each receiver one by one (without the broadcast or retransmit fields). Destination acceleration components may acknowledge (using ACK) each packet upon successful receipt and the sender will mark the appropriate bit in the bit-field. Once a packet is acknowledged by all destinations, the send buffer can be released.
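  • The sender-side bookkeeping for such a broadcast can be illustrated with the following Python sketch, which tracks per-destination acknowledgements in a bit-field and releases the send buffer only after every destination has ACKed; the data layout is an assumption for illustration.
      # Illustrative sender-side broadcast bookkeeping: when a packet carries a
      # destination list, keep it buffered, track each destination's ACK in a
      # bit-field, and free the buffer only once every destination has acknowledged.
      class BroadcastTracker:
          def __init__(self):
              self.pending = {}   # seq -> {"payload": ..., "dests": [...], "acked": bitmask}

          def send_broadcast(self, seq, payload, dest_ips):
              self.pending[seq] = {"payload": payload, "dests": list(dest_ips), "acked": 0}
              # transmit to each receiver one by one, without the broadcast/retransmit fields
              return [(ip, seq, payload) for ip in dest_ips]

          def on_ack(self, seq, from_ip):
              entry = self.pending.get(seq)
              if entry is None:
                  return False
              entry["acked"] |= 1 << entry["dests"].index(from_ip)
              if entry["acked"] == (1 << len(entry["dests"])) - 1:
                  del self.pending[seq]        # all destinations ACKed: release the buffer
                  return True
              return False

      bt = BroadcastTracker()
      bt.send_broadcast(1, b"msg", ["10.0.1.2", "10.0.2.2", "10.0.3.2"])
      bt.on_ack(1, "10.0.1.2"); bt.on_ack(1, "10.0.2.2")
      print(bt.on_ack(1, "10.0.3.2"))          # True: buffer released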
  • Next, in step 714, each of the set of acceleration components may, using data-link layer functionality, broadcast that message to any other acceleration components associated with the respective TOR switch. An example of data-link layer functionality may be Layer 2 Ethernet broadcast packets. In one example, when transport component 600 receives a packet (e.g., a part of the point-to-point message) with the broadcast flag set, it may process the destination list; if it contains more than just its own IP address, it will add the packet to its own transmit buffers, set up a bit-field to track receipt by each of the destinations and then send a Layer 2 Ethernet broadcast (with the broadcast flag and the destination list) to all of the acceleration components that share a common TOR switch. Destination acceleration components may acknowledge (using ACK) each packet upon successful receipt and the sender will mark the appropriate bit in the bit-field. Once a packet is acknowledged by all destinations, the send buffer can be released. Although FIG. 7 shows a certain number of steps listed in a certain order, there could be fewer or more steps and such steps could be performed in a different order.
  • The acceleration components may be grouped together as part of a graph. The grouped acceleration components need not be physically proximate to each other; instead, they could be associated with different parts of the data center and still be grouped together by linking them as part of an acceleration plane. In one example, the graph may have a certain network topology depending upon which of the acceleration components associated with which of the TOR switches are coupled together to accelerate a service. The network topology may be dynamically created based on configuration information received from a service manager for the service. The service manager may be higher-level software associated with the service. In one example, the network topology may be dynamically adjusted based on at least one performance metric associated with the network (e.g., network 120) interconnecting the acceleration plane and a software plane including host components configured to execute instructions corresponding to the at least one service. The service manager may use a telemetry service to monitor network performance. The network performance metric may be selected substantially in real time, based at least on the requirements of the at least one service. The at least one network performance metric may comprise latency, bandwidth, or any other performance metric specified by a service manager or application logic corresponding to the at least one service.
  • In one example, the acceleration components may broadcast messages using a tree-based transmission process including point-to-point links for acceleration components connected via Layer 3 and Layer 2 Ethernet broadcasts for the acceleration components that share a TOR switch. The tree may be two-level or may have more levels depending upon bandwidth limitations imposed by the network interconnecting the acceleration components.
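  • The two-level tree described above can be illustrated by the following Python sketch, which groups destinations by TOR switch, sends one Layer 3 point-to-point copy to a single relay per remote TOR, and lets each relay reach its rack peers with a Layer 2 broadcast; the TOR mapping is a stand-in for whatever topology information the service manager provides.
      # Illustrative two-level broadcast tree: group destination acceleration
      # components by their TOR switch, send one Layer 3 point-to-point copy to a
      # single relay per remote TOR, and let each relay fan out within its rack
      # using a Layer 2 Ethernet broadcast.
      from collections import defaultdict

      def plan_broadcast(sender_ip, dest_ips, tor_of):
          by_tor = defaultdict(list)
          for ip in dest_ips:
              by_tor[tor_of[ip]].append(ip)
          plan = []
          for tor, members in by_tor.items():
              if tor == tor_of[sender_ip]:
                  plan.append(("l2_broadcast_local", members))        # same rack: L2 only
              else:
                  relay, *_ = members
                  plan.append(("l3_point_to_point", relay))           # one copy per remote TOR
                  plan.append(("l2_broadcast_via", relay, members))   # relay fans out in-rack
          return plan

      tor_of = {"10.0.1.1": "tor1", "10.0.2.2": "tor2", "10.0.2.3": "tor2", "10.0.3.4": "tor3"}
      print(plan_broadcast("10.0.1.1", ["10.0.2.2", "10.0.2.3", "10.0.3.4"], tor_of))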
  • FIG. 8 shows a flow chart for a method for transmitting messages in accordance with one example. In step 802, a first acceleration component, associated with a first TOR switch, may receive a message from a host. As discussed earlier, an acceleration component may include a transport component to handle messaging (e.g., transport component 430 of FIG. 4, which is further described with respect to transport component 600 of FIG. 6). In step 804, using a network layer functionality (e.g., Layer 3 functionality), the first acceleration component may transmit the message to a second acceleration component associated with a second TOR switch, different from the first TOR switch. In step 806, using a network layer functionality (e.g., Layer 3 functionality), the first acceleration component may transmit the message to a third acceleration component associated with a third TOR switch, different from the first TOR switch and the second TOR switch. In this example, when transport component 600 receives a packet (e.g., a part of the received message) with the broadcast flag set, it may process the destination list; if it contains more than just its own IP address, it will add the packet to its own transmit buffers, set up a bit-field to track receipt by each of the destinations and then start transmitting to each receiver one by one (without the broadcast or retransmit fields). Destination acceleration components may acknowledge (using ACK) each packet upon successful receipt and the sender will mark the appropriate bit in the bit-field. Once a packet is acknowledged by all destinations, the send buffer can be released.
  • In step 808, the second acceleration component, using a data-link layer functionality (e.g., Layer 2 Ethernet broadcast functionality), may broadcast the message to the other acceleration components associated with the second TOR switch. In this example, when transport component 600 receives a packet (e.g., a part of the point-to-point message) with the broadcast flag set, it may process the destination list; if it contains more than just its own IP address, it will add the packet to its own transmit buffers, set up a bit-field to track receipt by each of the destinations and then send a Layer 2 Ethernet broadcast (with the broadcast flag and the destination list) to all of the acceleration components that share a common TOR switch. Destination acceleration components may acknowledge (using ACK) each packet upon successful receipt and the sender will mark the appropriate bit in the bit-field. Once a packet is acknowledged by all destinations, the send buffer can be released.
  • In step 810, the third acceleration component, using a data-link layer functionality (e.g., Layer 2 Ethernet broadcast functionality), may broadcast the message to the other acceleration components associated with the third TOR switch. In this example, when transport component 600 receives a packet (e.g., a part of the point-to-point message) with the broadcast flag set, it may process the destination list; if it contains more than just its own IP address, it will add the packet to its own transmit buffers, set up a bit-field to track receipt by each of the destinations and then send a Layer 2 Ethernet broadcast (with the broadcast flag and the destination list) to all of the acceleration components that share a common TOR switch. Destination acceleration components may acknowledge (using ACK) each packet upon successful receipt and the sender will mark the appropriate bit in the bit-field. Once a packet is acknowledged by all destinations, the send buffer can be released. Although FIG. 8 shows a certain number of steps listed in a certain order, there could be fewer or more steps and such steps could be performed in a different order.
  • In conclusion, a system comprising a software plane including a plurality of host components configured to execute instructions corresponding to at least one service and an acceleration plane including a plurality of acceleration components configurable to accelerate the at least one service is provided. The system may further include a network configured to interconnect the software plane and the acceleration plane, the network may include a first top-of-rack (TOR) switch associated with a first subset of the plurality of acceleration components, a second TOR switch associated with a second subset of the plurality of acceleration components, and a third TOR switch associated with a third subset of the plurality of acceleration components, where any of the first subset of the plurality of acceleration components is configurable to transmit a point-to-point message to any of the second subset of the plurality of acceleration components and to transmit the point-to-point message to any of the third subset of the plurality of acceleration components, and where any of the second subset of the plurality of acceleration components is configurable to broadcast the point-to-point message to all of the second subset of the plurality of acceleration components, and where any of the third subset of the plurality of acceleration components is configurable to broadcast the point-to-point message to all of the third subset of the plurality of acceleration components. The point-to-point message may be transmitted using a Layer 3 functionality. The point-to-point message may be broadcasted using a Layer 2 Ethernet broadcast functionality. The point-to-point message may be broadcasted without relying upon any broadcast support from higher layers of the network than a Layer 2 of the network or any multicast support from the higher layers of the network than the Layer 2 of the network. Any of the first subset of the plurality of acceleration components may be configured to dynamically create a network topology comprising any of the second subset of the plurality of the acceleration components and the third subset of the plurality of the acceleration components. Any of the first subset of the plurality of acceleration components may further be configured to dynamically adjust the network topology based on at least one performance metric associated with the network configured to interconnect the software plane and the acceleration plane. The at least one network performance metric may be selected substantially in real-time by the any of the first subset of the plurality of acceleration components based at least on requirements of the at least one service. The at least one network performance metric may comprise latency, bandwidth, or any other performance metric specified by an application logic corresponding to the at least one service.
  • In another example, the present disclosure relates to a method for allowing a first acceleration component among a first plurality of acceleration components, associated with a first top-of-rack (TOR) switch, to transmit messages to other acceleration components in an acceleration plane configurable to provide service acceleration for at least one service. The method may include the first acceleration component transmitting a point-to-point message to a second acceleration component, associated with a second TOR switch different from the first TOR switch, and to a third acceleration component, associated with a third TOR switch different from the first TOR switch and the second TOR switch. The method may further include the second acceleration component broadcasting the point-to-point message to all of a second plurality of acceleration components associated with the second TOR switch. The method may further include the third acceleration component broadcasting the point-to-point message to all of a third plurality of acceleration components associated with the third TOR switch. The first acceleration component may transmit the point-to-point message using a Layer 3 functionality. Each of the second acceleration component and the third acceleration component may broadcast the point-to-point message using a Layer 2 Ethernet broadcast functionality. The method may further include dynamically adjusting the network topology based on at least one performance metric associated with a network interconnecting the acceleration plane and a software plane including a plurality of host components configured to execute instructions corresponding to the at least one service. The at least one network performance metric may be selected substantially in real-time by the any of the first subset of the plurality of acceleration components based at least on requirements of the at least one service. The at least one network performance metric may comprise latency, bandwidth, or any other performance metric specified by an application logic corresponding to the at least one service.
  • In yet another example, the present disclosure relates to an acceleration component for use among a first plurality of acceleration components, associated with a first top-of-rack (TOR) switch, to transmit messages to other acceleration components in an acceleration plane configurable to provide service acceleration for at least one service. The acceleration component may include a transport component configured to transmit a first point-to-point message to a second acceleration component, associated with a second TOR switch different from the first TOR switch, and to a third acceleration component, associated with a third TOR switch different from the first TOR switch and the second TOR switch. The transport component may further be configured to broadcast a second point-to-point message to all of a second plurality of acceleration components associated with the second TOR switch and to all of a third plurality of acceleration components associated with the third TOR switch. The point-to-point message may be transmitted using a Layer 3 functionality. The point-to-point message may be broadcasted using a Layer 2 Ethernet broadcast functionality. The acceleration component may further be configured to broadcast the second point-to-point message in a network interconnecting acceleration components in the acceleration plane without relying upon any support for broadcasting or multicasting from higher layers of the network than a Layer 2 of the network.
  • It is to be understood that the systems, methods, modules, and components depicted herein are merely exemplary. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-Chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In an abstract, but still definite sense, any arrangement of components to achieve the same functionality is effectively "associated" such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as "associated with" each other such that the desired functionality is achieved, irrespective of architectures or inter-medial components. Likewise, any two components so associated can also be viewed as being "operably connected," or "coupled," to each other to achieve the desired functionality.
  • The functionality associated with some examples described in this disclosure can also include instructions stored in a non-transitory media. The term "non-transitory media" as used herein refers to any media storing data and/or instructions that cause a machine to operate in a specific manner. Exemplary non-transitory media include non-volatile media and/or volatile media. Non-volatile media include, for example, a hard disk, a solid state drive, a magnetic disk or tape, an optical disk or tape, a flash memory, an EPROM, NVRAM, PRAM, or other such media, or networked versions of such media. Volatile media include, for example, dynamic memory such as DRAM, SRAM, a cache, or other such media. Non-transitory media is distinct from, but can be used in conjunction with, transmission media. Transmission media is used for transferring data and/or instructions to or from a machine. Exemplary transmission media include coaxial cables, fiber-optic cables, copper wires, and wireless media, such as radio waves.
  • Furthermore, those skilled in the art will recognize that boundaries between the functionality of the above described operations are merely illustrative. The functionality of multiple operations may be combined into a single operation, and/or the functionality of a single operation may be distributed in additional operations. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.
  • Although the disclosure provides specific examples, various modifications and changes can be made without departing from the scope of the disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure. Any benefits, advantages, or solutions to problems that are described herein with regard to a specific example are not intended to be construed as a critical, required, or essential feature or element of any or all the claims.
  • Furthermore, the terms "a" or "an," as used herein, are defined as one or more than one. Also, the use of introductory phrases such as "at least one" and "one or more" in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an." The same holds true for the use of definite articles.
  • Unless stated otherwise, terms such as "first" and "second" are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements.

Claims (15)

  1. A system comprising:
    a software plane (104) including a plurality of host components (108) configured to execute instructions corresponding to at least one service;
    an acceleration plane (106) including a plurality of acceleration components (110, 116) configurable to accelerate the at least one service; and
    a network (120) configured to interconnect the software plane (104) and the acceleration plane (106), the network (120) comprising a first top-of-rack, TOR, switch (302) associated with a first subset of the plurality of acceleration components, a second TOR switch (304) associated with a second subset of the plurality of acceleration components, and a third TOR switch (306) associated with a third subset of the plurality of acceleration components, wherein any of the first subset of the plurality of acceleration components is configurable to transmit a point-to-point message to any of the second subset of the plurality of acceleration components and wherein any of the first subset of the plurality of acceleration components is characterized by being configurable to transmit the point-to-point message to any of the third subset of the plurality of acceleration components, and wherein any of the second subset of the plurality of acceleration components is configurable to broadcast the point-to-point message to all of the second subset of the plurality of acceleration components, and wherein any of the third subset of the plurality of acceleration components is configurable to broadcast the point-to-point message to all of the third subset of the plurality of acceleration components.
  2. The system of claim 1, wherein the point-to-point message is transmitted using a Layer 3 functionality.
  3. The system of claim 1, wherein the point-to-point message is broadcasted using a Layer 2 Ethernet broadcast functionality.
  4. The system of claim 1, wherein any of the first subset of the plurality of acceleration components is configured to dynamically create a network topology comprising any of the second subset of the plurality of the acceleration components and the third subset of the plurality of the acceleration components.
  5. The system of claim 4, wherein the any of the first subset of the plurality of acceleration components is further configured to dynamically adjust the network topology based on at least one performance metric associated with the network configured to interconnect the software plane and the acceleration plane.
  6. The system of claim 5, wherein the at least one network performance metric is selected substantially in real-time by the any of the first subset of the plurality of acceleration components based at least on requirements of the at least one service.
  7. The system of claim 6, wherein the at least one network performance metric comprises latency, bandwidth, or any other performance metric specified by an application logic corresponding to the at least one service.
  8. The system of claim 1, wherein the point-to-point message is broadcasted without relying upon any broadcast support from higher layers of the network than a Layer 2 of the network or any multicast support from the higher layers of the network than the Layer 2 of the network.
  9. A method for allowing a first acceleration component among a first plurality of acceleration components (110, 116), associated with a first top-of-rack, TOR, switch (302), to transmit messages to other acceleration components (110, 116) in an acceleration plane (106) configurable to provide service acceleration for at least one service, the method comprising:
    the first acceleration component transmitting a point-to-point message to a second acceleration component, associated with a second TOR switch different from the first TOR switch, the method being characterized in that the first acceleration component is transmitting the point-to-point message also to a third acceleration component, associated with a third TOR switch different from the first TOR switch and the second TOR switch;
    the second acceleration component broadcasting the point-to-point message to all of a second plurality of acceleration components associated with the second TOR switch; and
    the third acceleration component broadcasting the point-to-point message to all of a third plurality of acceleration components associated with the third TOR switch.
  10. The method of claim 9 further comprising the first acceleration component transmitting the point-to-point message using a Layer 3 functionality.
  11. The method of claim 9 further comprising each of the second acceleration component and the third acceleration component broadcasting the point-to-point message using a Layer 2 Ethernet broadcast functionality.
  12. The method of claim 11 further comprising dynamically creating a network topology comprising any of the other acceleration components including the second plurality of acceleration components and the third plurality of acceleration components.
  13. The method of claim 12 further comprising dynamically adjusting the network topology based on at least one performance metric associated with a network interconnecting the acceleration plane and a software plane including a plurality of host components configured to execute instructions corresponding to the at least one service.
  14. The method of claim 13, wherein the at least one network performance metric is selected substantially in real-time based at least on requirements of the at least one service.
  15. The method of claim 14, wherein the at least one network performance metric comprises latency, bandwidth, or any other performance metric specified by an application logic corresponding to the at least one service.
EP17833038.7A 2017-01-02 2017-12-19 Transmission of messages by acceleration components configured to accelerate a service Active EP3563535B8 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/396,779 US10326696B2 (en) 2017-01-02 2017-01-02 Transmission of messages by acceleration components configured to accelerate a service
PCT/US2017/067150 WO2018125652A1 (en) 2017-01-02 2017-12-19 Transmission of messages by acceleration components configured to accelerate a service

Publications (3)

Publication Number Publication Date
EP3563535A1 EP3563535A1 (en) 2019-11-06
EP3563535B1 true EP3563535B1 (en) 2021-01-20
EP3563535B8 EP3563535B8 (en) 2021-03-31

Family

ID=61022412

Family Applications (1)

Application Number Title Priority Date Filing Date
EP17833038.7A Active EP3563535B8 (en) 2017-01-02 2017-12-19 Transmission of messages by acceleration components configured to accelerate a service

Country Status (5)

Country Link
US (1) US10326696B2 (en)
EP (1) EP3563535B8 (en)
CN (2) CN113315722A (en)
ES (1) ES2856155T3 (en)
WO (1) WO2018125652A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10320677B2 (en) 2017-01-02 2019-06-11 Microsoft Technology Licensing, Llc Flow control and congestion management for acceleration components configured to accelerate a service
CN108984309A (en) * 2018-08-07 2018-12-11 郑州云海信息技术有限公司 A kind of RACK server resource pond system and method
US10922250B2 (en) * 2019-04-30 2021-02-16 Microsoft Technology Licensing, Llc Monitoring and steering service requests to acceleration components
CN113748648A (en) 2019-05-23 2021-12-03 慧与发展有限责任合伙企业 Weight routing

Family Cites Families (51)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0948168A1 (en) 1998-03-31 1999-10-06 TELEFONAKTIEBOLAGET L M ERICSSON (publ) Method and device for data flow control
US6505253B1 (en) 1998-06-30 2003-01-07 Sun Microsystems Multiple ACK windows providing congestion control in reliable multicast protocol
US7437305B1 (en) 1999-05-11 2008-10-14 Christopher Angel Kantarjiev Scheduling delivery of products via the internet
US8363744B2 (en) 2001-06-10 2013-01-29 Aloft Media, Llc Method and system for robust, secure, and high-efficiency voice and packet transmission over ad-hoc, mesh, and MIMO communication networks
US7003118B1 (en) 2000-11-27 2006-02-21 3Com Corporation High performance IPSEC hardware accelerator for packet classification
US20070053356A1 (en) * 2003-10-30 2007-03-08 Venkat Konda Nonblocking and deterministic multirate multicast packet scheduling
CN1674576B (en) 2004-06-03 2010-04-28 华为技术有限公司 Method for transmitting strategic information inter-network equipment
US7570639B2 (en) 2004-11-30 2009-08-04 Broadcom Corporation Multicast trunking in a network device
US7492710B2 (en) 2005-03-31 2009-02-17 Intel Corporation Packet flow control
US7606147B2 (en) 2005-04-13 2009-10-20 Zeugma Systems Inc. Application aware traffic shaping service node positioned between the access and core networks
EP1909440A1 (en) 2005-07-06 2008-04-09 NEC Corporation Bandwidth control circuit and bandwidth control method used therefor
US9621375B2 (en) 2006-09-12 2017-04-11 Ciena Corporation Smart Ethernet edge networking system
US7640023B2 (en) 2006-05-03 2009-12-29 Cisco Technology, Inc. System and method for server farm resource allocation
US8027284B2 (en) 2006-11-27 2011-09-27 Ntt Docomo, Inc. Method and apparatus for reliable multicasting in wireless relay networks
US7839777B2 (en) 2007-09-27 2010-11-23 International Business Machines Corporation Method, system, and apparatus for accelerating resolution of network congestion
US20100036903A1 (en) 2008-08-11 2010-02-11 Microsoft Corporation Distributed load balancer
US8910153B2 (en) 2009-07-13 2014-12-09 Hewlett-Packard Development Company, L. P. Managing virtualized accelerators using admission control, load balancing and scheduling
US8514876B2 (en) 2009-08-11 2013-08-20 Cisco Technology, Inc. Method and apparatus for sequencing operations for an incoming interface check in data center ethernet
US8914805B2 (en) 2010-08-31 2014-12-16 International Business Machines Corporation Rescheduling workload in a hybrid computing environment
US9088510B2 (en) 2010-12-17 2015-07-21 Microsoft Technology Licensing, Llc Universal rate control mechanism with parameter adaptation for real-time communication applications
EP3998755A1 (en) 2010-12-29 2022-05-18 Juniper Networks, Inc. Methods and apparatus for standard protocol validation mechanisms deployed over a switch fabric system
US8798077B2 (en) * 2010-12-29 2014-08-05 Juniper Networks, Inc. Methods and apparatus for standard protocol validation mechanisms deployed over a switch fabric system
US8989009B2 (en) * 2011-04-29 2015-03-24 Futurewei Technologies, Inc. Port and priority based flow control mechanism for lossless ethernet
US8812727B1 (en) 2011-06-23 2014-08-19 Amazon Technologies, Inc. System and method for distributed load balancing with distributed direct server return
US20130318280A1 (en) 2012-05-22 2013-11-28 Xockets IP, LLC Offloading of computation for rack level servers and corresponding methods and systems
US9130764B2 (en) 2012-05-31 2015-09-08 Dell Products L.P. Scaling up/out the number of broadcast domains in network virtualization environments
US9374270B2 (en) 2012-06-06 2016-06-21 Juniper Networks, Inc. Multicast service in virtual networks
US10270709B2 (en) 2015-06-26 2019-04-23 Microsoft Technology Licensing, Llc Allocating acceleration component functionality for supporting services
US8953618B2 (en) 2012-10-10 2015-02-10 Telefonaktiebolaget L M Ericsson (Publ) IP multicast service leave process for MPLS-based virtual private cloud networking
US9253140B2 (en) 2012-11-20 2016-02-02 Cisco Technology, Inc. System and method for optimizing within subnet communication in a network environment
US9344493B1 (en) 2013-07-11 2016-05-17 Juniper Networks, Inc. Server health monitoring for traffic load balancer
US9231863B2 (en) 2013-07-23 2016-01-05 Dell Products L.P. Systems and methods for a data center architecture facilitating layer 2 over layer 3 communication
US9313134B2 (en) 2013-10-15 2016-04-12 Cisco Technology, Inc. Leveraging hardware accelerators for scalable distributed stream processing in a network environment
US9294304B2 (en) 2014-03-31 2016-03-22 Juniper Networks, Inc. Host network accelerator for data center overlay network
US9794079B2 (en) 2014-03-31 2017-10-17 Nicira, Inc. Replicating broadcast, unknown-unicast, and multicast traffic in overlay logical networks bridged with physical networks
US9866427B2 (en) 2015-02-16 2018-01-09 Juniper Networks, Inc. Multi-stage switch fabric fault detection and handling
US9760159B2 (en) 2015-04-08 2017-09-12 Microsoft Technology Licensing, Llc Dynamic power routing to hardware accelerators
US9983938B2 (en) * 2015-04-17 2018-05-29 Microsoft Technology Licensing, Llc Locally restoring functionality at acceleration components
US10296392B2 (en) 2015-04-17 2019-05-21 Microsoft Technology Licensing, Llc Implementing a multi-component service using plural hardware acceleration components
US10027543B2 (en) 2015-04-17 2018-07-17 Microsoft Technology Licensing, Llc Reconfiguring an acceleration component among interconnected acceleration components
US9792154B2 (en) 2015-04-17 2017-10-17 Microsoft Technology Licensing, Llc Data processing system having a hardware acceleration plane and a software plane
US20160308649A1 (en) 2015-04-17 2016-10-20 Microsoft Technology Licensing, Llc Providing Services in a System having a Hardware Acceleration Plane and a Software Plane
US20160335209A1 (en) * 2015-05-11 2016-11-17 Quanta Computer Inc. High-speed data transmission using pcie protocol
US9847936B2 (en) 2015-06-25 2017-12-19 Intel Corporation Apparatus and method for hardware-accelerated packet processing
US20160379686A1 (en) 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Server systems with hardware accelerators including stacked memory
CN105162721B (en) * 2015-07-31 2018-02-27 重庆大学 Full light network data centre network system and data communications method based on software defined network
US20170111294A1 (en) 2015-10-16 2017-04-20 Compass Electro Optical Systems Ltd. Integrated folded clos architecture
US10552205B2 (en) 2016-04-02 2020-02-04 Intel Corporation Work conserving, load balancing, and scheduling
CN106230952A (en) * 2016-08-05 2016-12-14 王楚 Monitor the big data storing platform network architecture
US10320677B2 (en) 2017-01-02 2019-06-11 Microsoft Technology Licensing, Llc Flow control and congestion management for acceleration components configured to accelerate a service
US10425472B2 (en) 2017-01-17 2019-09-24 Microsoft Technology Licensing, Llc Hardware implemented load balancing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
None *

Also Published As

Publication number Publication date
CN110121868B (en) 2021-06-18
US20180191609A1 (en) 2018-07-05
US10326696B2 (en) 2019-06-18
ES2856155T3 (en) 2021-09-27
CN113315722A (en) 2021-08-27
EP3563535A1 (en) 2019-11-06
WO2018125652A1 (en) 2018-07-05
EP3563535B8 (en) 2021-03-31
CN110121868A (en) 2019-08-13

Similar Documents

Publication Publication Date Title
US10791054B2 (en) Flow control and congestion management for acceleration components configured to accelerate a service
US20220311544A1 (en) System and method for facilitating efficient packet forwarding in a network interface controller (nic)
US11412076B2 (en) Network access node virtual fabrics configured dynamically over an underlay network
US10116574B2 (en) System and method for improving TCP performance in virtualized environments
US10922250B2 (en) Monitoring and steering service requests to acceleration components
US20200169513A1 (en) Fabric control protocol for data center networks with packet spraying over multiple alternate data paths
US9596192B2 (en) Reliable link layer for control links between network controllers and switches
US7016971B1 (en) Congestion management in a distributed computer system multiplying current variable injection rate with a constant to set new variable injection rate at source node
US20210320820A1 (en) Fabric control protocol for large-scale multi-stage data center networks
EP3563535B1 (en) Transmission of messages by acceleration components configured to accelerate a service
US20210297350A1 (en) Reliable fabric control protocol extensions for data center networks with unsolicited packet spraying over multiple alternate data paths
US20210297351A1 (en) Fabric control protocol with congestion control for data center networks
US20210297343A1 (en) Reliable fabric control protocol extensions for data center networks with failure resilience
US20230359582A1 (en) In-network collective operations
US11757778B2 (en) Methods and systems for fairness across RDMA requesters using a shared receive queue
US20240121320A1 (en) High Performance Connection Scheduler
Shen et al. A Lightweight Routing Layer Using a Reliable Link-Layer Protocol
WO2018022083A1 (en) Deliver an ingress packet to a queue at a gateway device

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20190624

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTG Intention to grant announced

Effective date: 20200817

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602017031844

Country of ref document: DE

REG Reference to a national code

Ref country code: AT

Ref legal event code: REF

Ref document number: 1357304

Country of ref document: AT

Kind code of ref document: T

Effective date: 20210215

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: SE

Ref legal event code: TRGR

REG Reference to a national code

Ref country code: CH

Ref legal event code: PK

Free format text: BERICHTIGUNG B8 (correction, kind code B8)

RAP2 Party data changed (patent owner data changed or rights of a patent transferred)

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC

REG Reference to a national code

Ref country code: NL

Ref legal event code: FP

REG Reference to a national code

Ref country code: LT

Ref legal event code: MG9D

REG Reference to a national code

Ref country code: AT

Ref legal event code: MK05

Ref document number: 1357304

Country of ref document: AT

Kind code of ref document: T

Effective date: 20210120

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210421

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210520

Ref country code: NO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210420

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

Ref country code: BG

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210420

REG Reference to a national code

Ref country code: DE

Ref legal event code: R081

Ref document number: 602017031844

Country of ref document: DE

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, REDMOND, US

Free format text: FORMER OWNER: MICROSOFT TECHNOLOGY LICENSING LLC, REDMOND, WA, US

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

Ref country code: RS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

Ref country code: LV

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

REG Reference to a national code

Ref country code: ES

Ref legal event code: FG2A

Ref document number: 2856155

Country of ref document: ES

Kind code of ref document: T3

Effective date: 20210927

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210520

REG Reference to a national code

Ref country code: DE

Ref legal event code: R097

Ref document number: 602017031844

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

Ref country code: SM

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

RAP4 Party data changed (patent owner data changed or rights of a patent transferred)

Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

REG Reference to a national code

Ref country code: DE

Ref legal event code: R079

Ref document number: 602017031844

Country of ref document: DE

Free format text: PREVIOUS MAIN CLASS: H04L0012933000

Ipc: H04L0049100000

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

26N No opposition filed

Effective date: 20211021

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: AL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MC

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20211219

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20211219

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LI

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20211231

Ref country code: CH

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20211231

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: IT

Payment date: 20221111

Year of fee payment: 6

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: BE

Payment date: 20221118

Year of fee payment: 6

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: ES

Payment date: 20230113

Year of fee payment: 6

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230430

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: CY

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210120

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: HU

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT; INVALID AB INITIO

Effective date: 20171219

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: NL

Payment date: 20231121

Year of fee payment: 7

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20231121

Year of fee payment: 7

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: SE

Payment date: 20231121

Year of fee payment: 7

Ref country code: FR

Payment date: 20231122

Year of fee payment: 7

Ref country code: DE

Payment date: 20231121

Year of fee payment: 7

Ref country code: CZ

Payment date: 20231124

Year of fee payment: 7

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: BE

Payment date: 20231121

Year of fee payment: 7