CN113272793A - Network interface device - Google Patents

Network interface device

Info

Publication number
CN113272793A
Authority
CN
China
Prior art keywords
processing
network interface
interface device
data
function
Prior art date
Legal status
Pending
Application number
CN201980087757.XA
Other languages
Chinese (zh)
Inventor
S. Pope
N. Turton
D. Riddoch
D. Kitariev
R. Sohan
D. Roberts
Current Assignee
Xilinx Inc
Original Assignee
Xilinx Inc
Priority date
Filing date
Publication date
Priority claimed from US16/180,883 (US11012411B2)
Priority claimed from US16/395,027 (US11082364B2)
Application filed by Xilinx Inc filed Critical Xilinx Inc
Publication of CN113272793A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30076Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
    • G06F9/30079Pipeline control instructions, e.g. multicycle NOP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/74Address processing for routing
    • H04L45/742Route cache; Operation thereof
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/15Interconnection of switching modules
    • H04L49/1515Non-blocking multistage, e.g. Clos
    • H04L49/1546Non-blocking multistage, e.g. Clos using pipelined operation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Stored Programmes (AREA)
  • Advance Control (AREA)
  • Logic Circuits (AREA)

Abstract

A network interface device has a hardware module comprising a plurality of processing units, each of which is associated with its own at least one predetermined operation. At compile time, the hardware module is arranged so that at least some of the plurality of processing units perform a function on a data packet in sequence. A compiler is provided to assign different processing stages to different processing units. A controller is provided to switch between different processing circuits in operation, so that one processing circuit can be used while another is being compiled.

Description

Network interface device
Technical Field
The present application relates to a network interface device for performing a function on a data packet.
Background
Network interface devices are known and are commonly used to provide an interface between a computing device and a network. The network interface device may be configured to process data received from a network and/or process data to be placed on the network.
Disclosure of Invention
According to one aspect, there is provided a network interface device for connecting a host to a network, the network interface device comprising: a first interface configured to receive a plurality of data packets; and a configurable hardware module comprising a plurality of processing units, each processing unit associated with a predetermined type of operation capable of being performed in a single step, wherein at least some of the plurality of processing units are associated with different predetermined types of operation, and wherein the hardware module is configurable to interconnect at least some of the plurality of processing units to provide a first data processing pipeline for processing one or more of the plurality of data packets to perform a first function on the one or more of the plurality of data packets.
In some embodiments, the first function comprises a filtering function. In some embodiments, the first function comprises at least one of a tunneling function, an encapsulation function, and a routing function. In some embodiments, the first function comprises an extended Berkeley Packet Filter (eBPF) function.
In some embodiments, the first function comprises a distributed denial of service (DDoS) scrubbing operation.
In some embodiments, the first function comprises firewall operations.
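As a concrete illustration of the kind of first function contemplated above, the following is a minimal sketch of an eBPF/XDP packet filter of the sort that might be offloaded to the hardware module. The header parsing, the blocked port number and the omission of IP options handling are illustrative assumptions and are not part of this disclosure.

    /* Minimal eBPF/XDP filter sketch (compile with: clang -O2 -target bpf). */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/in.h>
    #include <linux/ip.h>
    #include <linux/udp.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int drop_blocked_udp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* Bounds-checked Ethernet header parse, as the eBPF verifier requires. */
        struct ethhdr *eth = data;
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        /* IPv4 header (IP options ignored for brevity). */
        struct iphdr *ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;
        if (ip->protocol != IPPROTO_UDP)
            return XDP_PASS;

        /* UDP header. */
        struct udphdr *udp = (void *)(ip + 1);
        if ((void *)(udp + 1) > data_end)
            return XDP_PASS;

        /* Illustrative rule: drop traffic to one blocked UDP port. */
        if (udp->dest == bpf_htons(1234))
            return XDP_DROP;

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

Each bounds check, header load and comparison in such a program corresponds naturally to one of the single-step operations that a processing unit of the hardware module could perform.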
In some embodiments, the first interface is configured to receive a first data packet from a network.
In some embodiments, the first interface is configured to receive a first data packet from the host device.
In some embodiments, two or more of at least some of the plurality of processing units are configured to perform their associated at least one predetermined operation in parallel.
In some embodiments, two or more of at least some of the plurality of processing units are configured to perform their associated predetermined type of operation according to a common clock signal of the hardware module.
In some embodiments, each of two or more of at least some of the plurality of processing units is configured to perform its associated predetermined type of operation within a predetermined length of time specified by the clock signal.
In some embodiments, two or more of at least some of the plurality of processing units are configured to: accessing the first data packet for a period of a predetermined length of time; and transmitting a result of the respective at least one operation to a next processing unit in response to an end of the predetermined length of time.
In some embodiments, the results include at least one or more of the following: at least one value from one or more of the plurality of data packets; an updated map state; and metadata.
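As a rough model of the clocked, lock-step behaviour described in the preceding embodiments, the following plain-C sketch (all names invented for illustration) steps a pipeline of processing units once per simulated clock tick, with each unit handing its result, that is, packet values, an updated map state and metadata, to the next unit.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NUM_STAGES 4

    /* Result handed from one processing unit to the next on each clock tick. */
    struct stage_result {
        uint8_t  pkt_bytes[64];   /* values loaded from the data packet        */
        uint32_t map_state;       /* updated map (lookup-table) state          */
        uint32_t metadata;        /* e.g. running verdict or parse offset      */
    };

    /* Each processing unit performs one predetermined operation per tick. */
    typedef struct stage_result (*stage_op)(struct stage_result in);

    static struct stage_result op_load(struct stage_result in)   { in.metadata |= 0x1u; return in; }
    static struct stage_result op_lookup(struct stage_result in) { in.map_state += 1u;  return in; }
    static struct stage_result op_logic(struct stage_result in)  { in.metadata <<= 1;   return in; }
    static struct stage_result op_store(struct stage_result in)  { in.metadata |= 0x2u; return in; }

    int main(void)
    {
        stage_op pipeline[NUM_STAGES] = { op_load, op_lookup, op_logic, op_store };
        struct stage_result slots[NUM_STAGES + 1];
        memset(slots, 0, sizeof(slots));

        for (int tick = 0; tick < 6; tick++) {
            /* Stages advance from the back so each unit consumes the result its
             * predecessor produced on the previous tick: one operation per unit
             * per clock cycle, with all units working in parallel.             */
            for (int s = NUM_STAGES; s > 0; s--)
                slots[s] = pipeline[s - 1](slots[s - 1]);

            /* A new packet enters the front of the pipeline on every tick. */
            memset(&slots[0], 0, sizeof(slots[0]));
            slots[0].metadata = (uint32_t)tick;

            printf("tick %d: out metadata=0x%x map_state=%u\n",
                   tick, slots[NUM_STAGES].metadata, slots[NUM_STAGES].map_state);
        }
        return 0;
    }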
In some embodiments, each of the plurality of processing units includes an application specific integrated circuit configured to perform at least one operation associated with the respective processing unit.
In some embodiments, each processing unit comprises a field programmable gate array. In some embodiments, each processing unit includes any other type of soft logic.
In some embodiments, at least one of the plurality of processing units includes digital circuitry and memory to store state associated with processing performed by the digital circuitry, wherein the digital circuitry is configured to perform a predetermined type of operation associated with the respective processing unit in communication with the memory.
In some embodiments, the network interface device comprises a memory accessible to two or more of the plurality of processing units, wherein the memory is configured to store a state associated with the first data packet, and wherein, during execution of the first function by the hardware module, the two or more of the plurality of processing units are configured to access and modify the state.
In some embodiments, a first one of at least some of the plurality of processing units is configured to stall during a time when a second one of the plurality of processing units accesses the value of the state.
In some embodiments, one or more of the plurality of processing units are individually configured to perform, based on their associated predetermined operation type, a specific operation for the respective pipeline.
In some embodiments, the hardware module is configured to receive an instruction and, in response to the instruction, perform at least one of: interconnecting at least some of the plurality of processing units to provide a data processing pipeline for processing one or more of the plurality of data packets; causing one or more processing units of the plurality of processing units to perform their associated predetermined type of operation with respect to the one or more data packets; adding one or more processing units of the plurality of processing units to a data processing pipeline; and removing one or more processing units of the plurality of processing units from the data processing pipeline.
In some embodiments, the predetermined operation comprises at least one of: loading at least one value of the first data packet from a memory; storing at least one value of the data packet in a memory; and performing a lookup in a lookup table to determine an action to perform on the packet.
In some embodiments, the hardware module is configured to receive an instruction, wherein the hardware module is configurable to interconnect at least some of the plurality of the processing units in response to the instruction to provide a data processing pipeline for processing one or more of the plurality of data packets, wherein the instruction comprises a data packet sent through a third processing pipeline.
In some embodiments, one or more of at least some of the plurality of processing units may be configured to: in response to the instruction, performing the selected operation of its associated predetermined operation type for one or more of the plurality of data packets.
In some embodiments, the plurality of components includes a second one of the plurality of components configured to provide the first function in circuitry other than the hardware module, wherein the network interface device includes at least one controller configured to pass the data packet through the processing pipeline for processing by one of the first one of the plurality of components and the second one of the plurality of components.
In some embodiments, a network interface device includes at least one controller configured to issue an instruction to cause a hardware module to begin performing a first function on a data packet, wherein the instruction is configured to cause a first component of a plurality of components to be inserted into the processing pipeline.
In some embodiments, a network interface device includes at least one controller configured to issue an instruction to cause a hardware module to begin performing a first function with respect to a data packet, wherein the instruction includes a control message sent through a processing pipeline and configured to cause a first one of the plurality of components to be activated.
In some embodiments, for one or more of at least some of the plurality of processing units, the associated at least one operation comprises at least one of: loading at least one value of the first data packet from a memory of a network interface device; storing at least one value of the first data packet in a memory of the network interface device; and performing a lookup in a lookup table to determine an action to perform on the first packet.
In some embodiments, one or more of the at least some of the plurality of processing units is configured to pass at least one result of its associated at least one predetermined operation to a next processing unit in the first processing pipeline, the next processing unit being configured to perform a next predetermined operation in dependence on the at least one result.
In some embodiments, each of the different predetermined operation types is defined by a different template.
In some embodiments, the predetermined operation type comprises at least one of: accessing the data packet; accessing a lookup table stored in a memory of the hardware module; performing a logical operation on data loaded from the data packet; and performing a logical operation on the data loaded from the lookup table.
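One way to picture the "templates" referred to above is as a small fixed menu of operation types from which each processing unit is instantiated. The enum and dispatch below are a hypothetical software model of that idea, not the actual hardware interface.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical templates: each processing unit is built from exactly one. */
    enum op_template {
        OP_PACKET_ACCESS,    /* read a field of the data packet                  */
        OP_TABLE_LOOKUP,     /* consult a lookup table held in module memory     */
        OP_LOGIC_ON_PACKET,  /* logical operation on data loaded from the packet */
        OP_LOGIC_ON_TABLE    /* logical operation on data loaded from the table  */
    };

    struct processing_unit {
        enum op_template type;   /* fixed when the hardware module is designed   */
        uint32_t         param;  /* e.g. packet field offset or table index      */
    };

    /* Illustrative single-step execution of one unit against packet + table. */
    static uint32_t run_unit(const struct processing_unit *u,
                             const uint8_t *pkt, const uint32_t *table,
                             uint32_t operand)
    {
        switch (u->type) {
        case OP_PACKET_ACCESS:   return pkt[u->param];
        case OP_TABLE_LOOKUP:    return table[u->param];
        case OP_LOGIC_ON_PACKET: return (uint32_t)pkt[u->param] & operand;
        case OP_LOGIC_ON_TABLE:  return table[u->param] ^ operand;
        }
        return 0;
    }

    int main(void)
    {
        uint8_t  pkt[16]  = { 0x45, 0x00, 0x00, 0x28 };
        uint32_t table[4] = { 7, 11, 13, 17 };
        struct processing_unit load  = { OP_PACKET_ACCESS, 0 };
        struct processing_unit match = { OP_LOGIC_ON_TABLE, 2 };

        uint32_t v = run_unit(&load, pkt, table, 0);
        printf("packet byte 0x%x, table operation result %u\n",
               v, run_unit(&match, pkt, table, v));
        return 0;
    }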
In some embodiments, the hardware module comprises routing hardware, wherein the hardware module is configured to provide the first data processing pipeline by configuring the routing hardware to interconnect at least some of the plurality of processing units such that packets are routed between the plurality of processing units in a particular order specified by the first data processing pipeline.
In some embodiments, the hardware module may be configured to interconnect at least some of the plurality of the processing units to provide a second data processing pipeline for processing one or more of the plurality of data packets to perform a second function different from the first function.
In some embodiments, the hardware module may be configured to interconnect at least some of the plurality of processing units to provide a second data processing pipeline after interconnecting at least some of the plurality of processing units to provide a first data processing pipeline.
In some embodiments, the network interface device includes further circuitry separate from the hardware module and configured to perform a first function on one or more of the plurality of data packets.
In some embodiments, the further circuitry comprises at least one of: a field programmable gate array; and a plurality of central processing units.
In some embodiments, the network interface device comprises at least one controller, wherein the further circuitry is configured to perform a first function to be performed on a data packet during a compilation process for the first function to be performed in the hardware module, wherein the at least one controller is configured to, in response to completion of the compilation process, control the hardware module to begin performing the first function on a data packet.
In some embodiments, the further circuitry comprises a plurality of central processing units.
In some embodiments, the at least one controller is configured to: in response to determining that compilation processing for a first function to be performed in the hardware module has been completed, controlling the further circuitry to stop performing the first function on data packets.
In some embodiments, the network interface device comprises at least one controller, wherein the hardware module is configured to perform the first function on a data packet during a compilation process for the first function to be performed in further circuitry, wherein the at least one controller is configured to determine that the compilation process for the first function to be performed in the further circuitry has been completed and, in response to the determination, control the further circuitry to begin performing the first function with respect to a data packet.
In some embodiments, the further circuitry comprises a field programmable gate array.
In some embodiments, the at least one controller is configured to: in response to determining that the compilation process for the first function to be performed in the further circuitry has been completed, controlling the hardware module to stop performing the first function on data packets.
In some embodiments, the network interface device comprises at least one controller configured to perform a compilation process to provide the first function to be performed in the hardware module.
In some embodiments, the compilation process includes providing instructions to provide a control plane interface in the hardware module that is responsive to the control message.
According to another aspect, there is provided a data processing system comprising a network interface device according to the first aspect and a host device, wherein the data processing system comprises at least one controller configured to perform a compilation process to provide the first function to be performed in the hardware module.
In some embodiments, the at least one controller is provided by one or more of: a network interface device and a host device.
In some embodiments, the compiling process is performed in response to a determination, made by the at least one controller, that a computer program representing the first function is safe for execution in kernel mode of the host device.
In some embodiments, the at least one controller is configured to perform the compilation process by: specifying each processing unit of at least some of the plurality of processing units to perform at least one operation represented by a series of computer code instructions in a particular order of the first data processing pipeline, wherein the plurality of operations provide the first function for the one or more data packets of the plurality of data packets.
In some embodiments, the at least one controller is configured to: send a first instruction to cause further circuitry of the network interface device to perform the first function on the data packet before the compilation process is complete; and send a second instruction to cause the hardware module to start performing the first function on the data packet after the compilation process is complete.
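The hand-over described in the embodiments above can be summarised as a small controller state machine: run the first function on the quickly available circuitry while the hardware module is being compiled, then switch. The sketch below uses invented names (start, on_compile_done, BACKEND_CPU) and shows only one plausible ordering of the two instructions.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical circuits able to perform the first function. */
    enum backend { BACKEND_CPU, BACKEND_HW_MODULE };

    struct controller {
        enum backend active;       /* circuit currently serving packets          */
        bool         hw_compiled;  /* compilation for the hardware module done?  */
    };

    /* First instruction: start on the circuit that needs no long compilation. */
    static void start(struct controller *c)
    {
        c->active      = BACKEND_CPU;
        c->hw_compiled = false;
        printf("first function running on further circuitry while hardware module compiles\n");
    }

    /* Second instruction: once compilation completes, switch circuits. */
    static void on_compile_done(struct controller *c)
    {
        c->hw_compiled = true;
        c->active      = BACKEND_HW_MODULE;   /* e.g. insert/activate component  */
        printf("hardware module now performs the first function; other circuit stopped\n");
    }

    static void process_packet(const struct controller *c, int pkt_id)
    {
        printf("packet %d handled by %s\n", pkt_id,
               c->active == BACKEND_CPU ? "CPU circuitry" : "hardware module");
    }

    int main(void)
    {
        struct controller c;
        start(&c);
        process_packet(&c, 1);   /* served during compilation                    */
        on_compile_done(&c);     /* compilation of the pipeline has finished     */
        process_packet(&c, 2);   /* served by the compiled hardware module       */
        return 0;
    }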
According to another aspect, there is provided a method for implementation in a network interface device, the method comprising: receiving a plurality of data packets at a first interface; and configuring the hardware module to interconnect at least some of a plurality of processing units of the hardware module to provide a first data processing pipeline for processing one or more of the plurality of data packets to perform a first function on the one or more of the plurality of data packets, wherein each processing unit is associated with a predetermined type of operation that can be performed in a single step, wherein at least some of the plurality of processing units are associated with different predetermined types of operations.
According to another aspect, there is provided a non-transitory computer readable medium comprising program instructions for causing a network interface device to perform a method comprising: receiving a plurality of data packets at a first interface; and configuring the hardware module to interconnect at least some of a plurality of processing units of the hardware module to provide a first data processing pipeline for processing one or more of the plurality of data packets to perform a first function on the one or more of the plurality of data packets, wherein each processing unit is associated with a predetermined type of operation that can be performed in a single step, wherein at least some of the plurality of processing units are associated with different predetermined types of operations.
According to another aspect, there is provided a processing unit configured to: perform at least one predetermined operation on a first data packet received at a network interface device; connect to a first further processing unit configured to perform a first further at least one predetermined operation on the first data packet; connect to a second further processing unit configured to perform a second further at least one predetermined operation on the first data packet; receive a result of the first further at least one predetermined operation from the first further processing unit; perform the at least one predetermined operation in dependence on the result of the first further at least one predetermined operation; and send a result of the at least one predetermined operation to the second further processing unit for processing in the second further at least one predetermined operation.
In some embodiments, the processing unit is configured to receive a clock signal for timing the at least one predetermined operation, wherein the processing unit is configured to perform the at least one predetermined operation in at least one cycle of the clock signal.
In some embodiments, the processing unit is configured to perform at least one predetermined operation in a single cycle of the clock signal.
In some embodiments, the at least one predetermined operation, the first further at least one predetermined operation, and the second further at least one predetermined operation form part of a function performed with respect to a first data packet received at the network interface device.
In some embodiments, a first data packet is received from a host device, wherein a network interface device is configured to interface the host device to a network.
In some embodiments, a first data packet is received from a network, wherein a network interface device is configured to interface a host device to the network.
In some embodiments, the function is a filtering function.
In some embodiments, the filtering function is an extended Berkeley packet filtering function.
In some embodiments, the processing unit comprises an application specific integrated circuit configured to perform at least one predetermined operation.
In some embodiments, the processing unit comprises: digital circuitry configured to perform at least one predetermined operation; and a memory storing a state related to at least one predetermined operation performed.
In some embodiments, the processing unit is configured to access a memory accessible by the first further processing unit and the second further processing unit, wherein the memory is configured to store a state associated with the first data packet, wherein the at least one predetermined operation comprises modifying the state stored in the memory.
In some embodiments, the processing unit is configured to read the value of the state from the memory in a first clock cycle and provide the value to the second further processing unit for modification by the second further processing unit, wherein the processing unit is configured to stall in a second clock cycle following the first clock cycle.
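The stall described here can be read as a read-modify-write hazard rule on the shared map state: the unit that read the state must not issue another access in the following cycle, because a downstream unit is writing the modified value back. A hypothetical three-cycle timeline in C:

    #include <stdio.h>

    int main(void)
    {
        int map_state = 5;   /* shared state associated with the first data packet */
        int value, modified;

        /* Cycle 1: unit A reads the state and forwards the value downstream.     */
        value = map_state;
        printf("cycle 1: unit A reads state %d and forwards it\n", value);

        /* Cycle 2: unit A stalls (issues no access) while unit B modifies the
         * value and writes it back, so no read-modify-write hazard arises.       */
        modified  = value + 1;
        map_state = modified;
        printf("cycle 2: unit A stalls; unit B writes back state %d\n", map_state);

        /* Cycle 3: unit A may access the map state again.                        */
        printf("cycle 3: unit A resumes and sees state %d\n", map_state);
        return 0;
    }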
In some embodiments, the at least one predetermined operation comprises at least one of: loading a first data packet from a memory of a network interface device; storing the first data packet in a memory of the network interface device; and performing a lookup in a lookup table to determine an action to be performed on the first packet.
According to another aspect, there is provided a method implemented in a processing unit, the method comprising: performing at least one predetermined operation with respect to a first data packet received at the network interface device; connected to a first further processing unit configured to perform a first further at least one predetermined operation on the first data packet; connected to a second further processing unit configured to perform a second further at least one predetermined operation on the first data packet; receiving a result of a first further at least one predetermined operation from a first further processing unit; performing the at least one predetermined operation in dependence on a result of the first further at least one predetermined operation; sending a result of the at least one predetermined operation to the second further processing unit for processing in the second further at least one predetermined operation.
According to another aspect, there is provided a computer-readable non-transitory storage device storing instructions that, when executed by a processing unit, cause the processing unit to perform a method comprising: performing at least one predetermined operation with respect to a first data packet received at the network interface device; connected to a first further processing unit configured to perform a first further at least one predetermined operation with respect to the first data packet; connected to a second further processing unit configured to perform a second further at least one predetermined operation with respect to the first data packet; receiving a result of a first further at least one predetermined operation from a first further processing unit; performing the at least one predetermined operation in dependence on a result of the first further at least one predetermined operation; sending a result of the at least one predetermined operation to the second further processing unit for processing in the second further at least one predetermined operation.
According to another aspect, there is provided a network interface device for interfacing a host device with a network, the network interface device comprising: at least one controller; a first interface configured to receive a data packet; a first circuit configured to perform a first function on a data packet received at a first interface; and a second circuit, wherein the first circuit is configured to perform the first function with respect to the data packets received at the first interface during a compilation process for the first function to be performed in the second circuit, wherein the at least one controller is configured to determine that the compilation process for the first function performed in the second circuit is complete and, in response to the determination, control the second circuit to begin performing the first function for the data packets received at the first interface.
In some embodiments, the at least one controller is configured to control the first circuitry to stop performing the first function on data packets received at the first interface in response to said determination that the compilation process for the first function to be performed in the second circuitry has been completed.
In some embodiments, the at least one controller is configured to, in response to the determination that the compilation process for the first function to be performed in the second circuit has been completed: initiating execution of a first function on a packet of a first data stream received at a first interface; and controlling the first circuit to stop performing the first function on the data packets of the first data stream.
In some embodiments, the first circuit comprises at least one central processing unit, wherein each of the at least one central processing unit is configured to perform the first function on at least one data packet received at the first interface.
In some embodiments, the second circuit comprises a field programmable gate array configured to begin performing the first function on a data packet received at the first interface.
In some embodiments, the second circuit comprises a hardware module comprising a plurality of processing units, each processing unit associated with at least one predetermined operation, wherein the first interface is configured to receive the first data packet, wherein the hardware module is configured to: after compilation processing for a first function to be performed in the second circuit, at least some of the plurality of processing units are caused to perform their associated at least one predetermined operation in a particular order, thereby performing the first function for the first packet.
In some embodiments, the first circuit comprises a hardware module comprising a plurality of processing units, each processing unit associated with at least one predetermined operation, wherein the first interface is configured to receive the first data packet, wherein the hardware module is configured to: during the compilation process for a first function to be performed in the second circuit, at least some of the plurality of processing units are caused to perform their associated at least one predetermined operation in a particular order to perform the first function for the first packet.
In some embodiments, the at least one controller is configured to execute a compilation process for compiling the first function to be performed by the second circuitry.
In some embodiments, the at least one controller is configured to: the first circuitry is instructed to perform a first function on a data packet received at the first interface before the compilation process is completed.
In some embodiments, the compilation process for compiling the first function to be performed by the second circuitry is performed by the host device, wherein the at least one controller is configured to determine that the compilation process has been completed in response to receiving an indication from the host device that the compilation process is completed.
In some embodiments, the network interface device comprises a processing pipeline for processing data packets received at the first interface, wherein the processing pipeline comprises a plurality of components, each configured to perform one of a plurality of functions on data packets received at the first interface, wherein a first component of the plurality of components is configured to provide the first function when provided by the first circuitry, and wherein a second component of the plurality of components is configured to provide the first function when provided by the second circuitry.
In some embodiments, the at least one controller is configured to control the second circuitry to begin performing the first function with respect to the data packet received at the first interface by inserting a second one of the plurality of components into the processing pipeline.
In some embodiments, the at least one controller is configured to control the first circuit to stop performing the first function on the data packet received at the first interface by removing a first component of the plurality of components from the processing pipeline in response to said determination that the compilation process for the first function to be performed in the second circuit is complete.
In some embodiments, the at least one controller is configured to control the second circuitry to begin performing the first function on the data packet received at the first interface by sending a control message via the processing pipeline to initiate a second component of the plurality of components.
In some embodiments, the at least one controller is configured to control the first circuit to stop performing the first function on the data packet received at the first interface by sending a control message via the processing pipeline to deactivate the first component of the plurality of components, in response to said determination that the compilation process for the first function to be performed in the second circuit is complete.
In some embodiments, a first component of the plurality of components is configured to provide a first function with respect to packets of a first data stream passing through the processing pipeline, wherein a second component of the plurality of components is configured to provide the first function with respect to packets of a second data stream passing through the processing pipeline.
In some embodiments, the first function includes filtering the data packets.
In some embodiments, the first interface is configured to receive a data packet from a network.
In some embodiments, the first interface is configured to receive a data packet from a host device.
In some embodiments, the compile time of the first function of the second circuit is greater than the compile time of the first function of the first circuit.
According to another aspect, there is provided a method comprising: receiving a data packet at a first interface of a network interface device; performing, in a first circuit of a network interface device, a first function on a data packet received at a first interface; wherein the first circuit is configured to perform a first function on a data packet received at the first interface during a compilation process for the first function to be performed in the second circuit, the method comprising: determining that a compilation process for a first function to be performed in a second circuit is complete; and in response to the determination, control second circuitry of the network interface device to begin performing a first function on the data packet received at the first interface.
According to another aspect, there is provided a non-transitory computer readable medium comprising program instructions for causing a data processing system to perform a method comprising: receiving a data packet at a first interface of a network interface device; performing, in a first circuit of the network interface device, a first function on the data packet received at the first interface, wherein the first circuit is configured to perform the first function on data packets received at the first interface during a compilation process for the first function to be performed in a second circuit; determining that the compilation process for the first function to be performed in the second circuit is complete; and, in response to the determination, controlling the second circuit of the network interface device to begin performing the first function on data packets received at the first interface.
According to another aspect, a non-transitory computer readable medium is provided that includes program instructions for causing a data processing system to: performing a compilation process to compile a first function to be performed by a second circuit of the network interface device; prior to completion of the compilation process, sending a first instruction to cause first circuitry of the network interface device to perform a first function on a data packet received at a first interface of the network interface device; a second instruction is sent to cause the second circuitry to begin performing the first function on the data packet received at the first interface after the compilation process is complete.
In some embodiments, the non-transitory computer readable medium comprises program instructions for causing a data processing system to perform a further compilation process to compile a first function to be performed by the first circuit, wherein the compilation process takes longer than the further compilation process.
In some embodiments, the data processing system includes a host device, wherein the network interface device is configured to interface the host device with a network.
In some embodiments, the data processing system includes a network interface device, wherein the network interface device is configured to interface the host device with a network.
In some embodiments, a data processing system includes a host device and a network interface device, wherein the network interface device is configured to interface the host device with a network.
In some embodiments, the first function comprises filtering data packets received at the first interface from the network.
In some embodiments, a non-transitory computer readable medium includes program instructions for causing a data processing system to: a third instruction is sent to cause the first circuitry to stop performing the function on the data packet received at the first interface after the compilation process is complete.
In some embodiments, a non-transitory computer readable medium includes program instructions for causing a data processing system to: sending instructions to cause second circuitry to perform a first function on packets of a first data stream; and sending instructions to cause the first circuit to stop performing the first function on the data packets of the first data stream.
In some embodiments, the first circuitry comprises at least one central processing unit, wherein prior to completion of the second compilation process, each of the at least one central processing unit is configured to perform the first function on the at least one data packet received at the first interface.
In some embodiments, the second circuit comprises a field programmable gate array configured to begin performing the first function on a data packet received at the first interface.
In some embodiments, the second circuit comprises a hardware module comprising a plurality of processing units, each processing unit associated with at least one predetermined operation, wherein the data packet received at the first interface comprises a first data packet, wherein the hardware module is configured to perform the first function on the first data packet, after completion of the second compilation process, by each of at least some of the plurality of processing units performing its respective at least one operation on the first data packet.
In some embodiments, the first circuit comprises a hardware module comprising a plurality of processing units configured to provide the first function for the data packet, each processing unit associated with at least one predetermined operation, wherein the data packet received at the first interface comprises the first data packet, wherein the hardware module is configured to perform the first function for the first data packet by each of at least some of the plurality of processing units performing its respective at least one operation for the first data packet before the second compilation process is complete.
In some embodiments, the compilation process includes allocating each of a plurality of processing units of the second circuit to perform at least one operation associated with one of a plurality of processing stages in a series of computer code instructions in a particular order.
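The allocation step described here can be pictured as matching, in program order, each processing stage of the compiled code to a hardware unit whose processing type suits that stage. The descriptors and the greedy matcher below are invented for illustration; a real compiler would also respect the routing constraints of the hardware module.

    #include <stdbool.h>
    #include <stdio.h>

    enum ptype { PT_PACKET_ACCESS, PT_TABLE_LOOKUP, PT_LOGIC };

    struct stage { const char *name; enum ptype needs; };  /* one per code stage    */
    struct unit  { enum ptype provides; bool used; };      /* one per hardware unit */

    /* Assign each stage, in order, to a free unit of a suitable processing type. */
    static int allocate(const struct stage *stages, int ns, struct unit *units, int nu)
    {
        for (int s = 0; s < ns; s++) {
            int chosen = -1;
            for (int u = 0; u < nu; u++) {
                if (!units[u].used && units[u].provides == stages[s].needs) {
                    chosen = u;
                    units[u].used = true;
                    break;
                }
            }
            if (chosen < 0)
                return -1;               /* no suitable unit: allocation fails     */
            printf("stage '%s' -> processing unit %d\n", stages[s].name, chosen);
        }
        return 0;
    }

    int main(void)
    {
        struct stage prog[] = {
            { "load header word", PT_PACKET_ACCESS },
            { "map lookup",       PT_TABLE_LOOKUP  },
            { "compare result",   PT_LOGIC         },
        };
        struct unit hw[] = {
            { PT_LOGIC, false }, { PT_PACKET_ACCESS, false },
            { PT_TABLE_LOOKUP, false }, { PT_LOGIC, false },
        };
        return allocate(prog, 3, hw, 4) ? 1 : 0;
    }

The chosen units are then wired together in pipeline order by the routing hardware, as discussed in the embodiments that follow.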
In some embodiments, the first function provided by the first circuitry is provided as a component of a processing pipeline for processing data packets received at the first interface, wherein the first function provided by the second circuitry is provided as a component of the processing pipeline.
In some embodiments, the first instruction comprises an instruction configured to cause a first component of the plurality of components to be inserted into the processing pipeline.
In some embodiments, the second instruction comprises an instruction configured to cause a second component of the plurality of components to be inserted into the processing pipeline.
In some embodiments, a non-transitory computer readable medium includes program instructions for causing a data processing system to: sending a third instruction to cause the first circuitry to stop performing the first function on the data packet received at the first interface after the compilation process is complete, wherein the third instruction comprises an instruction configured to cause a first component of the plurality of components to be removed from the processing pipeline.
In some embodiments, the first instruction includes a control message sent through the processing pipeline to activate a first one of the plurality of components.
In some embodiments, the second instruction includes a control message sent through the processing pipeline to activate a second one of the plurality of components.
In some embodiments, a non-transitory computer readable medium includes program instructions for causing a data processing system to: sending a third instruction to cause the first circuitry to stop performing a function on a data packet received at the first interface after the compilation process is complete, wherein the third instruction includes a control message through the processing pipeline to disable a first component of the plurality of components.
According to another aspect, there is provided a data processing system comprising at least one processor and at least one memory including computer program code, wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the data processing system to: perform a compilation process to compile a function to be performed by a second circuit of the network interface device; instruct a first circuit of the network interface device to perform the function with respect to data packets received at a first interface of the network interface device prior to completion of the compilation process; and, after completion of the compilation process, instruct the second circuit to begin performing the function on data packets received at the first interface.
According to another aspect, there is provided a method for implementation in a data processing system, the method comprising: performing a compilation process to compile a function to be performed by a second circuit of the network interface device; prior to completion of the compilation process, sending a first instruction to cause first circuitry of the network interface device to perform a function on a data packet received at a first interface of the network interface device; and sending a second instruction to cause the second circuitry to begin performing a function on the data packet received at the first interface after the compilation process is complete.
According to another aspect, there is provided a non-transitory computer readable medium comprising program instructions for causing a data processing system to allocate each of a plurality of processing units to perform at least one operation associated with one of a plurality of processing stages in a series of computer code instructions in a particular order, wherein the plurality of processing stages provide a first function for a first data packet received at a first interface of a network interface device, wherein each of the plurality of processing units is configured to perform one of a plurality of processing types, wherein at least some of the plurality of processing units are configured to perform different types of processing, wherein for each of the plurality of processing units, the allocating is performed in accordance with a determination that the processing unit is configured to perform a processing type suitable for performing the respective at least one operation.
In some embodiments, each process type is defined by one of a plurality of templates.
In some embodiments, the processing types include at least one of: accessing a data packet received at the network interface device; accessing a lookup table stored in a memory of the hardware module; performing a logical operation on data loaded from the data packet; and performing a logical operation on data loaded from the lookup table.
In some embodiments, two or more of at least some of the plurality of processing units are configured to perform their associated at least one operation according to a common clock signal of the hardware module.
In some embodiments, the allocating comprises allocating each of two or more of at least some of the plurality of processing units to perform its associated at least one operation within a predetermined length of time defined by a clock signal.
In some embodiments, the allocating comprises allocating two or more of the at least some of the plurality of processing units to access the first data packet within a time period of a predetermined length of time.
In some embodiments, the assigning includes assigning each of two or more of the at least some of the plurality of processing units to communicate, in response to an end of the time period of the predetermined length of time, a result of the respective at least one operation to a next processing unit.
In some embodiments, a non-transitory computer readable medium includes program instructions for causing a data processing system to allocate at least some of the plurality of processing stages to occupy a single clock cycle each.
In some embodiments, a non-transitory computer readable medium includes program instructions for causing a data processing system to allocate two or more of the plurality of processing units to perform, in parallel, the at least one operation allocated to them.
In some embodiments, the network interface device includes a hardware module that includes a plurality of processing units.
In some embodiments, a non-transitory computer readable medium includes computer program instructions to cause a data processing system to: performing a compilation process that includes the allocation; sending a first instruction to cause circuitry of the network interface device to perform a first function on a data packet received at the first interface before the compilation process is complete; and after completion of the compilation process, sending a second instruction to cause the plurality of processing units to begin performing the first function on the data packet received at the first interface.
In some embodiments, for one or more of at least some of the plurality of processing units, the assigned at least one operation comprises at least one of: loading at least one value of a first data packet from a memory of a network interface device; storing at least one value of the first data packet in a memory of the network interface device; and performing a lookup in a lookup table to determine an action to perform for the first data packet.
In some embodiments, a non-transitory computer readable medium includes computer program instructions to cause a data processing system to issue instructions to configure routing hardware of a network interface device to route a first packet between a plurality of processing units in a particular order to perform a first function on the first packet.
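The routing instruction mentioned here can be thought of as programming, for each processing unit, which unit receives its output so that a packet visits the allocated units in the required order. The next_unit table below is an invented representation of that interconnect configuration.

    #include <stdio.h>

    #define NUM_UNITS 6

    int main(void)
    {
        /* next_unit[i] = unit that receives unit i's result; -1 = pipeline exit.
         * This wiring realises the order 1 -> 4 -> 2 -> 5 and leaves units 0 and
         * 3 unused, as a compiler might after allocating a short program.        */
        int next_unit[NUM_UNITS] = { -1, 4, 5, -1, 2, -1 };

        int unit = 1;                     /* pipeline entry chosen by the compiler */
        printf("packet route:");
        while (unit != -1) {
            printf(" %d", unit);
            unit = next_unit[unit];
        }
        printf("\n");                     /* prints: packet route: 1 4 2 5         */
        return 0;
    }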
In some embodiments, the first function provided by the plurality of processing units is provided as a component of a processing pipeline for processing data packets received at the first interface.
In some embodiments, a non-transitory computer readable medium includes computer program instructions for causing a plurality of processing units to begin performing a first function on a data packet received at a first interface by causing a data processing system to issue instructions causing a component to be inserted into a processing pipeline.
In some embodiments, a non-transitory computer readable medium includes computer program instructions to cause a plurality of processing units to begin performing a first function on a data packet received at a first interface by causing a data processing system to issue instructions to cause a component to be activated in a processing pipeline.
In some embodiments, the data processing system includes a host device, wherein the network interface device is configured to interface the host device with a network.
In some embodiments, a data processing system includes a network interface device.
In some embodiments, a data processing system comprises: a network interface device; and a host device, wherein the network interface device is configured to connect the host device to a network.
According to another aspect, there is provided a data processing system comprising at least one processor and at least one memory including computer program code, wherein the at least one memory and the computer program code are configured, with the at least one processor, to cause the data processing system to allocate each of a plurality of processing units to perform at least one operation associated with one of a plurality of processing stages in a series of computer code instructions in a particular order, wherein the plurality of processing stages provide a first function with respect to a first data packet received at a first interface of a network interface device, wherein each of the plurality of processing units is configured to perform one of a plurality of processing types, wherein at least some of the plurality of processing units are configured to perform different processing types, wherein for each of the plurality of processing units, the allocation is performed in dependence on a determination that the processing unit is configured to perform a type of processing suitable for performing the respective at least one operation.
According to another aspect, there is provided a method comprising: assigning each of the plurality of processing units to perform at least one operation associated with one of a plurality of processing stages in the series of computer code instructions in a particular order, wherein the plurality of processing stages provide a first function for a first data packet received at a first interface of the network interface device, wherein each of the plurality of processing units is configured to perform one of a plurality of processing types, wherein at least some of the plurality of processing units are configured to perform different types of processing, wherein for each of the plurality of processing units the assigning is performed dependent on a determination that the processing unit is configured to perform one type of processing suitable for performing the respective at least one operation.
The processing units of the hardware modules have been described as performing their type of operation in a single step. However, those skilled in the art will recognize that this feature is merely a preferred feature and is not necessary or essential to the functioning of the invention.
According to one aspect, there is provided a method comprising: receiving, at a compiler, a bitfile description and a program, the bitfile description including a description of a route for a portion of a circuit; and compiling the program by using the bit file description to output a bit file for the program.
The method may include configuring at least a portion of the circuitry using the bit file to perform a function associated with the program.
The bit file description may include information about routing between the plurality of processing units of the portion of the circuit.
The bit file description may include routing information for at least one of the plurality of processing units, the routing information indicating at least one of: one or more other processing units to which data may be output; and one or more other processing units from which data may be received.
The bit file description may include routing information indicating one or more routes between two or more respective processing units.
The bit file description may include information that only indicates the routes that may be used by the compiler when compiling the program to provide the bit file for the program.
The bit file may comprise information indicating, for each processing unit, at least one of: one or more of the one or more other processing units in the bit file description for the respective processing unit from which input is provided; and one or more of the one or more other processing units in the bit file description to which output is provided.
The portion of circuitry may comprise at least a portion of a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predetermined type of operation executable in a single step, at least some of the plurality of processing units being associated with a different predetermined type of operation, the bit file description comprising information about routing between at least some of the plurality of processing units, wherein the method may comprise using the bit file to cause hardware to interconnect at least some of the plurality of processing units to provide a first data processing pipeline for processing one or more of a plurality of data packets to perform a first function on the one or more of the plurality of data packets.
The bit file description may describe at least a part of an FPGA.
The bit file description may describe a part of an FPGA that is dynamically programmable.
The program may include one of an eBPF program and a P4 program.
The compiler and FPGA may be provided in the network interface device.
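Because the bit file description is said to list only the routes the compiler may use, it can be imagined as a small adjacency structure that the compiler consults before emitting a bit file for the program. The structures and the check below are purely illustrative and assume a four-unit portion of the circuit.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_UNITS 4

    /* Hypothetical bit file description: which unit-to-unit routes exist. */
    static const bool route_exists[NUM_UNITS][NUM_UNITS] = {
        /* to:    0      1      2      3   */
        {  false, true,  false, false },   /* from unit 0 */
        {  false, false, true,  true  },   /* from unit 1 */
        {  false, false, false, true  },   /* from unit 2 */
        {  false, false, false, false },   /* from unit 3 */
    };

    /* The compiler may only emit a pipeline whose hops all appear in the
     * description; otherwise compilation of the program must fail.        */
    static bool pipeline_is_routable(const int *order, int len)
    {
        for (int i = 0; i + 1 < len; i++)
            if (!route_exists[order[i]][order[i + 1]])
                return false;
        return true;
    }

    int main(void)
    {
        int good[] = { 0, 1, 2, 3 };
        int bad[]  = { 0, 2, 3 };          /* 0 -> 2 is not a described route */
        printf("0-1-2-3 routable: %s\n", pipeline_is_routable(good, 4) ? "yes" : "no");
        printf("0-2-3   routable: %s\n", pipeline_is_routable(bad, 3)  ? "yes" : "no");
        return 0;
    }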
According to another aspect, there is provided an apparatus comprising at least one processor and at least one memory including computer code for one or more programs, the at least one memory and the computer code configured to, with the at least one processor, cause the apparatus to: receiving a bit file description and a program, the bit file description including a description of a route for a portion of a circuit; and compiling the program using the bitfile description to output a bitfile for the program.
The at least one memory and the computer code may be configured, with the at least one processor, to cause the apparatus to configure at least a portion of the circuitry using the bit file to perform a function associated with the program.
The bit file description may include information about routing between the plurality of processing units of the portion of the circuit.
The bit file description may include, for at least one of the plurality of processing units, routing information indicating at least one of: one or more other processing units to which data may be output; and one or more other processing units from which data may be received.
The bit file description may include routing information indicating one or more routes between two or more respective processing units.
The bitfile description may include information that only indicates the routes that are available to a compiler when compiling the program to provide the bitfile for the program.
The bit file may include information indicating, for the respective processing unit, at least one of: one or more of the one or more other processing units in the bit file description of the respective processing unit from which input is provided; and one or more of the one or more other processing units in the bit file description to which output is provided.
The portion of the circuit may comprise at least a portion of a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predetermined type of operation that can be performed in a single step, at least some of the plurality of processing units being associated with different predetermined operation types, the bit file description including information about routing between at least some of the plurality of processing units, wherein the at least one memory and the computer code are configured to, with the at least one processor, cause the apparatus to use the bit file to cause hardware to interconnect at least some of the plurality of processing units to provide a first data processing pipeline for processing one or more of the plurality of data packets to perform a first function on the one or more of the plurality of data packets.
The bit file description may describe at least a part of an FPGA.
The bit file description may describe a part of an FPGA that is dynamically programmable.
The program may include one of an eBPF program and a P4 program.
According to another aspect, there is provided a network interface device comprising: a first interface configured to receive a plurality of data packets; a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predetermined type of operation that can be performed in a single step; a compiler configured to receive a bitfile description and a program, the bitfile description including a description of a route for at least a portion of the configurable hardware module, and compile the program using the bitfile description to output a bitfile for the program, wherein the hardware module is configurable using the bitfile to perform a first function associated with the program.
The network interface device may be used to connect a host device to a network.
At least some of the plurality of processing units may be associated with different predetermined operation types.
The hardware module may be configured to interconnect at least some of the plurality of the processing units to provide a first data processing pipeline for processing one or more of the plurality of data packets to perform a first function on the one or more of the plurality of data packets.
In some embodiments, the first function comprises a filtering function. In some embodiments, the functions include at least one of tunneling, encapsulation, and routing functions. In some embodiments, the first function comprises an extended berkeley packet filtering function.
In some embodiments, the first function comprises a distributed denial of service cleanup operation.
In some embodiments, the first function comprises firewall operations.
In some embodiments, the first interface is configured to receive a first data packet from a network.
In some embodiments, the first interface is configured to receive a first data packet from a host device.
In some embodiments, two or more of at least some of the plurality of processing units are configured to perform their associated at least one predetermined operation in parallel.
In some embodiments, two or more of at least some of the plurality of processing units are configured to perform their associated predetermined type of operation according to a common clock signal of the hardware modules.
In some embodiments, each of two or more of the at least some of the plurality of processing units is configured to perform its associated predetermined type of operation within a predetermined length of time defined by a clock signal.
In some embodiments, two or more of the at least some of the plurality of processing units are configured to: accessing the first data packet for a period of a predetermined length of time; and transmitting a result of the corresponding at least one operation to a next processing unit in response to an end of the predetermined length of time.
In some embodiments, the results include at least one or more of: at least one value from one or more of the plurality of data packets; an updated value of the mapping state; and metadata.
In some embodiments, each of the plurality of processing units includes an application specific integrated circuit configured to perform at least one operation associated with the respective processing unit.
In some embodiments, each processing unit comprises a field programmable gate array. In some embodiments, each processing unit includes any other type of soft logic.
In some embodiments, at least one of the plurality of processing units comprises digital circuitry and a memory storing state related to processing performed by the digital circuitry, wherein the digital circuitry is configured to: predetermined types of operations associated with the respective processing units are performed in communication with the memory.
In some embodiments, the network interface device includes a memory accessible by two or more of the plurality of processing units, the memory configured to store a state associated with the first data packet, wherein during execution of the first function by the hardware module, the two or more of the plurality of processing units are configured to access and modify the state.
In some embodiments, a first one of the at least some of the plurality of processing units is configured to stall during access of the state value by a second one of the plurality of processing units.
In some embodiments, one or more of the plurality of processing units are independently configurable to perform operations specific to the respective pipelines based on their associated predetermined operation types.
In some embodiments, the hardware module is configured to receive an instruction and, in response to the instruction, perform at least one of: interconnecting at least some of said plurality of said processing units to provide a data processing pipeline for processing one or more data packets; causing one or more processing units of the plurality of processing units to perform their associated predetermined type of operation on the one or more data packets; adding one or more of the plurality of processing units to a data processing pipeline; and removing one or more of the plurality of processing units from the data processing pipeline.
In some embodiments, the predetermined operation comprises at least one of: loading at least one value of a first data packet from memory; storing at least one value of the data packet in a memory; and performing a query in a lookup table to determine an action to perform on the data packet.
In some embodiments, the hardware module is configured to receive an instruction, wherein the hardware module is configurable to interconnect at least some of the plurality of the processing units in response to the instruction to provide a data processing pipeline for processing one or more of the plurality of data packets, wherein the instruction comprises a data packet sent through a third processing pipeline.
In some embodiments, one or more of at least some of the plurality of processing units may be configured to: in response to the instruction, the selected operation of its associated predetermined operation type is performed on one or more of the plurality of data packets.
In some embodiments, the plurality of components includes a second component of the plurality of components configured to provide the first function in circuitry distinct from the hardware module, wherein the network interface device includes at least one controller configured to cause a data packet passing through the processing pipeline to be processed by a first component of the plurality of components and by the second component of the plurality of components.
In some embodiments, a network interface device includes at least one controller configured to issue an instruction to cause a hardware module to begin performing a first function on a data packet, wherein the instruction is configured to cause a first component of a plurality of components to be inserted into a processing pipeline.
In some embodiments, a network interface device includes at least one controller configured to issue an instruction to cause a hardware module to begin performing a first function on a data packet, wherein the instruction includes a control message sent through a processing pipeline and configured to cause a first one of a plurality of components to activate.
In some embodiments, for one or more of at least some of the plurality of processing units, the associated at least one operation comprises at least one of: loading at least one value of a first data packet from a memory of a network interface device; storing at least one value of the first packet in a memory of the network interface device; and performing a look-up table lookup to determine an action to perform for the first packet.
In some embodiments, one or more of at least some of the plurality of processing units are configured to pass at least one result of its associated at least one predetermined operation to a next processing unit in the first processing pipeline, the next processing unit being configured to perform a next predetermined operation in dependence on the at least one result.
In some embodiments, each of the different predetermined operation types is defined by a different template.
In some embodiments, the predetermined operation type includes at least one of: accessing the data packet; accessing a lookup table stored in a memory of the hardware module; performing a logical operation on data loaded from the data packet; and performing a logical operation on data loaded from the lookup table.
In some embodiments, the hardware module comprises routing hardware, wherein the hardware module is configurable to interconnect at least some of the plurality of processing units to provide the first data processing pipeline by configuring the routing hardware to route packets between the plurality of processing units in a particular order defined by the first data processing pipeline.
In some embodiments, the hardware module may be configured to interconnect at least some of the plurality of said processing units to provide a second data processing pipeline for processing one or more of the plurality of data packets to perform a second function different from the first function.
In some embodiments, the hardware module may be configured to interconnect at least some of the plurality of processing units to provide a second data processing pipeline after interconnecting at least some of the plurality of processing units to provide a first data processing pipeline.
In some embodiments, the network interface device includes further circuitry separate from the hardware module and configured to perform the first function on one or more of the plurality of data packets.
In some embodiments, the further circuitry comprises at least one of: a field programmable gate array; and a plurality of central processing units.
In some embodiments, the network interface device comprises at least one controller, wherein the further circuitry is configured to perform the first function with respect to the data packet during a compilation process for the first function to be performed in the hardware module, wherein the at least one controller is configured to control the hardware module to begin performing the first function with respect to the data packet in response to completion of the compilation process.
In some embodiments, the further circuitry comprises a plurality of central processing units.
In some embodiments, the at least one controller is configured to control the further circuitry to stop performing the first function on the data packet in response to a determination that the compilation process for the first function to be performed in the hardware module is complete.
In some embodiments, the network interface device comprises at least one controller, wherein the hardware module is configured to perform a first function on a data packet during a compilation process for the first function to be performed in further circuitry, wherein the at least one controller is configured to determine that the compilation process for the first function to be performed in the further circuitry has been completed, and in response to the determination, control the further circuitry to begin performing the first function on a data packet.
In some embodiments, the further circuitry comprises a field programmable gate array.
In some embodiments, the at least one controller is configured to control the hardware module to stop performing the first function on the data packet in response to the determination that the compilation process for the first function to be performed in the further circuitry has been completed.
In some embodiments, the network interface device includes at least one controller configured to perform a compilation process to provide the first function to be performed in the hardware module.
In some embodiments, the compilation process includes providing instructions to provide a control plane interface in the hardware module that is responsive to the control message.
According to another aspect, there is provided a computer-implemented method, the method comprising: determining routing information for at least a portion of a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predetermined type of operation that can be performed in a single step, at least some of the plurality of processing units being associated with different predetermined operation types, the routing information providing information about available routes between at least some of the plurality of processing units.
The configurable hardware module may include a substantially static portion and a substantially dynamic portion, the determining including determining routing information for the substantially dynamic portion.
Determining the routing information for the substantially dynamic portion may include determining a route in the substantially dynamic portion that is used by one or more of the processing units in the substantially static portion.
The determining may include analyzing a bit file description of at least a portion of the configurable hardware module to determine the routing information.
According to another aspect, there is provided a non-transitory computer readable medium comprising program instructions for: determining routing information for at least a portion of a configurable hardware module comprising a plurality of processing elements, each processing element associated with a predetermined type of operation that can be performed in a single step, at least some of the plurality of processing elements being associated with a different predetermined type of operation, the routing information providing information about available routes between at least the plurality of processing elements.
A computer program comprising program code means adapted to perform the method may also be provided. The computer program may be stored and/or otherwise embodied by a carrier medium.
In the above, many different embodiments have been described. It is to be understood that other embodiments may be provided by a combination of any two or more of the above embodiments.
Various other aspects and further embodiments are also described in the following detailed description and the appended claims.
Drawings
Some embodiments will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 shows a schematic diagram of a data processing system connected to a network;
FIG. 2 shows a schematic diagram of a data processing system including a filtering operation application configured to run on a host computing device in user mode;
FIG. 3 depicts a schematic diagram of a data processing system including a filtering operation configured to run in kernel mode on a host computing device;
FIG. 4 shows a schematic diagram of a network interface device including multiple CPUs performing functions on packets;
FIG. 5 shows a schematic diagram of a network interface device including a field programmable gate array running an application for performing functions on data packets;
FIG. 6 shows a schematic diagram of a network interface device including hardware modules for performing functions on data packets;
FIG. 7 shows a schematic diagram of a network interface device including a field programmable gate array and at least one processing unit for performing functions on data packets;
FIG. 8 illustrates a method implemented in a network interface device, in accordance with some embodiments;
FIG. 9 illustrates a method implemented in a network interface device, in accordance with some embodiments;
FIG. 10 shows an example of processing a data packet by a series of programs;
FIG. 11 shows an example of processing a data packet by a plurality of processing units;
FIG. 12 shows an example of processing a data packet by a plurality of processing units;
FIG. 13 shows an example of a pipeline of processing stages for processing data packets;
FIG. 14 shows an example of a slice architecture with multiple pluggable components;
FIG. 15 shows an example representation of the arrangement and order of processing by a plurality of processing units;
FIG. 16 illustrates an example method of compiling functions;
FIG. 17 shows an example of a stateful processing unit;
FIG. 18 shows an example of a stateless processing unit;
FIG. 19 illustrates a method of some embodiments;
FIGS. 20a and 20b illustrate routing between slices in an FPGA; and
FIG. 21 schematically shows partitioning on an FPGA.
Detailed description of the preferred embodiments
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art.
The general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
When data is to be transferred between two data processing systems over a data channel, such as a network, each data processing system has an appropriate network interface to allow it to communicate across the channel. The network is typically based on Ethernet technology. Data processing systems that communicate over a network are equipped with network interfaces that are capable of supporting the physical and logical requirements of the network protocol. The physical hardware component of a network interface is referred to as a network interface device or Network Interface Card (NIC).
Most computer systems contain an Operating System (OS) through which user-level applications communicate with the network. A portion of the operating system, referred to as the kernel, includes a protocol stack for translating commands and data between applications and device drivers specific to the network interface device. The device driver may directly control the network interface device. By providing these functions in the operating system kernel, the complexity and differences of the network interface device can be hidden from the user-level application. Network hardware and other system resources (e.g., memory) may be securely shared by many applications and the system may be protected from errant or malicious applications.
A typical data processing system 100 for performing transactions over a network is shown in figure 1. The data processing system 100 includes a host computing device 101 coupled to a network interface device 102, the network interface device 102 being configured to interface the host to a network 103. Host computing device 101 includes an operating system 104 that supports one or more user-level applications 105. The host computing device 101 may also include a network protocol stack (not shown). For example, the protocol stack may be a component of an application, a library linked to an application, or a library provided by an operating system. In some embodiments, more than one protocol stack may be provided.
The network protocol stack may be a Transmission Control Protocol (TCP) stack. Application 105 can send and receive TCP/IP messages by opening a socket and reading data from and writing data to the socket, and operating system 104 causes the messages to be transmitted across the network. For example, an application may invoke a system call (syscall) to transfer data through a socket and onward to the network 103 via the operating system 104. Such an interface for transmitting messages may be referred to as a messaging interface.
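By way of illustration only, the following sketch in C shows the messaging interface just described from the application's point of view: a socket is opened and data is handed to the kernel protocol stack with a system call. The address, port, and function name are illustrative assumptions and do not form part of this disclosure.

```c
/* Minimal sketch of the messaging interface described above: an
 * application opens a TCP socket and writes data, and the operating
 * system's protocol stack transmits it over the network.
 * Standard POSIX calls only; address and port are illustrative. */
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

int send_message(const char *payload, size_t len)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* TCP socket */
    if (fd < 0)
        return -1;

    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port = htons(8080);                 /* illustrative port */
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        close(fd);
        return -1;
    }

    /* send() is a system call; the kernel protocol stack performs the
     * TCP/IP processing and hands frames to the network interface. */
    ssize_t sent = send(fd, payload, len, 0);
    close(fd);
    return sent < 0 ? -1 : 0;
}
```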
Instead of implementing the stack in the host 101, some systems offload the protocol stack to the network interface device 102. For example, where the stack is a TCP stack, the network interface device 102 may include a TCP Offload Engine (TOE) to perform TCP protocol processing. By performing the protocol processing in the network interface device 102 rather than in the host computing device 101, the need for a processor of the host system 101 may be reduced. Data to be transmitted over the network may be sent by application 105 through a TOE-enabled virtual interface driver, partially or fully bypassing the kernel TCP/IP stack. Therefore, only data sent along the fast path needs to be formatted to meet the requirements of the TOE driver.
The host computing device 101 may include one or more processors and one or more memories. In some embodiments, the host computing device 101 and the network interface device 102 may communicate via a bus, for example, a peripheral component interconnect express bus (PCIe bus).
During operation of the data processing system, data to be transmitted onto the network may be transmitted from the host computing device 101 to the network interface device 102 for transmission. In one example, the data packet may be transmitted by the host processor directly from the host to the network interface device. The host may provide data to one or more buffers 106 located on the network interface device 102. The network interface device 102 may then prepare the data packets and send them over the network 103.
Alternatively, the data may be written to the buffer 107 in the host system 101. The network interface device may then retrieve the data from the buffer 107 and transmit over the network 103.
In both cases, data is temporarily stored in one or more buffers prior to transmission over the network. Data sent over the network may be returned (backtracked) to the host.
When sending and receiving data packets over the network 103, there are many processing tasks that can be represented as operations on data packets to be sent over the network or on data packets received over the network. For example, filtering may be performed on received data packets to protect the host system 101 from distributed denial of service (DDOS) attacks. Such filtering may be performed by simple packet inspection or by extended Berkeley Packet Filter (eBPF) programs. As another example, encapsulation and forwarding may be performed on data packets to be transmitted over network 103. These processes can consume many CPU cycles and are burdensome for conventional OS architectures.
Referring to fig. 2, one manner in which a filtering operation or other packet processing operation may be implemented in host system 220 is shown. The processes performed by host system 220 are shown as being performed in user space or kernel space. In the kernel space there is a receive path for transferring data packets received at the network interface device 210 from the network to the terminating application 250. The receive path includes a driver 235, a protocol stack 240, and a socket 245. The filtering operation 230 is implemented in user space. Incoming data provided by the network interface device 210 to the host system 220 bypasses the kernel (where protocol processing would otherwise occur) and is provided directly to the filtering operation 230.
The filtering operation 230 is configured with a virtual interface (which may be an Ethernet Fabric Virtual Interface (EFVI) or a Data Plane Development Kit (DPDK) interface or any other suitable interface) for exchanging data packets with other units in the host system 220. The filtering operation 230 may perform DDOS clean-up and/or other forms of filtering. The DDOS clean-up process may be performed on all data packets that are readily identified as DDOS candidates, e.g., sampled data packets, data packet copies, and packets that have not yet been classified. Packets that are not passed to the filtering operation 230 may be passed directly from the network interface to the driver 235. Operation 230 may provide an extended Berkeley Packet Filter (eBPF) for performing the filtering. If a received packet passes the filtering provided by operation 230, operation 230 is configured to re-inject the packet into the receive path in the kernel for further processing. Specifically, the data packet is provided to the driver 235 or the stack 240. The data packet is then protocol processed by the protocol stack 240. The data packet is then passed to the socket 245 associated with the terminating application 250. Terminating application 250 issues a recv() call to retrieve the packet from the buffer of the associated socket.
However, there are several problems with this approach. First, a filtering operation 230 runs on the host CPU. To run the filtering 230, the host CPU must process the packets at the rate at which they are received from the network. This may occupy a significant amount of the processing resources of the host CPU in cases where the rate of sending and receiving data from the network is high. The high data flow rate to the filtering operation 230 may result in a large consumption of other limited resources, such as I/O bandwidth and internal memory/cache bandwidth.
In order to re-inject the packet into the kernel, it is necessary to provide the filtering operation 230 with a privileged API for performing the re-injection. The re-injection process can be cumbersome and requires attention to the ordering of the packets. To perform the re-injection, operation 230 may in many cases require a dedicated CPU core.
The steps of providing data to the operation and re-injecting it require the data to be copied to or from memory. This copying is a resource burden on the system.
Similar problems may arise when other types of operations are provided instead of filtering data to be sent/received over the network.
Certain operations (e.g., DPDK type of operation) may require forwarding of the processed data packet back to the network.
Referring to fig. 3, another method is shown. Like elements are labeled with like reference numerals. In this example, an additional layer called the express data path (XDP) 310 is inserted into the send and receive paths in the kernel. Extensions to XDP 310 allow it to be inserted into the transmit path. An XDP helper allows data packets to be sent (for example, as a result of a receive operation). XDP 310 is inserted at the driver level of the operating system and allows programs to execute at this level to perform operations on packets received from the network before those packets are processed by the protocol stack 240. XDP 310 also allows programs to execute at this level in order to perform operations on data packets to be sent over the network. Thus, eBPF programs and other programs may run in both the transmit and receive paths.
As shown in FIG. 3, a filtering operation 320 can be inserted into XDP from user space to form a program 330 that is part of XDP 310. The operation 320 is inserted using the XDP control plane so as to be executed on the data receive path, providing a program 330 that performs a filtering operation (e.g., DDOS clean-up) on the data packets on the receive path. Such a program 330 may be an eBPF program.
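By way of illustration only, the following sketch shows the general shape of an XDP/eBPF filter program of the kind that program 330 may be, written in restricted C for a clang/libbpf toolchain. The drop criterion (a single blocked source address) and all names and values are illustrative assumptions and do not form part of this disclosure.

```c
/* Minimal XDP filter in the style of program 330: inspect the IPv4
 * source address and drop packets from one (illustrative) blocked
 * address; everything else is passed up to the kernel stack. */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define BLOCKED_SADDR 0x0100007f   /* illustrative address only */

SEC("xdp")
int ddos_scrub(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;                     /* too short to judge */
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    if (ip->saddr == BLOCKED_SADDR)
        return XDP_DROP;                     /* scrubbed before stack 240 */

    return XDP_PASS;                         /* continue to protocol stack */
}

char LICENSE[] SEC("license") = "GPL";
```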
Program 330 is shown inserted into the kernel and between driver 235 and protocol stack 240. However, in other examples, program 330 may be inserted elsewhere in the receive path in the kernel. The program 330 may be part of a separate control path that receives the data packet. Program 330 may be provided by an application program through an extension of an Application Programming Interface (API) that provides socket 245 for the application program.
Program 330 may additionally or alternatively perform one or more operations on data sent over a transmission path. The XDP 310 then calls a send function of the driver 235 to send data over the network via the network interface device 210. In this case, program 330 may provide load balancing or routing operations for packets to be sent over the network. Program 330 may provide segment re-encapsulation and forwarding operations for data packets to be sent over the network.
Program 330 may be used for firewall and virtual switching or other operations that do not require protocol termination or application processing.
One advantage of using XDP 310 in this manner is that program 330 can directly access the memory buffers handled by the driver without an intermediate copy.
In order to insert the program 330 to be executed in the kernel in this manner, it is necessary to ensure that the program 330 is secure. An insecure program inserted into the kernel poses risks such as: infinite loops that may cause a kernel crash, buffer overflows, uninitialized variables, compiler bugs, and performance issues caused by large programs.
To ensure that the program 330 is secure prior to being inserted into XDP 310 in this manner, a verification program may be run on the host system 220 to verify the security of the program 330. The verification program may be configured to ensure that no loops exist; a backward jump operation may be allowed provided it does not give rise to a loop. The verification program may be configured to ensure that program 330 has no more than a predetermined number (e.g., 4000) of instructions. The verification program may check the validity of register usage by traversing the data paths of program 330. If there are too many possible paths, program 330 is rejected as insecure to run in kernel mode. For example, if there are more than 1000 branches, program 330 may be rejected.
Those skilled in the art will appreciate that XDP is one example of a mechanism by which a secure program 330 can be installed in the kernel, and that there are other ways in which this can be accomplished.
The method discussed above with respect to fig. 3 may be as efficient as the method discussed above with respect to fig. 2 if, for example, the operation can be expressed in the secure (or sandboxed) language required to execute code in the kernel. The eBPF language may be executed efficiently on an x86 processor, and JIT (just in time) compilation techniques allow eBPF programs to be compiled into native machine code. The language is designed to be secure: for example, state is limited to map structures, which are shared data structures (e.g., hash tables); looping is restricted, although one eBPF program is allowed to make a tail call to another program; and the state space is limited.
However, in some embodiments, with this approach, the resources of the host system 220 (e.g., I/O bandwidth and internal memory/cache bandwidth, host CPU) may be consumed in large amounts. The operations on the data packet are still performed by the host CPU, requiring the host CPU to perform such operations at the rate of sending/receiving data.
Another proposal is to perform the above operations in the network interface device rather than in the host system. Doing so may free up the CPU cycles used by the host CPU in performing the operations, in addition to the I/O bandwidth and memory/cache bandwidth consumed. However, moving the execution of processing operations from the host to the hardware of the network interface device may present some challenges.
One proposal for implementing processing in network hardware is to provide a Network Processing Unit (NPU) comprising a plurality of CPUs in a network interface device, which is dedicated to packet processing and/or manipulation operations.
Referring to fig. 4, which illustrates an example of a network interface device 400, the network interface device 400 includes an array 410 of Central Processing Units (CPUs), e.g., CPU 420. The CPU is configured to perform functions such as filtering data packets sent and received from the network. Each CPU in CPU array 410 may be an NPU. Although not shown in fig. 4, the CPU may additionally or alternatively be configured to perform operations such as load balancing data packets received from the host for transmission over the network. These CPUs are dedicated to such packet processing/manipulation operations. The CPU executes an instruction set optimized for such packet processing/manipulation operations.
The network interface device 400 additionally includes memory (not shown) that is shared between the arrays 410 of CPUs and is accessible by the arrays 410 of CPUs.
The network interface device 400 includes a network Media Access Control (MAC) layer 430 for interfacing the network interface device 400 with a network. The MAC layer 430 is configured to receive data packets from and transmit data packets over the network.
Operations on packets received at the network interface device 400 are parallelized across the CPUs. As shown, when a data stream is received at the MAC layer 430, the data stream is passed to an expansion function 440, which is configured to extract packets from the data stream and distribute them over multiple CPUs in the NPU 410 so that the CPUs perform processing, such as filtering, on the packets. The expansion function 440 may parse received packets to identify the data streams to which they belong. The expansion function 440 generates, for each packet, an indication of its position in the data stream to which it belongs. The indication may be, for example, a tag. The expansion function 440 adds the corresponding indication to the associated metadata for each data packet. The associated metadata for each packet may be appended to the packet. Alternatively, the associated metadata may be passed as sideband control information. The indication is added according to the stream to which the packets belong, so that the order of the packets of any particular flow can be reconstructed.
After being processed by the multiple CPUs 410, the packets are passed to the reordering function 450, which reorders the packets of each data stream into their proper order before passing them to the host interface layer 460. The reordering function 450 may reorder the packets of a data flow by comparing the indications (e.g., tags) within the packets of that flow to reconstruct their order. The reordered packets then pass through the host interface 460 and are transmitted to the host system 220.
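By way of illustration only, the following C sketch models the tag-and-reorder scheme just described. The flow table size, structure layout, and function names are illustrative assumptions and do not form part of this disclosure.

```c
/* Simplified model of the spread/reorder scheme described above:
 * the expansion function tags each packet with a per-flow sequence
 * number, and the reordering stage only releases a packet to the
 * host interface when its tag is the next expected one for that flow. */
#include <stdint.h>
#include <stdbool.h>

#define MAX_FLOWS 1024

struct pkt_meta {
    uint32_t flow_id;   /* which flow the packet belongs to      */
    uint32_t seq_tag;   /* position within that flow (the "tag") */
};

static uint32_t next_tag_tx[MAX_FLOWS];  /* expansion: next tag to assign   */
static uint32_t next_tag_rx[MAX_FLOWS];  /* reordering: next tag to release */

/* Expansion function 440: label the packet before it is dispatched
 * to one of the CPUs. */
void tag_packet(struct pkt_meta *m, uint32_t flow_id)
{
    m->flow_id = flow_id % MAX_FLOWS;
    m->seq_tag = next_tag_tx[m->flow_id]++;
}

/* Reordering function 450: a processed packet may only be forwarded
 * to the host interface once all earlier packets of its flow have gone. */
bool ready_to_release(const struct pkt_meta *m)
{
    return m->seq_tag == next_tag_rx[m->flow_id];
}

void on_released(const struct pkt_meta *m)
{
    next_tag_rx[m->flow_id]++;   /* advance the expected tag */
}
```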
Although fig. 4 illustrates CPU array 410 operating only on packets received from the network, similar principles (including expansion and reordering) may be applied to packets received from the host for transmission over the network, with CPU array 410 performing functions (e.g., load balancing) on those packets received from the host.
The program executed by the CPU may be a compiled or transcoded version of the program that would be executed on the host CPU in the example described above with respect to fig. 3. In other words, the instruction set to be executed on the host CPU for operation is translated for execution on each CPU of the dedicated CPU array in the network interface 400.
To achieve parallelization on a CPU, multiple instances of a program are compiled and executed in parallel on multiple CPUs. Each instance of the program may be responsible for processing a different set of data packets received at the network interface device. However, when providing program functions with respect to the packets, each individual packet is processed by a single CPU. The overall effect of the execution of the parallel program may be the same as the effect of the execution of a single program (e.g., program 330) on the host CPU.
One of the dedicated CPUs may process 50 million packets per second. This operating speed may be lower than the operating speed of the host CPU. Thus, parallelization can be used to achieve the same performance as that achieved by executing an equivalent program on the host CPU. To perform the parallelization, the packets are distributed over the CPUs and then reordered after processing by the CPUs. The requirement to process the packets of each flow in sequence, along with the reordering step 450, may cause bottlenecks, increase memory resource overhead and limit the available throughput of the device. This requirement and the reordering step 450 may also increase the jitter of the device, as the processing throughput may fluctuate depending on the content of the network traffic and the degree of parallelism applicable.
One of the advantages of using such a dedicated CPU may be a short compile time. For example, it is possible to compile a filter application to run on such a CPU in less than 1 second.
As this approach expands to higher link speeds, problems may arise with the use of CPU arrays. In the near future, a host network interface may be required to achieve terabit per second speeds. When extending such CPU arrays 410 to these higher speeds, the amount of power required may become problematic.
Another proposal is to include a Field Programmable Gate Array (FPGA) in the network interface device and use the FPGA to perform operations on data packets received from the network.
Referring to fig. 5, an example of using an FPGA 510 with an FPGA application 515 in a network interface device 500 for performing operations on data packets received at the network interface device 500 is shown. Elements that are the same as elements in fig. 4 are designated with the same reference numerals.
Although fig. 5 shows an FPGA application 515 that operates only on packets received from a network, such an FPGA application 515 may be used to perform functions (e.g., load balancing and/or firewall functions) on packets received from a host for transmission over a network or return to the host or another network interface on a system.
FPGA application 515 can be provided by compiling a program written to run on FPGA 510 in a general-purpose system-level language, such as C or C++ or Scala.
The FPGA 510 may have a network interface function and an FPGA function. The FPGA functionality can provide an FPGA application 515 that can be programmed into the FPGA 510 according to the needs of the user of the network interface device. The FPGA application 515 can, for example, provide filtering of messages on the receive path from the network to the host. The FPGA application 515 may provide a firewall.
FPGA 510 can be programmed to provide FPGA application 515. Some of the network interface device functions may be implemented as "hard" logic within the FPGA 510. For example, the hard logic may be an Application Specific Integrated Circuit (ASIC) gate. The FPGA application 515 may be implemented as "soft" logic. The soft logic may be provided by programming an FPGA LUT (look-up table). Hard logic may be able to be clocked at a higher rate than soft logic.
The network interface device 500 includes a host interface 505 configured to send and receive data with the host. The network interface device 500 also includes a network Media Access Control (MAC) interface 520 configured to transmit and receive data with the network.
When a packet is received from the network at the MAC interface 520, the packet is passed to the FPGA application 515, which is configured to perform functions such as filtering on the packet. The packet (if it passes any filtering) is then passed to the host interface 505, from where it is passed to the host. Alternatively, the FPGA application 515 may determine to drop or re-transmit the packet.
One problem with this approach of using an FPGA to perform functions on a data packet is the relatively long compilation time required. FPGAs are composed of many logic elements (e.g., logic cells) that each represent a primitive logical operation, e.g., AND, OR, NOT, etc. The logic elements are arranged in a matrix with programmable interconnections. To provide functionality, these logic cells may need to operate together to implement a circuit definition and to meet synchronous clock timing constraints. Placing each logic cell and routing between cells can be a difficult challenge algorithmically. When compiling onto an FPGA with low utilization, the compilation time may be less than ten minutes. However, as an FPGA device becomes more heavily utilized by a variety of applications, the placement and routing challenges become greater, and thus the time to compile a given function onto the FPGA may increase. Adding additional logic to an FPGA that has already consumed most of its routing resources may take hours of compilation time.
One approach is to design the hardware using specific processing primitives (e.g., parse, match and action primitives). These may be used to construct a processing pipeline in which all packets go through each of the three stages. First, the data packet is parsed to construct a metadata representation of the protocol headers. Second, the data packet is flexibly matched against rules stored in a table. Finally, when a match is found, the packet is manipulated according to the table entry selected in the matching operation.
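By way of illustration only, the following C sketch models the parse/match/action flow just described. The header fields, table layout, and action set are illustrative assumptions and do not form part of this disclosure.

```c
/* Compact sketch of the parse/match/action model described above:
 * parse the header into metadata, match it against table rules, and
 * apply the action stored in the matching entry. */
#include <stdint.h>
#include <stddef.h>

struct parsed_hdr { uint32_t dst_ip; uint16_t dst_port; };

enum action { ACT_PASS, ACT_DROP, ACT_REWRITE_PORT };

struct rule {
    uint32_t    dst_ip;     /* match key                      */
    enum action act;        /* action applied on a match      */
    uint16_t    new_port;   /* parameter for ACT_REWRITE_PORT */
};

enum action process(struct parsed_hdr *h,
                    const struct rule *table, size_t n)
{
    /* Match: a linear search stands in for the flexible table lookup. */
    for (size_t i = 0; i < n; i++) {
        if (table[i].dst_ip == h->dst_ip) {
            /* Action: manipulate the packet metadata per the entry. */
            if (table[i].act == ACT_REWRITE_PORT)
                h->dst_port = table[i].new_port;
            return table[i].act;
        }
    }
    return ACT_PASS;   /* default when no rule matches */
}
```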
To implement functionality using a parse/match/action model, the P4 programming language (or a similar language) may be used. The P4 programming language is target independent, meaning that a program written in P4 can be compiled to run on different types of hardware (e.g., CPUs, FPGAs, ASICs, NPUs, etc.). Each different type of target is provided with its own compiler that maps the P4 source code onto the appropriate target switch model.
P4 may be used to provide a programming model that allows high-level programs to express packet processing operations for a packet processing pipeline. This approach is well suited to applications that naturally express their operations in a declarative style. In the P4 language, the programmer expresses the parse, match and action stages as operations to be performed on received data packets. These operations are aggregated together to allow efficient execution on the dedicated hardware. However, this declarative style may not be suitable for expressing programs of an imperative nature, such as eBPF programs.
In a network interface device, a series of eBPF programs may be required to execute serially. In this case, a chain of eBPF programs is generated, with one program calling the next. Each program can modify state and output as if the entire chain of programs had been executed in sequence. It can be challenging for a compiler to gather together all the parse, match and action steps. Moreover, even once a chain of eBPF programs has been installed, it may be necessary to install, delete, or modify programs in the chain, which may present further challenges.
To provide an example of such programs that need to be executed in sequence, reference is made to fig. 10, which shows an example of a sequence of programs e1, e2, e3 configured to process a data packet. For example, each program may be an eBPF program. Each program is configured to parse a received packet, perform a lookup of table 1010 to determine an action in a matching entry in table 1010, and then perform the action on the packet. The action may include modifying the data packet. Each eBPF program may also perform operations according to local and shared state. A data packet P0 is initially processed by eBPF program e1 and is then passed on, modified, to the next program e2 in the pipeline. The output of the program sequence is the output of the final program in the pipeline, i.e. e3.
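By way of illustration only, the following sketch shows one common way such a chain e1 → e2 → e3 can be expressed with eBPF tail calls through a program-array map. The map name, slot indices, and program body are illustrative assumptions and do not form part of this disclosure.

```c
/* Minimal sketch of a chain of eBPF programs such as e1, e2, e3:
 * each program does its own parse/lookup/action work and then
 * tail-calls the next program via a program-array map. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __type(key, __u32);
    __type(value, __u32);
} chain SEC(".maps");

#define E2_SLOT 1   /* slot where program e2 is installed */

SEC("xdp")
int e1(struct xdp_md *ctx)
{
    /* ... parse packet, look up table 1010, apply the action ... */

    bpf_tail_call(ctx, &chain, E2_SLOT);  /* hand over to e2 */

    /* Only reached if no program is installed in the slot. */
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

Installing, deleting, or modifying a program in the chain then amounts to updating a slot in the program-array map, which is part of what makes such chains convenient in software but challenging to map onto fixed hardware pipelines.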
It can be complicated for a compiler to combine the effects of each of n such programs into a single P4 program. In addition, some programming models (e.g., XDP) may require that programs be dynamically inserted and deleted quickly at any point in the program sequence in response to changing circumstances.
According to some embodiments of the present application, a network interface device is provided that includes a plurality of processing units. Each processing unit is configured to perform at least one predetermined operation in hardware. Each processing unit includes a memory that stores its own local state. Each processing unit includes digital circuitry that modifies the state. The digital circuit may be an application specific integrated circuit. Each processing unit is configured to run a program including configurable parameters to perform a respective plurality of operations. Each processing unit may be an atom. Atoms are defined by the specific programming and routing of a predetermined template. This defines its specific operational behavior and logical location in the flow provided by the connected processing units. Where the term "atom" is used in the specification, this may be understood to refer to a data processing unit configured to perform its operations in a single step. In other words, an atom performs its operation as an atomic operation.
An atom can be viewed as a collection of hardware structures that can be configured to repeatedly perform one of a series of computations, taking one or more inputs and generating one or more outputs.
Atoms are provided by hardware. The atoms may be configured by a compiler. An atom may be configured to perform a computation.
During compilation, at least some of the plurality of processing units are arranged to perform operations such that a function is performed by those processing units on data packets received at the network interface device. Each of the at least some of the plurality of processing units is configured to perform its respective at least one predetermined operation in order to perform the function with respect to a data packet. In other words, the operations that the connected processing units are configured to perform are performed on the received data packets. The operations are performed sequentially by the at least some of the plurality of processing units. Collectively, the execution of each of the plurality of operations provides the function, such as filtering, for the received data packets.
By arranging each atom to perform its respective at least one predetermined operation in order to perform a function, compile time may be reduced as compared to the FPGA application example described above with respect to fig. 5. Moreover, by performing the function using processing units dedicated to performing specific operations in hardware, the speed at which the function is performed may be increased relative to executing software on a CPU in the network interface device for each packet, as discussed above with respect to fig. 4.
Referring to FIG. 6, an example of a network interface device 600 according to an embodiment of the present application is shown. The network interface device includes a hardware module 610 configured to perform processing of data packets received at an interface of the network interface device 600. Although fig. 6 shows hardware module 610 performing functions (e.g., filtering) on packets on the receive path, hardware module 610 may also be used to perform functions (e.g., load balancing or firewalling) on packets on the transmit path received from the host.
The network interface device 600 includes a host interface 620 for transmitting and receiving data packets with a host and a network MAC interface 630 for transmitting and receiving data packets with a network.
The network interface device 600 includes a hardware module 610 that includes a plurality of processing units 640a, 640b, 640c, 640 d. Each processing unit may be an atomic processing unit. The term atom is used in this specification to refer to a processing unit. Each processing unit is configured to perform at least one operation in hardware. Each processing unit includes digital circuitry 645 configured to perform at least one operation. Digital circuit 645 may be an application specific integrated circuit. Each processing unit also includes a memory 650 that stores state information. The digital circuit 645 updates the state information when the corresponding plurality of operations are performed. In addition to local memory, each processing unit may also access shared memory 660, where shared memory 660 may also store state information accessible by each of the plurality of processing units.
The state information in the shared memory 660 of the processing units and/or the state information in the memory 650 may include at least one of: metadata passed between processing units, temporary variables, contents of data packets, contents of one or more shared mapping tables.
Together, the multiple processing units can provide the functions to be performed on the data packets received at the network interface device 600. The compiler outputs instructions to configure the hardware module 610 to perform functions on incoming data packets by arranging for at least some of the plurality of processing units to perform their respective at least one predetermined operation on each incoming data packet. This may be accomplished by linking (i.e., connecting) at least some of the processing units 640a, 640b, 640c, 640d together such that each connected processing unit will perform its respective at least one operation on each incoming data packet. Each processing unit performs at least one of its respective operations in a particular order to perform a function. The sequence may be such that two or more processing units execute in parallel (i.e., simultaneously) with each other. For example, one processing unit may read from a data packet for a period of time (defined by a periodic signal (e.g., a clock signal) of the hardware module 610) during which a second processing unit also reads data from a different location in the same data packet.
In some embodiments, the data packets are delivered to each stage represented by the processing unit in sequence. In this case, each processing unit completes its processing before passing the data packet to the next processing unit to perform its processing.
In the example shown in fig. 6, processing units 640a, 640b, and 640d are coupled together at compile-time such that each performs at least one of their respective operations to perform a function, such as filtering, on received data packets. The processing units 640a, 640b, 640d form a pipeline for processing data packets. The packets may move along the pipeline in stages, each stage having equal time periods. The time period may be defined in terms of a periodic signal or beat. The time period may be defined by a clock signal. Several cycles of the clock may define a time period for each stage of the pipeline. At the end of each repetition period, the packet moves along a stage in the pipeline. The time period may be a fixed interval. Alternatively, each time period of a stage in the pipeline may take a variable amount of time. When the last processing stage completes the operation, a signal may be generated indicating the next stage in the pipeline, which may take a variable amount of time. Stalls may be introduced at any stage of the pipeline by delaying the signal by some predetermined amount of time.
Each processing unit 640a, 640b, 640d may be configured to access shared memory 660 as part of at least one of their respective operations. Each of the processing units 640a, 640b, 640d may be configured to communicate metadata between each other as part of their respective at least one operation. Each of the processing units 640a, 640b, 640d may be configured to access data packets received from the network as part of their respective at least one operation.
In this example, processing unit 640c is not used to perform processing of received packets to provide the function, and is omitted from the pipeline.
Data packets received at network MAC layer 630 may be passed to hardware module 610 for processing. Although not shown in fig. 6, the processing performed by hardware module 610 may be part of a larger processing pipeline that provides additional functionality with respect to the data packets in addition to that provided by hardware module 610. This is shown with reference to fig. 14 and will be explained in more detail below.
The first processing unit 640a is configured to perform a first at least one operation on the data packet. This first at least one operation may include at least one of: reads, and writes shared state from the data packet in memory 660, and/or performs a table lookup to determine an action. The first processing unit 640a is then configured to produce a result from at least one operation thereof. The result may be in the form of metadata. The result may include modifications to the data packet. The results may include modifications to the shared state in memory 660. The second processing unit 640b is configured to perform at least one operation thereof with respect to the first packet according to a result of the operation performed by the first processing unit 640 a. The second processing unit 640b generates a result from at least one operation thereof and passes the result to the third processing unit 640d, which third processing unit 640d is configured to perform at least one operation thereof with respect to the first packet. The first processing unit 640a, the second processing unit 640b and the third processing unit 640d are together configured to provide functionality with respect to data packets. The packet may then be passed to host interface 620, from where it is passed to the host system.
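By way of illustration only, the following C sketch models how results (metadata, packet edits, shared-state updates) flow from one processing unit to the next in such a pipeline. All structure and function names are illustrative assumptions and do not form part of this disclosure.

```c
/* Illustrative software model of the pipeline formed by processing
 * units such as 640a, 640b and 640d: each unit performs its
 * predetermined operation and passes a result to the next unit. */
#include <stdint.h>

struct packet;        /* the data packet being processed */
struct shared_state;  /* state held in shared memory 660 */

struct result {
    uint32_t metadata;  /* values extracted or computed so far */
    int      modified;  /* whether the packet was rewritten    */
};

typedef struct result (*atom_op)(struct packet *pkt,
                                 struct shared_state *shared,
                                 struct result in);

/* The compiler effectively produces an ordered list of atoms. */
struct result run_pipeline(atom_op *atoms, int n_atoms,
                           struct packet *pkt,
                           struct shared_state *shared)
{
    struct result r = {0};
    for (int i = 0; i < n_atoms; i++)
        r = atoms[i](pkt, shared, r);  /* each stage consumes the
                                          previous stage's result */
    return r;
}
```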
Thus, it can be seen that the connected processing units form a pipeline for processing data packets received at the network interface device. The pipeline may provide processing for the eBPF program. The pipeline may provide for the processing of multiple eBPF programs. The pipeline may provide for the processing of multiple modules to be executed in sequence.
The connection of the processing units in the hardware module 610 may be performed by programming the routing functions of the presynthesized interconnect structure of the hardware module 610. The interconnect structure provides connections between the various processing units of the hardware module 610. The interconnect fabric may be programmed according to the topology supported by the fabric. Possible example topologies are discussed below with reference to fig. 15.
The hardware module 610 supports at least one bus interface. At least one bus interface receives data packets at hardware module 610 (e.g., from a host or a network). At least one bus interface outputs data packets from hardware module 610 (e.g., to a host or network). At least one bus interface receives control messages at the hardware module 610. The control messages may be used to configure the hardware module 610.
The example shown in FIG. 6 has the advantage of reduced compile time relative to the FPGA application 515 shown in FIG. 5. For example, the hardware module 610 of FIG. 6 may require less than 10 seconds to compile the filtering function. The example shown in fig. 6 has the advantage of improved processing speed compared to the example of the CPU array shown in fig. 4.
An application program may be executed in such a hardware module 610 by mapping the general program (or programs) to a presynthesized data path. The compiler builds the datapath by linking any number of processing stage instances, each of which is built by one of the presynthesized processing stage atoms.
Each atom is built from a circuit. Each circuit may be defined using RTL (register transfer level) code or a higher-level language. Each circuit is synthesized using a compiler or a toolchain. Atoms can be synthesized as hard logic and thus provided as a hard (ASIC) resource in the hardware module of the network interface device. Atoms can also be synthesized into soft logic. Atoms in soft logic may be provided with constraints that assign and maintain the location and routing information of the synthesized logic on the physical device. Atoms may be designed with configurable parameters that specify the behavior of the atom. Each parameter may be a variable, or even a sequence of operations (a microprogram), which may specify at least one operation to be performed by a processing unit during a clock cycle of the processing pipeline. The logic implementing the atoms may be clocked synchronously or asynchronously.
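By way of illustration only, the following C sketch shows one possible encoding of an atom's configuration: a template kind fixed at synthesis time, plus configurable parameters (including a short microprogram) and a routing field. All field names and sizes are illustrative assumptions and do not form part of this disclosure.

```c
/* Illustrative encoding of an atom built from a pre-synthesized
 * template: the template fixes the kind of operation available, and
 * configurable parameters select the specific behavior. */
#include <stdint.h>

enum template_kind {
    TMPL_LOGIC,          /* registers, scratchpad, stack, branches */
    TMPL_PACKET_ACCESS,  /* packet data load/store                 */
    TMPL_MAP_ACCESS      /* lookup-table access                    */
};

struct micro_op {
    uint8_t  opcode;     /* operation selected within the template */
    uint16_t operand_a;  /* e.g. packet offset or map key register */
    uint16_t operand_b;
};

struct atom_config {
    enum template_kind kind;     /* which pre-synthesized template    */
    struct micro_op program[4];  /* parameters/microprogram per cycle */
    uint8_t next_atom;           /* routing: where the result goes    */
};
```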
The atomic processing pipeline itself may be configured to operate on a periodic signal. In this case, each packet and metadata is moved one stage along the pipeline in response to each occurrence of a signal. The processing pipeline may operate in an asynchronous manner. In this case, a higher level of backpressure in the pipeline will cause each downstream stage to begin processing only when data from the upstream stage has been provided to the downstream stage.
When compiling a function to be performed by a plurality of such atoms, the sequence of computer code instructions is divided into a plurality of operations, each of which maps to a single atom. Each operation may represent a single line of disassembled instructions in the computer code. Each operation is assigned to one of the atoms, which is to perform that operation. There may be one atom per expression in the computer code instructions. Each atom is associated with an operation type, and at least one operation in the computer code instructions is selected for it to perform according to its associated operation type. For example, an atom may be pre-configured to perform a load operation from a data packet. Such an atom is therefore assigned to execute an instruction in the computer code representing a load operation from a data packet.
Each line of the computer code instructions may therefore select an atom. Thus, when a function is implemented in a hardware module containing such atoms, there may be, for example, 100 such atoms, each performing its respective operation so that the function is performed on the packet.
Each atom may be constructed from one of a set of process stage templates that determine the type of operation with which it is associated. The compilation process is configured to generate instructions to control each atom to perform a particular at least one operation based on its associated type. For example, if an atom is preconfigured to perform a packet access operation, the compilation process may assign the atom an operation to load certain information (e.g., the source ID of the packet) from the header of the packet. The compilation process is configured to send instructions to the hardware modules, wherein the atoms are configured to perform the operations assigned to them by the compilation process.
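As a purely illustrative sketch (not taken from the disclosure), the assignment of operations to preconfigured atoms described above might be modelled as follows; the type names, fields, and allocation policy are hypothetical.

#include <stddef.h>

enum atom_type { ATOM_LOGIC, ATOM_PACKET_ACCESS, ATOM_MAP_ACCESS };

struct operation {            /* one line of disassembled code          */
    enum atom_type required;  /* template type needed to execute it     */
    unsigned opcode;          /* hypothetical micro-operation selector  */
};

struct atom {
    enum atom_type type;      /* fixed when the atom was synthesized    */
    int in_use;
    unsigned opcode;          /* programmed by the compiler             */
};

/* Assign each operation to a free atom of the matching type; returns the
 * number of operations that could not be placed. */
static size_t assign_ops(struct operation *ops, size_t n_ops,
                         struct atom *atoms, size_t n_atoms)
{
    size_t unplaced = 0;
    for (size_t i = 0; i < n_ops; i++) {
        size_t j;
        for (j = 0; j < n_atoms; j++) {
            if (!atoms[j].in_use && atoms[j].type == ops[i].required) {
                atoms[j].in_use = 1;
                atoms[j].opcode = ops[i].opcode;   /* configure the atom */
                break;
            }
        }
        if (j == n_atoms)
            unplaced++;    /* atoms of this type are exhausted */
    }
    return unplaced;
}

If no free atom of the required type remains, compilation fails or the operation must be rescheduled, as discussed further below.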
The processing stage templates that specify atom behavior are logic stage templates (e.g., providing operations on registers, scratchpads, stacks, and branches), packet access stage templates (e.g., providing packet data loading and/or packet data storage), and mapping table access stage templates (e.g., specifying the mapping table lookup algorithm and mapping table size).
The packet access phase may include at least one of: reading byte sequences from a data packet, replacing one byte sequence with a different byte sequence in the data packet, inserting bytes into the data packet, and deleting bytes in the data packet.
The mapping table access phase may be used to access different types of mapping tables (e.g., lookup tables), including direct index arrays and associative arrays. The mapping table access phase may include at least one of: reading a value from a location, writing a value to a location, and replacing a value at a location in the mapping table with another value. The mapping table access phase may include a comparison operation in which a value is read from a location in the mapping table and compared with a different value. If the value read from the location is less than the different value, a first action may be performed (e.g., performing no operation, swapping the value at the location for the different value, or adding the values). Otherwise, a second action may be performed (e.g., performing no operation, swapping, or adding the values). In either case, the value read from the location may be provided to the next processing stage.
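The compare-and-act behaviour of such a mapping table access stage might be sketched as follows; this is a hedged illustration in plain C, assuming a hypothetical direct-indexed 256-entry table of 8-bit values (matching the later example) rather than any particular implementation.

#include <stdint.h>

enum map_action { ACT_NOP, ACT_SWAP, ACT_ADD };

struct map_stage {
    uint8_t table[256];    /* e.g. a direct-indexed array of 8-bit values */
};

/* Read table[index], compare with 'other', perform the first or second
 * action accordingly, and return the value that was read so it can be
 * handed to the next processing stage. */
static uint8_t map_compare(struct map_stage *m, uint8_t index, uint8_t other,
                           enum map_action if_less, enum map_action otherwise)
{
    uint8_t value = m->table[index];
    enum map_action act = (value < other) ? if_less : otherwise;

    switch (act) {
    case ACT_SWAP: m->table[index] = other;                     break;
    case ACT_ADD:  m->table[index] = (uint8_t)(value + other);  break;
    default:       /* ACT_NOP: leave the table unchanged */     break;
    }
    return value;   /* provided to the next stage in either case */
}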
Each mapping table access phase may be implemented in a stateful processing unit. Referring to FIG. 17, an example of circuitry 1700 that may be included in an atom configured to perform processing of a map table access phase is shown. The circuit 1700 may include a hash function 1710 configured to perform a hash of an input value used as an input to a lookup table. The circuitry 1700 includes a memory 1720 configured to store a state associated with an operation of an atom. The circuit 1700 includes an arithmetic logic unit 1730 configured to perform operations.
The logic stage may perform a calculation on the values provided by the previous stage. The processing units configured to implement the logic stages may be stateless processing units. Each stateless processing unit may perform simple arithmetic operations. Each processing unit may perform, for example, 8-bit operations.
Each logic stage may be implemented in a stateless processing unit. Referring to fig. 18, an example of a circuit 1800 is illustrated that may be included in an atom configured to perform the processing of a logic stage. The circuit 1800 includes an array of arithmetic logic units (ALUs) and multiplexers. The ALUs and multiplexers are arranged in a hierarchy, with the outputs of one layer of ALU processing being used by the multiplexers to provide inputs to the next layer of ALUs.
The stage pipeline implemented in the hardware module may include a first packet access stage (pkt0), then a first logic stage (logic0), then a first map access stage (map0), then a second logic stage (logic1), then a second packet access stage (pkt1), and so on. Thus, it may take the following form:
pkt0->logic0->map0->logic1->pkt1
In some examples, stage pkt0 extracts the required information from the packet. Stage pkt0 passes this information to logic0. Stage logic0 determines whether the packet is a valid IP packet. In some cases, logic0 forms and sends a mapping request to map0, which performs the mapping operation. Stage map0 may perform an update to the lookup table. Stage logic1 then collects the results from the mapping operation and decides whether to drop the packet.
In some cases, the mapping request is disabled to cover situations where no mapping operation should be performed on this packet. Without performing the mapping operation, logic0 indicates to logic1 whether the packet should be dropped based on whether the packet is a valid IP packet. In some examples, the lookup table contains 256 entries, each of which is an 8-bit value.
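As a hedged illustration only, a plain-C sketch of the kind of program such a pkt0->logic0->map0->logic1->pkt1 pipeline could implement is given below. The field offsets, the per-protocol counter map, and the drop policy are assumptions made for the sketch, not the actual program of the example, and the final packet write-back stage (pkt1) is not exercised.

#include <stdint.h>
#include <stddef.h>

#define PASS 1
#define DROP 0

static uint8_t proto_count[256];              /* map0: 256 x 8-bit entries */

static int process_packet(const uint8_t *pkt, size_t len)
{
    /* pkt0: load the fields the later stages need (assumed offsets for a
     * plain Ethernet/IPv4 frame; illustrative only). */
    if (len < 34)
        return DROP;
    uint16_t ethertype = (uint16_t)((pkt[12] << 8) | pkt[13]);
    uint8_t  version   = pkt[14] >> 4;
    uint8_t  proto     = pkt[23];

    /* logic0: is this a valid IPv4 packet? If not, no map request is made. */
    if (ethertype != 0x0800 || version != 4)
        return DROP;

    /* map0: read and update the lookup table */
    uint8_t count = proto_count[proto];
    proto_count[proto] = (uint8_t)(count + 1);

    /* logic1: collect the map result and decide whether to drop (an
     * illustrative policy, not the policy of the example). */
    return (count < 0xff) ? PASS : DROP;
}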
The example described includes only five stages. However, as described above, more may be used. Furthermore, the operations need not all be performed sequentially, but certain operations on the same packet may be performed simultaneously by different processing units.
The hardware module 610 shown in fig. 6 illustrates an atomic single pipeline for performing packet related functions. However, hardware module 610 may include multiple pipelines for processing data packets. Each of the plurality of pipelines may perform a different function on the data packet. Hardware module 610 may be configured to interconnect a first set of atoms of hardware module 610 to form a first data processing pipeline. Hardware module 610 may also be configured to interconnect a second set of atoms of hardware module 610 to form a second data processing pipeline.
To compile a function to be implemented in a hardware module comprising a plurality of processing units, a series of steps starting from a sequence of computer code may be performed. A compiler, which may run on a processor of the host device or of the network interface device, may access the disassembled sequence of computer code.
First, the compiler is configured to divide the sequence of computer code instructions into separate stages. Each stage may include operations according to one of the processing stage templates described above. For example, one stage may provide reading of a data packet. One stage may provide updating of mapping table data. Another stage may make a decision as to whether to pass or drop the packet. The compiler assigns each of a plurality of operations represented by the code to one of a plurality of stages.
Second, the compiler is configured to allocate each processing stage determined from the code to be executed by a different processing unit. This means that the respective at least one operation of each processing stage is performed by a different processing unit. The output of the compiler may then be used to cause the processing units to perform the operations of each stage in a particular order so as to perform the function.
The output of the compiler includes generated instructions for causing the processing units of the hardware module to carry out the operations associated with each processing stage.
The output of the compiler may also be used to generate logic in the hardware module in response to control messages used to configure the hardware module 610. Such control messages are described in more detail below with reference to fig. 14.
The compiling process for compiling the function to be executed on the network interface device 600 may be performed in dependence upon a determination that the program providing the function is safe for execution in the kernel of the host device. The determination that the program is safe may be performed by a suitable verifier as described above with reference to fig. 3. Once it is determined that the program is safe for execution in the kernel, the program may be compiled for execution in the network interface device.
Referring to fig. 15, a representation of at least some of the plurality of processing units performing their respective at least one operation to perform a function on a data packet is shown. Such a representation may be generated by a compiler and used to configure the hardware module to perform the function. The representation indicates the order in which operations may be performed and how some processing units perform their operations in parallel.
Representation 1500 is in the form of a list having rows and columns. Some entries of the list show atoms, such as atom 1510a, configured to perform their respective operations. The row to which a processing unit belongs indicates the timing of operations performed by the processing unit for a particular data packet. Each row may correspond to a single time period represented by one or more cycles of the clock signal. The processing units belonging to the same row perform their operations in parallel.
The inputs to the logic stage are provided in row 0 and the computation flows forward to the following row. By default, an atom receives the processing result of the atom in the same column as the atom but in the row above. For example, atom 1510b receives the results of the processing from atom 1510a and performs its respective processing based on these results.
When using local routing resources, an atom may also access the output of an atom in the previous row whose column number differs by no more than two. For example, atom 1510d can receive results from the processing performed by atom 1510c.
When using global routing resources, an atom can also access the output of an atom in either of the two preceding rows, in any column. For example, atom 1510f may receive results from the processing performed by atom 1510e.
These constraints on routing between atoms are given as examples, and other constraints may be applied. More restrictive constraints may make the routing of information between atoms easier to implement, whereas less restrictive constraints may make scheduling easier. Compiling a function into the hardware module will fail if the atoms of a given type (e.g., mapping, logic, or packet access) are exhausted or if the required routes between atoms cannot be provided.
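A minimal sketch of how a compiler might classify the example routing constraints of FIG. 15 is given below; the placement representation and the function are hypothetical and assume the default/local/global rules just described.

#include <stdlib.h>

struct placement { int row, col; };           /* position in the grid of FIG. 15 */

enum route_kind { ROUTE_NONE, ROUTE_DEFAULT, ROUTE_LOCAL, ROUTE_GLOBAL };

/* Classify which routing resource (if any) lets 'dst' consume the output
 * of 'src' under the example constraints described above. */
static enum route_kind route_needed(struct placement src, struct placement dst)
{
    int rows_back = dst.row - src.row;        /* src must be in an earlier row */
    int col_delta = abs(dst.col - src.col);

    if (rows_back == 1 && col_delta == 0) return ROUTE_DEFAULT; /* row above, same column  */
    if (rows_back == 1 && col_delta <= 2) return ROUTE_LOCAL;   /* nearby column           */
    if (rows_back >= 1 && rows_back <= 2) return ROUTE_GLOBAL;  /* any column, two rows up */
    return ROUTE_NONE;   /* not routable; the compiler must re-schedule or fail */
}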
The specific constraints are determined by the topology supported by the interconnect structure of the hardware module. The interconnect fabric is programmed to cause the atoms of the hardware module to perform their operations in a particular order and to provide data to one another within these constraints. FIG. 15 shows one particular example of how the interconnect structure may be programmed.
When synthesizing the FPGA application 515 onto the FPGA (as shown in FIG. 5), a place and route algorithm is likewise used. In the case of the hardware module, however, the solution space is limited, and the execution time of the algorithm is therefore also limited.
There is a trade-off between processing speed or efficiency and compile time. According to embodiments of the present application, it may be desirable to first compile and run a program on a first at least one processing unit (which may be a CPU or an atom as described above with reference to fig. 6) to provide functionality for received data packets. The first at least one processing unit may then perform the function with respect to received data packets during a first time period, i.e. during operation of the network interface device. A second at least one processing unit (which may be an FPGA application or a template-type processing unit as described above with respect to fig. 6) may also be configured to perform the function on data packets. The function may then be migrated from the first at least one processing unit to the second at least one processing unit, such that the second at least one processing unit subsequently executes the function with respect to data packets subsequently received at the network interface device. Thus, the slower compile time of the second at least one processing unit does not prevent the network interface device from performing the function on data packets before the function has been compiled for the second at least one processing unit, because the first at least one processing unit may be compiled more quickly and may be used to perform the function on data packets while the function for the second at least one processing unit is being compiled. Since the second at least one processing unit typically has a faster processing time, migrating to the second at least one processing unit once its compilation is complete allows for faster processing of data packets received at the network interface device.
According to embodiments of the application, the compilation process may be configured to run on at least one processor of the data processing system, wherein the at least one processor is configured to send instructions for the first at least one processing unit and the second at least one processing unit to perform at least one function on the data packet at an appropriate time. The at least one processor may comprise a host CPU. The at least one processor may include a control processor on the network interface device. The at least one processor may comprise a combination of one or more processors on the host system and one or more processors on the network interface device.
Thus, the at least one processor is configured to execute a first compilation process to compile a function to be performed by the first at least one processing unit of the network interface device. The at least one processor is further configured to execute a second compilation process to compile the function to be performed by a second at least one processing unit of the network interface device. The at least one processor instructs the first at least one processing unit to perform the function on data packets received from the network before the second compilation process is complete. Subsequently, after completion of the second compilation process, the at least one processor instructs the second at least one processing unit to begin performing the function on data packets received from the network.
Performing these steps enables the network interface device to perform functions using the first at least one processing unit (which may have a shorter compilation time but slower and/or less efficient processing) while waiting for the second compilation process to complete. When the second compilation process is complete, the network interface device, in addition to or in place of the first at least one processing unit, may then use a second at least one processing unit (which may have a longer compilation time, but faster and/or more efficient processing) to perform the functions.
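The two-step compile-and-migrate flow described above might be summarised by the following sketch; the function names are placeholders standing in for the real compile and traffic-steering operations, and the stubs exist only so that the sketch is self-contained.

#include <stdbool.h>
#include <stdio.h>

/* Placeholder stubs standing in for the real compile and traffic-steering
 * operations; they are assumptions made for this illustration. */
static void compile_for_first_unit(const char *prog) { printf("fast compile: %s\n", prog); }
static bool second_unit_compile_done(void)           { return true; }
static void steer_traffic_to(const char *unit)       { printf("traffic -> %s\n", unit); }

static void deploy_function(const char *prog)
{
    compile_for_first_unit(prog);        /* quick: CPUs or template atoms   */
    steer_traffic_to("first unit");      /* packets are processed meanwhile */

    /* much slower, e.g. synthesis of an FPGA application */
    while (!second_unit_compile_done())
        ;                                /* the first unit keeps serving    */

    steer_traffic_to("second unit");     /* migrate the function            */
    /* the first unit may now stop performing the function, fully or partly */
}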
Referring to FIG. 7, an example network interface device 700 is shown in accordance with an embodiment of the present application. Elements that are the same as those shown in the previous figures are denoted by the same reference numerals.
The network interface device comprises a first at least one processing unit 710. The first at least one processing unit 710 may include a hardware module 610 including a plurality of processing units as shown in fig. 6. The first at least one processing unit 710 may include one or more CPUs, as shown in fig. 4.
The function is compiled to run on the first at least one processing unit 710 such that the function is executed by the first at least one processing unit 710 with respect to data packets received from the network during the first time period. The first at least one processing unit 710 is instructed by the at least one processor to perform a function on a data packet received from the network before the second compilation process for the second at least one processing unit is complete.
The network interface device comprises a second at least one processing unit 720. The second at least one processing unit 720 may comprise an FPGA with FPGA applications (e.g., as shown in fig. 5), or may comprise a hardware module 610 as shown in fig. 6, which includes a plurality of processing units.
During the first time period, a second compilation process is performed to compile functions for execution on a second at least one processing unit. That is, the network interface device is configured to dynamically compile the FPGA application 515.
After a first period of time (i.e., after completion of the second compilation process), the second at least one processing unit 720 is configured to begin performing functions on data packets received from the network.
After a first period of time, the first at least one processing unit 710 may stop performing functions with respect to data packets received from the network. In some embodiments, the first at least one processing unit 710 may partially stop performing functions on the data packet. For example, if the first at least one processing unit includes multiple CPUs, after a first period of time, one or more CPUs may stop performing processing with respect to data packets received from the network while the remaining CPUs of the multiple CPUs continue to perform the processing.
The first at least one processing unit 710 may be configured to perform a function on the data packets of the first data stream. When the second compilation process is complete, the second at least one processing unit 720 may begin performing functions on the data packets of the first data stream. When the second compilation process is complete, the first at least one processing unit may stop performing functions on the data packets of the first data stream.
Different combinations are possible for the first at least one processing unit and the second at least one processing unit. For example, in some embodiments, the first at least one processing unit 710 includes multiple CPUs (as shown in fig. 4), while the second at least one processing unit 720 includes a hardware module having multiple processing units (as shown in fig. 6). In some embodiments, the first at least one processing unit 710 comprises a plurality of CPUs (as shown in fig. 4) and the second at least one processing unit 720 comprises an FPGA (as shown in fig. 5). In some embodiments, the first at least one processing unit 710 comprises a hardware module having a plurality of processing units (as shown in fig. 6), and the second at least one processing unit 720 comprises an FPGA (as shown in fig. 5).
Referring to fig. 11, it is shown how the connected plurality of processing units 640a, 640b, 640d may perform their respective at least one operation on a data packet. Each processing unit is configured to perform its respective at least one operation on the received data packet.
At least one operation of each processing unit may represent a logical stage in a function (e.g., a function of an eBPF program). The at least one operation of each processing unit may be expressed by instructions executed by the processing unit. The instructions may determine the behavior of the atom.
FIG. 11 shows how a data packet (P0) proceeds along the processing stages implemented by each processing unit.
Each processing unit performs processing on the data packets in a particular order specified by the compiler. The order may be such that some of the processing units are configured to perform their processing in parallel. The processing may include accessing at least a portion of the data packet stored in the memory. Additionally or alternatively, the processing may include performing a lookup on a lookup table to determine an action to perform on the data packet. Additionally or alternatively, the processing may include modifying the state 1110.
The processing units exchange metadata M0, M1, M2, M3 with each other. The first processing unit 640a is configured to perform its respective at least one predetermined operation and to generate metadata M1 in response thereto. The first processing unit 640a is configured to pass the metadata M1 to the second processing unit 640b.
At least some of the processing units perform their respective at least one operation in accordance with at least one of: the contents of the data packet, their own stored state, a global shared state, and metadata associated with the data packet (e.g., M0, M1, M2, M3). Some processing units may be stateless.
Each processing unit may take at least one clock cycle to perform its associated type of operation with respect to a data packet (P0). In some embodiments, each processing unit may perform its associated operation type in a single clock cycle. Each processing unit may be individually clocked for performing its operations. This clocking may be complementary to the clocking of the processing pipeline of the processing units.
Examining the operation of the second processing unit 640b in more detail, the second processing unit 640b is configured to be connected to a first processing unit 640a configured to perform a first at least one predetermined operation with respect to the first data packet. The second processing unit 640b is configured to receive the result of the first at least one predetermined operation from the first processing unit 640a. The second processing unit 640b is configured to perform a second at least one predetermined operation in dependence upon the result of the first at least one predetermined operation. The second processing unit 640b is configured to be connected to a third processing unit 640d, the third processing unit 640d being configured to perform a third at least one predetermined operation with respect to the first data packet. The second processing unit 640b is configured to send the result of the second at least one predetermined operation to the third processing unit 640d for processing in the third at least one predetermined operation.
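A hedged sketch of how such chained stages and their metadata might be modelled in software is given below; the struct layout, the stage function signature, and the sequential loop are assumptions for illustration only (in the hardware module each stage would be a separate atom and packets would be pipelined).

#include <stdint.h>
#include <stddef.h>

struct metadata {                /* M0, M1, M2, ... passed between stages */
    uint32_t fields[4];
};

struct packet {
    const uint8_t *data;
    size_t         len;
};

/* A stage receives the previous stage's metadata, may consult the packet
 * and its own local state, and emits metadata for the next stage. */
typedef struct metadata (*stage_fn)(const struct packet *pkt,
                                    struct metadata in, void *local_state);

/* Software model only: in the hardware module each stage is a separate
 * atom and successive packets are pipelined rather than looped over. */
static struct metadata run_chain(const struct packet *pkt, struct metadata m,
                                 stage_fn *stages, void **states, size_t n)
{
    for (size_t i = 0; i < n; i++)
        m = stages[i](pkt, m, states[i]);
    return m;
}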
The processing unit may operate similarly to provide functionality with respect to each of the plurality of data packets.
Embodiments of the application enable pipelining of multiple packets simultaneously as functionality allows.
Referring to fig. 12, a pipeline of data packets is shown. As shown, different packets may be processed simultaneously by different processing units. At a first time (t0), the first processing unit 640a performs its respective at least one operation with respect to the third data packet (P2). At the first time (t0), the second processing unit 640b performs its respective at least one operation with respect to the second data packet (P1). At the first time (t0), the third processing unit 640d performs its respective at least one operation with respect to the first data packet (P0).
Each packet moves along one stage in the sequence after each processing unit has performed its respective at least one operation. For example, at a subsequent second time (t1), the first processing unit 640a performs its respective at least one operation with respect to the fourth data packet (P3), the second processing unit 640b performs its respective at least one operation with respect to the third data packet (P2), and the third processing unit 640d performs its respective at least one operation with respect to the second data packet (P1).
It should be understood that in some embodiments, there may be multiple packets in a given phase.
In some embodiments, packets may move from one phase to the next, without having to be in a lock step.
A pipeline running at a fixed clock may have a constant bandwidth as long as there is no pipeline hazard. This may reduce jitter in the system.
To avoid hazards in executing instructions (e.g., conflicts in accessing the shared state), each processing unit may be configured to execute no-operation instructions (i.e., to stall) when necessary.
In some embodiments, an operation (e.g., simple arithmetic such as incrementing, adding or subtracting constant values, shifting, or adding or subtracting values from the packet or from metadata) requires one clock cycle to be performed by the processing unit. This may mean that a shared state value required by one processing unit has not yet been updated by another processing unit. Thus, stale values in the shared state 1110 may be read by the processing units that need them. A hazard may therefore occur when reading and writing values of the shared state. Operations on intermediate values, on the other hand, can be passed as metadata without hazard.
An example of a hazard that may be avoided when reading and writing the shared state 1110 can be given in the context of an increment operation. Such an increment operation may be an operation that increments a packet counter in the shared state 1110. In one implementation of the increment operation, during a first time slot of the pipeline, the second processing unit 640b is configured to read the value of the counter from the shared state 1110 and provide the output of the read operation (e.g., as metadata M2) to the third processing unit 640d. The third processing unit 640d is configured to receive the value of the counter from the second processing unit 640b. During the second time slot, the third processing unit 640d increments the value and writes the new, incremented value into the shared state 1110.
A problem may arise in performing such an increment operation in that, if the second processing unit 640b attempts to access the counter stored in the shared state 1110 during the second time slot, the second processing unit 640b may read the previous value of the counter before the counter value in the shared state 1110 has been updated by the third processing unit 640d.
Therefore, to address this issue, the second processing unit 640b may stall (i.e., execute a no-op instruction or pipeline bubble) during the second time slot. A stall may be understood as a delay in the execution of the next instruction. This delay may be implemented by executing a "no operation" instruction instead of the next instruction. The second processing unit 640b then reads the counter value from the shared state 1110 during a subsequent third time slot. By the third time slot, the counter in the shared state 1110 has been updated, thus ensuring that the second processing unit 640b reads the updated value.
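As a hedged sketch only, the read-after-write check that would trigger such a stall might look as follows; the data structure and slot numbering are assumptions for illustration.

#include <stdbool.h>

struct pending_write {
    int location;      /* which shared-state location will be written     */
    int write_slot;    /* pipeline time slot in which the write completes */
};

/* Returns true if reading 'location' in 'read_slot' would observe a stale
 * value, in which case the reading unit executes a no-op and retries. */
static bool needs_stall(struct pending_write w, int location, int read_slot)
{
    return location == w.location && read_slot <= w.write_slot;
}

For the counter of the example above, the write completes in the second time slot, so a read scheduled in the second slot stalls and is retried in the third slot.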
In some embodiments, each atom is configured to read, update, and write back state during a single pipeline time slot. In this case, the stalling of the processing unit described above may not be used. However, stalling the processing unit may reduce the cost of the required memory interface.
In some embodiments, to avoid hazards, a processing unit in the pipeline may wait until other processing units in the pipeline have completed their processing before performing their own operations.
As previously described, the compiler constructs the data path by linking any number of processing stage instances, each of which is constructed from one of a predetermined number of presynthesized processing stage templates (three in the given example). The processing stage templates are logic stage templates (e.g., providing arithmetic operations on registers, scratchpads, and metadata), packet access stage templates (e.g., providing packet data loading and/or packet data storage), and map access stage templates (e.g., specifying the map lookup algorithm and mapping table size).
Each processing stage instance may be implemented by a single processing unit. That is, each processing stage includes a respective at least one operation performed by the processing unit.
Fig. 13 shows an example of how processing stages may be connected together in a pipeline 1300 to process a received data packet. As shown in fig. 13, a first data packet is received and stored at the FIFO 1305. One or more call parameters are received at the first logic stage 1310. The call parameters may include a program selector that identifies the function to be performed on the received packet. The call parameters may include an indication of the packet length of the received data packet. The first logic stage 1310 is configured to process the call parameters and provide output to the first packet access stage 1315.
The first packet access stage 1315 loads data from the first packet at network tap 1320. The first packet access stage 1315 may also write data to the first packet based on the output of the first logic stage 1310. The first packet access stage 1315 may write data ahead of the first packet. The first packet access stage 1315 may overwrite data in the packet.
The loaded data and any other metadata and/or parameters are then provided to a second logic stage 1325, which second logic stage 1325 performs processing on the first packet and provides output parameters to the first map access stage 1330. The first map access stage 1330 uses the output from the second logic stage 1325 to perform a lookup of a lookup table to determine an action to perform for the first packet. The output is then passed to third logic stage 1335, which processes the output and passes the result to second packet access stage 1340.
Second packet access stage 1340 may read data from and/or write data to the first packet according to the output of third logic stage 1335. The results of the second packet access stage 1340 are then passed to a fourth logic stage 1345, which fourth logic stage 1345 is configured to perform processing on the inputs it receives.
The pipeline may include a plurality of packet access stages, a logic stage, and a map access stage. The final logic stage 1350 is configured to output return parameters. The return parameters may include a pointer identifying the start of the data packet. The return parameters may include an indication of an action to be performed on the data packet. The indication of action may indicate whether the packet is to be dropped. The indication of the action may indicate whether to forward the data packet to the host system. The network interface device may include at least one processing unit configured to drop a respective packet in response to an indication that the packet is to be dropped.
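The call and return parameters described for this pipeline might be represented as follows; this is an illustrative sketch and the field names and action set are assumptions.

#include <stdint.h>

enum packet_action { ACTION_PASS_TO_HOST, ACTION_DROP, ACTION_FORWARD };

struct call_params {
    uint16_t program_selector;    /* which function to run on this packet */
    uint16_t packet_length;       /* length of the received data packet   */
};

struct return_params {
    uint32_t           packet_start;  /* pointer/offset to the start of the packet */
    enum packet_action action;        /* e.g. drop, or deliver to the host         */
};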
The pipeline 1300 may additionally include one or more bypass FIFOs 1355a, 1355b, 1355c. A bypass FIFO may be used to pass data, such as data from the first packet, around the map access stage and/or the packet access stage. In some embodiments, the map access stage and/or the packet access stage do not require data from the first packet to perform their respective at least one operation. The map access stage and/or the packet access stage may perform their respective at least one operation in dependence upon the input parameters.
Referring to fig. 8, a method 800 performed by a network interface device 600, 700 according to an embodiment of the present application is shown.
At S810, the hardware module of the network interface device is arranged to perform a function. The hardware module includes a plurality of processing units, each configured to perform a type of operation in hardware with respect to a data packet. S810 includes arranging for at least some of the plurality of processing units to perform their respective predetermined types of operation in a particular order so as to provide the function with respect to each received data packet. Arranging the hardware module in this way includes connecting at least some of the plurality of processing units such that a received data packet is processed by each of the operations of those processing units. The connection may be achieved by configuring the routing hardware of the hardware module to route packets and associated metadata between the processing units.
At S820, a first data packet is received from a network at a first interface of a network interface device.
At S830, the first data packet is processed by each of the at least some processing units that were connected during the configuration process of S810. Each of the at least some processing units performs, with respect to the first data packet, the type of operation that it is preconfigured to perform. The function is thus performed on the first data packet.
At S840, the processed first packet is forwarded to its destination. This may include sending a data packet to the host. This may include sending data packets over the network.
Referring to FIG. 9, a method 900 that may be performed in the network interface device 700 according to embodiments of the present application is illustrated.
At S910, a first at least one processing unit (i.e., a first circuit) of a network interface device is configured to receive and process data packets received from a network. The processing includes performing a function on the data packet. The process is performed during a first time period.
At S920, a second compilation process is performed during the first time period to compile a function for execution on a second at least one processing unit (i.e., a second circuit).
At S930, it is determined whether the second compiling process is completed. If not, the method returns to S910 and S920, where the first at least one processing unit continues to process the data packet received from the network and the second compilation process continues.
At S940, in response to determining that the second compilation is complete, the first at least one processing unit stops performing functions on the received data packet. In some embodiments, the first at least one processing unit may stop performing functions only for certain data streams. The second at least one processing unit may then perform functions on those particular data streams instead (at S950).
At S950, when the second compilation process is complete, the second at least one processing unit is configured to begin performing functions on data packets received from the network.
Referring to fig. 16, a method 1600 in accordance with an embodiment of the present application is shown. The method 1600 may be performed in a network interface device or a host device.
At S1610, a compiling process is performed so as to compile a function to be performed by the first at least one processing unit.
At S1620, a compiling process is performed to compile a function to be performed by the second at least one processing unit. The process includes allocating each of a plurality of processing units of the second at least one processing unit to perform at least one operation associated with one of a plurality of stages for processing a data packet so as to provide the first function. Each of the plurality of processing units is configured to perform a type of processing, and the allocation is performed in dependence upon a determination that the type of processing a processing unit is configured to perform is suitable for carrying out the respective at least one operation. In other words, the processing units are selected according to their templates.
At S1630, before the compilation process of S1620 is complete, an instruction is sent to cause the first at least one processing unit to perform the function. The instruction may be sent before the compiling process of S1620 is started.
At S1640, after the compilation process in S1620 is complete, an instruction is sent to the second circuit to cause the second circuit to perform a function on the data packet. The instruction may include a compiling instruction generated at S1620.
Functionality according to embodiments of the present application may be provided as pluggable components of a processing slice in a network interface. Referring to fig. 14, an example of how slices 1425 may be used in the network interface device 600 is shown. Slice 1425 may be referred to as a processing pipeline.
The network interface device 600 includes a transmit queue 1405 for receiving and storing data packets from the host that are to be processed by the slice 1425 and then transmitted over the network. The network interface device 600 includes a receive queue 1410 for storing data packets received from the network that are to be processed by the slice 1425 and then passed to the host. The network interface device 600 includes a receive queue 1415 for storing data packets received from the network that have been processed by the slice 1425 and are to be delivered to the host. The network interface device 600 includes a transmit queue for storing data packets received from the host that have been processed by the slice 1425 and are to be delivered to the network.
Slice 1425 of network interface device 600 includes a number of processing functions for processing data packets on the receive path and the transmit path. Slice 1425 may include a protocol stack configured to perform protocol processing on packets on the receive path and the transmit path. In some embodiments, there may be multiple slices in the network interface device 600. At least one of the plurality of slices may be configured to process a received data packet received from the network. At least one of the plurality of slices may be configured to process a transport packet for transmission over the network. The slicing may be implemented by a hardware processing device such as at least one FPGA and/or at least one ASIC.
The accelerator components 1430a, 1430b, 1430c, 1430d may be inserted into the slice at different stages, as shown. Each accelerator component provides a function with respect to the data packets traversing the slice. An accelerator component may be inserted or removed dynamically (i.e., during operation of the network interface device). The accelerator components are therefore pluggable components. Each accelerator component is an area of logic allocated to the slice 1425. Each accelerator component supports a streaming packet interface that allows packets flowing through the slice to flow into and out of the component.
For example, one type of accelerator component may be configured to provide encryption of data packets on the receive or transmit path. Another type of accelerator component may be configured to provide decryption of data packets on the receive or transmit path.
The functionality provided by the operations performed by a plurality of connected processing units (as discussed above with reference to fig. 6) may be provided by an accelerator component. Similarly, the functionality provided by the array of network processing CPUs (as discussed above with reference to FIG. 4) and/or the FPGA application (as discussed above with reference to FIG. 5) may be provided by an accelerator component.
As described, during operation of the network interface device, processing performed by the first at least one processing unit (e.g., a plurality of connected processing units) may be migrated to the second at least one processing unit. To accomplish such migration, a component of the slice 1425 whose processing is performed by the first at least one processing unit may be replaced with a component whose processing is performed by the second at least one processing unit.
The network interface device may include a control processor configured to insert components into and remove components from the slice 1425. During the first time period discussed above, there may be a component in the slice 1425 whose function is performed by the first at least one processing unit. The control processor may be configured, after the first time period, to: remove from the slice 1425 the pluggable component whose function is performed by the first at least one processing unit, and insert into the slice 1425 a pluggable component whose function is performed by the second at least one processing unit.
In addition to or instead of inserting components into a slice or removing components from a slice, the control processor may load programs into the components and issue control plane commands to control the flow of frames into the components. In such a case, a component may be caused to run or not run without being inserted into or removed from the pipeline.
In some embodiments, the control plane or configuration information is carried on the data path without a separate control bus. In some embodiments, the request to update the configuration of the datapath component is encoded as a message that is carried on the same bus as the network packet. Thus, a data path may carry two types of data packets: network data packets and control data packets.
Control packets are formed by the control processor and injected into the slice 1425 by the same mechanism as sending or receiving packets using the slice 1425. The same mechanism may be a transmit queue or a receive queue. The control packets may be distinguished from the network packets in any suitable manner. In some embodiments, different types of data packets may be distinguished by one or more bits in the metadata word.
In some embodiments, the control packet contains a routing field in the metadata word that determines the path the control packet takes through the slice 1425. The control data packet may carry a series of control commands. Each control command may be directed to one or more components of slice 1425. The corresponding datapath component is identified by a component ID field. Each control command encodes a request for a corresponding identified component. The request may be to make a change to the configuration of the component. The request may control whether the component is activated, i.e., whether the component performs its function on the data packets that traverse the slice.
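One possible, purely illustrative encoding of such control packets is sketched below in C; the bit assignments, field widths, and request codes are assumptions rather than the actual format.

#include <stdint.h>

#define META_IS_CONTROL  (1u << 0)   /* bit distinguishing control from network packets */

struct metadata_word {
    uint32_t flags;                  /* includes META_IS_CONTROL                */
    uint32_t route;                  /* path the packet takes through the slice */
};

struct control_command {
    uint16_t component_id;           /* which datapath component is addressed   */
    uint16_t request;                /* e.g. activate, deactivate, reconfigure  */
    uint32_t value;                  /* request-specific configuration payload  */
};

struct control_packet {
    struct metadata_word   meta;
    uint16_t               n_commands;
    struct control_command commands[];   /* a series of control commands        */
};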
Thus, in some embodiments, the control processor of the network interface device 600 is configured to send a message to cause one of the components of the slice to begin performing its function on data packets received at the network interface device. The message is a control plane message that is sent through the slice to the pluggable component and causes frames to be switched to the component so that the component performs its function. The component then operates on all received packets that pass through the slice until the component is switched out. The control processor is configured to send a message to another component of the slice to cause that component to stop performing its function on data packets received at the network interface device 600.
To switch components into and out of the data path of slice 1425, sockets may exist at various points in the ingress and egress data paths. The control processor may cause additional logic to be switched into and out of slice 1425. This additional logic may take the form of FIFOs placed between the components.
The control processor may send control plane messages to the configured components of slice 1425 through slice 1425. This configuration may determine the functions performed by the components of slice 1425. For example, control messages sent through slice 1425 may cause a hardware module to be configured to perform a function on a data packet. Such control messages may cause the atoms of the hardware modules to interconnect into the pipeline of the hardware modules to provide certain functionality. Such control messages may cause the respective atoms of the hardware module to be configured to select operations to be performed by the respective selected atoms. Since each atom is preconfigured to perform a certain operation, the selection of the operation for each atom depends on the type of operation each atom is preconfigured to perform.
Some other embodiments will now be described with reference to fig. 19 to 21. In these embodiments, a packet processing program or feed-forward pipeline runs in an FPGA. A method of enabling subunits of an FPGA to implement a packet processing program or feed-forward pipeline will be described. The packet processing program or feed-forward pipeline may be an eBPF program or a P4 program or any other suitable program.
The FPGA may be provided in a network interface device. In some embodiments, the packet handler is deployed or run only after the network interface device is installed with respect to its host.
A packet handler or feed forward pipeline may implement a logic flow without loops.
In some embodiments, programs may be written in a non-privileged or lower-privileged domain, such as at the user level. The program may run in a privileged or more privileged domain (e.g., the kernel). The hardware running the program may require that the program contain no arbitrary loops.
In the following embodiments, reference is made to the eBPF program example. However, it should be understood that other embodiments may be used with any other suitable procedure.
It is to be understood that one or more of the following embodiments can be used in combination with one or more of the previous embodiments.
Some embodiments may be provided in the context of an FPGA, an ASIC, or any other suitable hardware device. Some embodiments use sub-units such as FPGAs or ASICs. The following examples are described with reference to an FPGA. It should be understood that similar processing may be performed using an ASIC or any other suitable hardware device.
A subunit may be an atom. Some examples of atoms have been described previously. It should be understood that any of those previously described examples of atoms may alternatively or additionally be used as a subunit. Alternatively or additionally, these subunits may be referred to as "slices" or configurable logic blocks.
Each of these sub-units may be configured to execute a single instruction or multiple related instructions. In the latter case, the dependent instruction may provide a single output (which may be defined by one or more bits).
A subunit may be considered a computational unit. The subunits may be arranged in a pipeline in which the packets are processed in sequence. In some embodiments, subunits may be dynamically allocated to execute various instructions in a program.
In some embodiments, a subunit may be all or a portion of a unit used to define a block of a device such as an FPGA. In some FPGAs, the blocks of the FPGA are referred to as slices. In some embodiments, a subunit or atom is equivalent to a slice.
By mapping the respective atoms or subunits to respective blocks or slices of the FPGA, improved resource utilization may be achieved compared to methods that map RTL atoms to FPGA resources. Such latter approach may result in RTL atoms requiring a relatively large number of individual blocks or slices of the FPGA.
In some embodiments, the compilation may be at the atomic level. This may have the advantage of pipelining. The packets may be processed in sequence. The compilation process can be performed relatively quickly.
In some embodiments, an arithmetic operation may require one slice per byte. A logical operation may require half a slice per byte. Depending on the width of the shift operation, the shift operation may require a set of slices. The compare operation may require one slice per byte. The select operation may require half a slice per byte.
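The per-operation slice costs just listed might be captured in a small helper such as the following; this is an illustrative sketch only, and shift operations are omitted because their cost depends on the shift width.

enum op_kind { OP_ARITH, OP_LOGIC, OP_COMPARE, OP_SELECT };

/* Rough slice cost of an operation on an operand 'width_bytes' wide. */
static double slices_needed(enum op_kind op, unsigned width_bytes)
{
    switch (op) {
    case OP_ARITH:   return 1.0 * width_bytes;   /* one slice per byte    */
    case OP_LOGIC:   return 0.5 * width_bytes;   /* half a slice per byte */
    case OP_COMPARE: return 1.0 * width_bytes;   /* one slice per byte    */
    case OP_SELECT:  return 0.5 * width_bytes;   /* half a slice per byte */
    default:         return 0.0;
    }
}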
As part of the compilation process, placement and routing is performed. Placement is the allocation of a particular physical subunit to execute one or more particular instructions. Routing ensures that one or more outputs of a particular subunit are routed to the correct destination, which may be, for example, one or more other subunits.
Placement and routing may use a process that assigns operations to particular subunits starting from one end of the pipeline. In some embodiments, the most critical operations may be placed before less critical operations. In some embodiments, routes may be assigned at the same time as specific operations are placed. In some embodiments, a route may be selected from a limited set of pre-computed routes. This will be described in more detail later.
In some embodiments, if a route cannot be assigned, the operation will be retained for later use.
In some embodiments, the pre-computed route may be a byte wide route. However, this is merely exemplary, and in other embodiments, different routing widths may be defined. In some embodiments, a plurality of different sized routes may be provided.
In some embodiments, routing may be limited to routing between nearby subunits.
In some embodiments, the subunits may be physically arranged on the FPGA in a regular structure.
In some embodiments, to facilitate routing, rules may be formulated regarding how subunits may communicate. For example, a subunit can only provide output to subunits next to, above, or below it.
Alternatively or additionally, a limit may be placed on the distance of the next subunit for routing purposes. For example, a subunit may only output data to neighboring subunits or subunits that are within a defined distance (e.g., no more than one intermediate subunit).
Referring to fig. 19, a method of some embodiments is illustrated.
In some embodiments, an FPGA may have one or more "static" regions and one or more "dynamic" regions. A static region provides a standard configuration, and a dynamic region may provide functionality that meets the end user's requirements. For example, the static portion can be defined before the end user receives the network interface device, such as before the network interface device is installed with respect to its host. For example, the static region may be configured to cause the network interface device to provide certain functionality. The static region provides pre-computed routes between atoms. As will be discussed in more detail below, routing between one or more static regions may pass through one or more dynamic regions. When the network interface device is deployed with respect to the host, the dynamic region may be configured by the end user according to the end user's needs. The dynamic region may be configured to perform different functions for the end user over a period of time.
In step S1, a first compilation process is performed to provide a first bit file, referred to as the master bitfile 50, and a tool checkpoint 52. In some embodiments, this is a bit file for at least a portion of the static region. When downloaded to an FPGA, a bit file compiled from a program will cause the FPGA to function as specified by the program. In some embodiments, the program used in the first compilation process may be any one or more programs, or may be a test program specifically designed to help determine routes within a portion of the FPGA. In some embodiments, a series of simple programs may be used instead or in addition.
The program may be modified or may have reconfigurable partitions that can be used by the compiler. The program can be modified, for example by moving nets out of the reconfigurable partition, to make the compiler's job easier.
Step S1 may be performed in the design tool. For example only, the Vivado tool may be used with a Xilinx FPGA. The checkpoint file may be provided by a design tool. The checkpoint file represents a design snapshot when the bit file is generated. The checkpoint file may include one or more of a synthesized netlist, design constraints, placement information, and routing information.
In step S2, the bitfile is analyzed in view of the checkpoint file to provide a bitfile description 54. The analysis may be one or more of: detecting resources, generating routes, checking timing, generating one or more partial bitfiles, and generating bitfile descriptions.
The analysis may be configured to extract routing information from the bit file. The analysis may be configured to determine which lines or routes the signal has propagated.
The analysis stage may be performed at least in part in a synthesis or design tool. In some embodiments, the scripting tool of Vivado may be used. The scripting tool may use TCL (tool command language). TCL may be used to add to or modify the functionality of Vivado. The functions of Vivado may be invoked and controlled by TCL scripts.
The bit file description 54 defines how a given portion of the FPGA is used. For example, a bitfile description will indicate which atoms may be routed to which other atoms, and one or more routes that may be used between those atoms. For each atom, the bitfile description will indicate, for example, where the atom's inputs can come from and where the atom's outputs can be routed, along with one or more data output routes. The bit file description is independent of any particular program.
The bitfile description may contain one or more of: routing information, an indication of which routes conflict with one another, and a description of how a bitfile is generated from a desired atom configuration.
The bitfile description may provide a set of routes between a set of atoms which is available before any particular instruction has been allocated to a given atom.
The bit file description may be for a portion of the FPGA. The bit file description may be for the dynamic portion of the FPGA. The bit file description will indicate which routes are available and/or which routes are not. For example, the bit file description may take into account any routing required across the dynamic portion of the FPGA, such as indicating the routes available to the dynamic portion of the FPGA through the static portion of the FPGA.
It should be appreciated that in some embodiments, the bit file description may be obtained in any suitable manner. For example, the bit file description may be provided by the provider of the FPGA or ASIC.
In some embodiments, the bit file description may be provided by a design tool. In this embodiment, the analyzing step may be omitted. The design tool may output a bit file description. The bit file description may be for the static portion of the FPGA, including any required routing across the dynamic portion of the FPGA.
It should be appreciated that any other suitable technique may be used to generate the bit file description. In the examples described above, tools for designing FPGAs are used to provide analysis for generating bit files.
It should be understood that in other embodiments different tools may be used. In some embodiments, the tool may be specific to a product or a series of products. For example, a provider of an FPGA may provide the relevant tools for managing the FPGA.
In other embodiments, a universal scripting tool may be used.
In some embodiments, different tools or different techniques may be used to determine the partial bit file. For example, the master bitfile may be analyzed to determine which parts of the bit file correspond to which features. This may require the generation of multiple partial bit files.
It should be appreciated that step S3 is performed when the network interface device is installed with respect to a host and executed on a physical FPGA device. Steps S1 and S2 may be performed as part of a design integration process to generate a bitfile image that implements a network interface device. In some embodiments, step S1 and/or step S2 are used to characterize the behavior of the FPGA. Once the characteristics of the FPGA are determined, the description of the bit file is stored in the memory of all the physical network interface devices that will operate in the given defined manner.
In step S3, compilation is performed using the bitfile description and the eBPF program. The output of the compilation is a partial bit file of the eBPF program. The compilation will add the route to the partial bitfile and add the programming to be performed by each of the slices.
It should be understood that the bit file description may be provided in a deployed system. The bit file description may be stored in memory. The bit file description may be stored on the FPGA, on the network interface device, or on the host device. In some embodiments, the bit file description is stored in a flash memory or the like connected to the FPGA on the network interface device. The flash memory may also contain master bit files.
The eBPF program may be stored with the bit file description or separately. The eBPF program may be stored on the FPGA, on the network interface device, or on the host computer. In the case of eBPF, a program may be transferred from a user mode program to a kernel, both running on the host. The kernel will transfer the program to the device driver and then to a compiler running on the host or network interface device. In some embodiments, the eBPF program may be stored on the network interface device so that it can be run before the host OS is booted.
The compiler may be provided at any suitable location on the network interface device, FPGA or host. For example only, the compiler may run on a CPU on the network interface device.
The compiler flow will now be described. The front end of the compiler receives an eBPF program. The eBPF program may be written in any suitable language. For example, the eBPF program may be written in a C-type language. The compiler is configured in the front end to convert the program into an intermediate representation IR. In some embodiments, the IR may be LLVM-IR or any other suitable IR.
In some embodiments, pointer analysis may be performed to create packet/map access primitives.
It should be appreciated that in some embodiments, the optimization of IR may be performed by a compiler. This may be optional in some embodiments.
The high-level synthesis back end of the compiler is configured to divide the program pipeline into stages, generate packet access taps, and emit C code. In some embodiments, the design tool and/or the HLS portion of the design tool being used may be invoked to synthesize the output of the HLS stage.
The FPGA-atom back end of the compiler divides the pipeline into stages and generates packet access taps. An if-conversion may be performed to transform control dependencies into data dependencies. The design is then placed and routed, and the partial bit file for the eBPF program is sent.
A routing problem may arise as shown in Fig. 20a, where there is a routing conflict. For example, slice A may communicate with slice C and slice B may communicate with slice D. In the layout of Fig. 20a, a common routing section 60 has been assigned both to the communication between slice A and slice C and to the communication between slice B and slice D. In some embodiments, this routing conflict may be avoided. In this regard, reference is made to Fig. 20b. It can be seen that a separate route 62 is provided between slice A and slice C, distinct from the route 64 between slice B and slice D.
In some embodiments, the bit file description may include a plurality of different routes for at least some of the subunit pairs. The compilation process will check for routing conflicts such as that shown in Fig. 20a. In the case of a routing conflict, the compiler may resolve or avoid the conflict by selecting one of the appropriate alternative routes.
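By way of a non-limiting illustration, the conflict check and the selection of an alternative route could be sketched in C along the following lines. The data structures and names are hypothetical; the sketch merely assumes that each candidate route is recorded as a set of routing segment identifiers.

#include <stdbool.h>
#include <stddef.h>

#define MAX_SEGS 8

/* A candidate route, recorded as the set of routing segments it uses. */
struct route {
    int segs[MAX_SEGS];
    int num_segs;
};

/* Two routes conflict if they share a routing segment (cf. the common
 * routing section 60 of Fig. 20a). */
static bool routes_conflict(const struct route *a, const struct route *b)
{
    for (int i = 0; i < a->num_segs; i++)
        for (int j = 0; j < b->num_segs; j++)
            if (a->segs[i] == b->segs[j])
                return true;
    return false;
}

/* Pick the first alternative route for one connection that does not
 * clash with the route already chosen for another (cf. Fig. 20b). */
static const struct route *pick_route(const struct route *chosen,
                                      const struct route *alternatives,
                                      int num_alternatives)
{
    for (int i = 0; i < num_alternatives; i++)
        if (!routes_conflict(chosen, &alternatives[i]))
            return &alternatives[i];
    return NULL; /* no conflict-free alternative available */
}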
Fig. 21 shows a partition 66 in the FPGA for executing the eBPF program. The partition will interface with the static portion of the FPGA, for example via a series of input flip-flops 68 and a series of output flip-flops. In some embodiments, as previously discussed, there may be a route 70 across the partition.
The compiler may need to handle routing across the FPGA region that it is configuring. The compiler needs to generate a partial bit file that fits into the reconfigurable partition in the master bit file. When generating the master bit file using a reconfigurable partition, the design tool will avoid using logic resources within the reconfigurable partition so that the partial bit file can use those resources. However, the design tool may not avoid using routing resources in the reconfigurable partition.
Therefore, the analysis tool needs to avoid using the routing resources already used by the design tool in the master bit file. The analysis tool may need to ensure that the list of available routes in the bit file description does not include any of the resources used by the master bit file. The available routes can be defined in terms of routing templates, which can be used in many places within the FPGA due to its strong regularity. The routing resources used by the master bit file break this regularity, which means that the analysis tool avoids using these templates at locations that conflict with the master bit file. The analysis tool may need to generate new routing templates that can be used in those places and/or prevent certain routing templates from being used at specific locations.
Some examples of the functionality provided by a compiler when converting some example eBPF program slices into instructions to be executed by an atom will now be described.
Some embodiments may use any suitable synthesis tool to generate the bit file description. By way of example only, some embodiments may use a Bluespec tool based on a pattern that uses atomic transactions on hardware.
In the first example, the eBPF program slice has two instructions:
Instruction 1: r1 += r2
Instruction 2: r1 += r3
The first instruction adds the number in register 1 (r1) to the number in register 2 (r2) and puts the result in r1. The second instruction adds r1 to r3 and places the result in r1. Both instructions in this example use 64-bit registers, but operate only on the lowest 32 bits. The upper 32 bits of the result are padded with zeros.
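For clarity, the semantics just described can be modelled by the following C fragment. It is purely an illustrative model of the two instructions (32-bit additions whose results are zero-extended into the 64-bit register), not compiler output.

#include <stdint.h>

/* Model of the two example instructions: each is a 32-bit add whose
 * result is placed in the low 32 bits of r1 with the upper 32 bits
 * padded with zeros. */
static uint64_t example_slice_1(uint64_t r1, uint64_t r2, uint64_t r3)
{
    r1 = (uint32_t)((uint32_t)r1 + (uint32_t)r2); /* instruction 1: r1 += r2 */
    r1 = (uint32_t)((uint32_t)r1 + (uint32_t)r3); /* instruction 2: r1 += r3 */
    return r1; /* upper 32 bits are zero */
}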
The compiler will convert these into operations to be executed by atoms. A 32-bit add instruction requires 32 pairs of look-up tables (LUTs), a 32-bit carry chain, and 32 flip-flops.
Each pair of look-up tables will add two bits to produce a 2-bit result. The structure of the carry chain allows a bit to be carried from one column to the next during addition, and allows a borrow from the next column during subtraction.
The 32 flip-flops are storage elements that accept a value on one clock cycle and reproduce that value on the next clock cycle. They can be used to limit the amount of work done per clock cycle and to simplify timing analysis.
In some embodiments, an FPGA may include multiple slices. In some example slices, the carry chain propagates from the bottom of the slice (CIN) to the top of the slice (COUT), which is then connected to the CIN input of the next slice.
In the example with a 4-bit carry chain per slice, eight slices are used to perform 32-bit addition. In this embodiment, an atom may be considered to be provided by a pair of slices. This is because in some embodiments it may be convenient for an atom to operate on an 8-bit value.
In the example with an 8-bit carry chain per slice, four slices are used to perform 32-bit addition. In this embodiment, the atoms may be considered to be provided by the slice.
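A small C model of the 8-bit-carry-chain case may help to picture this: each slice adds one byte from each operand together with a carry in from the slice below and produces a carry out for the slice above, and four such slices chained through CIN/COUT form the 32-bit adder. The function names are illustrative only.

#include <stdint.h>

/* One 8-bit slice: the LUT pairs add the two input bytes, the carry
 * chain takes CIN and produces COUT, and the flip-flops would register
 * the sum for the next clock cycle. */
static uint8_t slice_add8(uint8_t a, uint8_t b, unsigned cin, unsigned *cout)
{
    unsigned sum = (unsigned)a + (unsigned)b + cin;
    *cout = sum >> 8;
    return (uint8_t)sum;
}

/* A 32-bit add built from four such slices chained through CIN/COUT. */
static uint32_t add32_from_slices(uint32_t a, uint32_t b)
{
    unsigned carry = 0;
    uint32_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t byte = slice_add8((uint8_t)(a >> (8 * i)),
                                  (uint8_t)(b >> (8 * i)),
                                  carry, &carry);
        result |= (uint32_t)byte << (8 * i);
    }
    return result;
}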
It should be understood that this is merely exemplary, and as previously described, atoms may be defined in any suitable manner.
Compilation of the first example eBPF program slice will now be described for the case where the FPGA has slices that support 8-bit carry chains.
There are 3 input values that are 32 bits wide and 1 output value that is 32 bits wide. There may be other earlier instructions that produce these 3 input values. In the following, some arbitrary positions of the slices (atoms) will be assumed.
The following numbering convention will be used. The slices (atoms) are arranged in regular rows and columns. XnYm represents the position of an atom in an arrangement. Xn denotes columns and Ym denotes rows. X6Y0 indicates that the slice is in column 6 and row 0. It should be appreciated that any other suitable numbering scheme may be used in other embodiments.
Assume that the initial values are generated simultaneously at the following positions:
r1: slices X6Y0, X6Y1, X6Y2 and X6Y3
r2: slices X6Y4, X6Y5, X6Y6 and X6Y7
r3: slices X6Y8, X6Y9, X6Y10 and X6Y11
The result of the first instruction needs to be computed in four adjacent slices in the same column in order for the carry chain to be correctly connected. The compiler may choose to compute the result in slices X7Y0, X7Y1, X7Y2 and X7Y3. For this purpose, the inputs need to be connected. There will be a connection from X6Y0 to X7Y0, a connection from X6Y1 to X7Y1, a connection from X6Y2 to X7Y2, and a connection from X6Y3 to X7Y3. Corresponding connections from X6Y4-X6Y7 to X7Y0-X7Y3 are also required.
These will be full byte connections, which means that each of the 8 bits of one slice's output is connected to the corresponding bit of the other slice's input. For example:
the output of flip-flop 0 of slice X6Y0 is connected to input 0 of slice X7Y0 LUT 0.
The output of flip-flop 1 of slice X6Y0 is connected to input 0 of slice X7Y0 LUT 1.
And so on until
The output of the slice X6Y0 flip-flop 7 is connected to input 0 of the slice X7Y0 LUT 7.
In the first clock cycle, the r1 and r2 values from slices X6Y0-X6Y7 will be transferred to the inputs of slices X7Y0-X7Y3 and processed by the LUTs and carry chains; the results will be stored in the flip-flops of those slices (X7Y0-X7Y3), ready for use in the next cycle.
Moving on to instruction 2, the compiler needs to select a location to compute its result. It may select slices X7Y4 through X7Y7. Likewise, the inputs from the result of instruction 1 (X7Y0 to X7Y3) to instruction 2 (X7Y4 to X7Y7) will be full byte connections.
The value of r3 is also required. If r1, r2 and r3 were generated in cycle 0, then r1 + r2 will be generated in cycle 1. The value of r3 therefore needs to be delayed by one clock cycle so that it is also available in cycle 1. The compiler may choose to regenerate r3 in cycle 1 by using slices X7Y8 through X7Y11. A connection then needs to be established between the original slices (X6Y8 to X6Y11) that produced r3 in cycle 0 and the new slices (X7Y8 to X7Y11) that produce the same value in cycle 1. After this is done, connections are needed from these new slices to the slices for instruction 2. Thus, the output of slice X7Y8 will be connected to the input of slice X7Y4, and so on.
The FPGA bit file will contain the following features:
full byte connection from X6Y0 to X7Y0 input 0 (initial r1 byte 0)
Full byte connection from X6Y1 to X7Y1 input 0 (initial r1 byte 1)
Full byte connection from X6Y2 to X7Y2 input 0 (initial r1 byte 2)
Full byte connection from X6Y3 to X7Y3 input 0 (initial r1 byte 3)
Full byte connection from X6Y4 to X7Y0 input 1 (initial r2 byte 0)
Full byte connection from X6Y5 to X7Y1 input 1 (initial r2 byte 1)
Full byte connection from X6Y6 to X7Y2 input 1 (initial r2 byte 2)
Full byte connection from X6Y7 to X7Y3 input 1 (initial r2 byte 3)
Full byte connection from X6Y8 to X7Y8 input 0 (initial r3 byte 0)
Full byte connection from X6Y9 to X7Y9 input 0 (initial r3 byte 1)
Full byte connection from X6Y10 to X7Y10 input 0 (initial r3 byte 2)
Full byte connection from X6Y11 to X7Y11 input 0 (initial r3 byte 3)
Slice X7Y0 is configured to add input 0 to input 1 (instruction 1 byte 0)
Slice X7Y1 is configured to add input 0 to input 1 (instruction 1 byte 1)
Slice X7Y2 is configured to add input 0 to input 1 (instruction 1 byte 2)
Slice X7Y3 is configured to add input 0 to input 1 (instruction 1 byte 3)
Slice X7Y8 configured to copy input 0 to output (r3 delay byte 0)
Slice X7Y9 configured to copy input 0 to output (r3 delay byte 1)
Slice X7Y10 configured to copy input 0 to output (r3 delay byte 2)
Slice X7Y11 configured to copy input 0 to output (r3 delay byte 3)
Full byte connection from X7Y0 to X7Y4 input 0 (instruction 1 byte 0)
Full byte connection from X7Y1 to X7Y5 input 0 (instruction 1 byte 1)
Full byte connection from X7Y2 to X7Y6 input 0 (instruction 1 byte 2)
Full byte connection from X7Y3 to X7Y7 input 0 (instruction 1 byte 3)
Full byte connection from X7Y8 to X7Y4 input 1 (r3 delay byte 0)
Full byte connection from X7Y9 to X7Y5 input 1 (r3 delay byte 1)
Full byte connection from X7Y10 to X7Y6 input 1 (r3 delay byte 2)
Full byte connection from X7Y11 to X7Y7 input 1 (r3 delay byte 3)
Slice X7Y4 is configured to add input 0 to input 1 (instruction 2 byte 0)
Slice X7Y5 is configured to add input 0 to input 1 (instruction 2 byte 1)
Slice X7Y6 is configured to add input 0 to input 1 (instruction 2 byte 2)
Slice X7Y7 is configured to add input 0 to input 1 (instruction 2 byte 3)
The compiler does not need to generate the upper 32 bits of the result of instruction 2 because they are known to be zero. It may simply record this fact and use zeros wherever they are needed.
A second example of compilation of eBPF slices will now be described.
Instruction 1: r1& ═ 0xff
Instruction 2: r2& ═ 0xff
Instruction 3: if r1< r2 goto L1
Instruction 4: r1 ═ r2
Label L1.
The first instruction performs a bitwise AND operation on r1 with the constant 0xff and puts the result in r1. A given bit in the result will be set to 1 if the corresponding bit is set to 1 both in r1 and in the constant; otherwise it will be set to zero. Bits 0 to 7 of the constant 0xff are set to 1 and bits 8 to 63 are cleared, so the result will be that bits 0 to 7 of r1 are unchanged, while bits 8 to 63 are set to zero. This simplifies the work of the compiler, since the compiler knows that bits 8 to 63 are zeros and does not need to generate them. The second instruction performs the same operation on r2.
Instruction 3 checks whether r1 is less than r2 and jumps to label L1 if it is. This skips instruction 4. Instruction 4 simply copies the value from r2 into r1. The instruction sequence therefore finds the minimum of r1 byte 0 and r2 byte 0, placing the result in r1 byte 0.
The compiler may use a technique called "if-conversion" to convert the conditional jump into a select instruction:
Instruction 1: r1 &= 0xff
Instruction 2: r2 &= 0xff
Instruction 5: c1 = (r1 < r2)
Instruction 6: r1 = c1 ? r1 : r2
Instruction 5 compares r1 with r2, sets c1 to 1 if r1 is less than r2, and otherwise sets c1 to zero. Instruction 6 is a select instruction, copying r1 to r1 if c1 is set (which has no effect), and otherwise copying r2 to r1. If c1 is equal to 1, instruction 3 will skip instruction 4, which means r1 will retain its value from instruction 1. In this case, the select instruction also leaves r1 unchanged. If c1 is equal to 0, instruction 3 will not skip instruction 4, so r2 will be copied into r1 by instruction 4. Likewise, the select instruction copies r2 into r1, so the new sequence has the same effect as the old sequence.
Instruction 6 is not a valid eBPF instruction. However, by the time it is processed by the compiler, the program is represented in LLVM-IR, and instruction 6 is a valid instruction in LLVM-IR.
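For illustration, the effect of the if-conversion can be modelled in C as follows. Both functions compute the minimum of byte 0 of r1 and byte 0 of r2; the first uses the conditional jump of instructions 3 and 4, the second the select of instructions 5 and 6. This is a model of the transformation only, not compiler output.

#include <stdint.h>

/* Original sequence with the conditional jump (instructions 1-4). */
static uint64_t min_byte_branch(uint64_t r1, uint64_t r2)
{
    r1 &= 0xff;               /* instruction 1 */
    r2 &= 0xff;               /* instruction 2 */
    if (r1 < r2)              /* instruction 3 */
        goto L1;
    r1 = r2;                  /* instruction 4 */
L1:
    return r1;
}

/* If-converted sequence with the select (instructions 1, 2, 5, 6). */
static uint64_t min_byte_select(uint64_t r1, uint64_t r2)
{
    r1 &= 0xff;               /* instruction 1 */
    r2 &= 0xff;               /* instruction 2 */
    uint64_t c1 = (r1 < r2);  /* instruction 5 */
    r1 = c1 ? r1 : r2;        /* instruction 6 (select) */
    return r1;
}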
Now, these instructions need to be allocated to atoms. Assume that input r1 is available in slices X0Y0 through X0Y7, and r2 is available in slices X0Y8 through X0Y15. Instructions 1 and 2 allow the compiler to note that the upper 7 bytes of r1 and r2 are set to zero.
The compiler may then choose to compute the result of instruction 5 in slice X1Y0. A full byte connection is required from the output of slice X0Y0 to input 0 of slice X1Y0, and a full byte connection is required from the output of slice X0Y8 to input 1 of slice X1Y0. The method of comparing the two values is to subtract one value from the other and then see whether the calculation overflows by needing to borrow from the next column. The result of this comparison is then stored in flip-flop 7 of slice X1Y0.
As with the first example, r1 and r2 will need to be delayed by one cycle to present the value to instruction 6 at the correct time. The compiler may use slices X1Y1 and X1Y2 for r1 and r2, respectively.
The select instruction requires three inputs: c1, r1 and r2. Note that r1 and r2 are one byte wide, while c1 is only one bit wide. Assume that the compiler chooses to compute the result of the select instruction in slice X2Y0. The selection is performed bit by bit, with each LUT in slice X2Y0 processing one bit:
bit 0 of the result is r1 bit 0 if c1 is set, and r2 bit 0 otherwise;
bit 1 of the result is r1 bit 1 if c1 is set, and r2 bit 1 otherwise;
..., and so on, until
bit 7 of the result is r1 bit 7 if c1 is set, and r2 bit 7 otherwise.
Each LUT needs to access the corresponding bit from r1 and the corresponding bit from r2, but all the LUTs need to access c1. This means that c1 needs to be copied to each of the bits of input 0 of the slice. Thus, the input connections for instruction 6 would be: bit 7 of the output of slice X1Y0 copied to every bit of input 0 of slice X2Y0.
Full byte connection from the output of slice X1Y1 to input 1 of slice X2Y0.
Full byte connection from the output of slice X1Y2 to input 2 of slice X2Y0.
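The per-bit selection performed by the LUTs of slice X2Y0, with c1 replicated across input 0, can be modelled in C as follows; the mask-based formulation is an illustrative equivalent of the bit-by-bit behaviour listed above.

#include <stdint.h>

/* Model of the select in slice X2Y0: c1 (one bit) is replicated across
 * all 8 bit positions, and each result bit is taken from r1 if c1 is
 * set and from r2 otherwise. */
static uint8_t select_byte(uint8_t c1, uint8_t r1, uint8_t r2)
{
    uint8_t mask = c1 ? 0xFF : 0x00;  /* c1 copied to every bit of input 0 */
    return (uint8_t)((r1 & mask) | (r2 & (uint8_t)~mask));
}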
Another problem to be solved is related to shift instructions. Consider the following example:
a 16-bit word shifted 5 bits to the left requires:
set output bit 0 to zero
Set output bit 1 to zero
Set output bit 2 to zero
Set output bit 3 to zero
Set output bit 4 to zero
Copying input bit 0 to output bit 5
Copying input bit 1 to output bit 6
and so on, until copying input bit 10 to output bit 15
It should be noted that the inputs and outputs here refer to a connection. The input to the connection is the output from the first slice. The output of the connection goes to the input of the second slice.
Such connections are not made within the slices but through the interconnect between the slices. The compiler can assume that the 16-bit input value is generated by two adjacent slices in the same column, because the compiler can ensure that the value is generated there.
For example, assume that the inputs were generated by slices X0Y4 and X0Y5, and the outputs will go into slices X1Y4 and X1Y5. In this case, the following connections are required:
bit 0 of slice X1Y4 is known to be zero and therefore is not required
Bit 1 of slice X1Y4 is known to be zero and therefore is not required
Bit 2 of slice X1Y4 is known to be zero and therefore is not required
Bit 3 of slice X1Y4 is known to be zero and therefore is not required
Bit 4 of slice X1Y4 is known to be zero and therefore is not required
Bit 5 of slice X1Y4 comes from bit 0 of slice X0Y4
Bit 6 of slice X1Y4 is from bit 1 of slice X0Y4
Bit 7 of slice X1Y4 is from bit 2 of slice X0Y4
Bit 0 of slice X1Y5 is from bit 3 of slice X0Y4
Bit 1 of slice X1Y5 is from bit 4 of slice X0Y4
Bit 2 of slice X1Y5 is from bit 5 of slice X0Y4
Bit 3 of slice X1Y5 is from bit 6 of slice X0Y4
Bit 4 of slice X1Y5 is from bit 7 of slice X0Y4
Bit 5 of slice X1Y5 comes from bit 0 of slice X0Y5
Bit 6 of slice X1Y5 is from bit 1 of slice X0Y5
Bit 7 of slice X1Y5 is from bit 2 of slice X0Y5
The 8 connections to the inputs of slice X1Y5 may be considered a shift connection or route. The same structure may be used for slice X1Y4, taking its inputs from X0Y3 and X0Y4; since bits 5 to 7 match and the slice ignores bits 0 to 4, it does not matter what is presented on those inputs.
It may be desirable to be able to shift any number between 1 and 7 bits. The connection shifted by 0 bits or 8 bits is the same as a full byte connection, since in this case each bit is connected to a corresponding bit of another slice.
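The fixed shift connection can be modelled in C as follows, assuming the 16-bit value is held as a low byte and a high byte in two adjacent slices (X0Y4 and X0Y5 in the example above); the function is an illustrative model of the listed connections, not part of any bit file.

#include <stdint.h>

/* Model of the fixed "shift left by 5" connection for a 16-bit value. */
static void shift_left_5(uint8_t in_lo, uint8_t in_hi,
                         uint8_t *out_lo, uint8_t *out_hi)
{
    uint16_t in  = (uint16_t)((uint16_t)in_lo | ((uint16_t)in_hi << 8));
    uint16_t out = (uint16_t)(in << 5);

    /* Low output byte (slice X1Y4): bits 0-4 are zero, bits 5-7 come
     * from input bits 0-2. */
    *out_lo = (uint8_t)out;
    /* High output byte (slice X1Y5): bits 0-7 come from input bits 3-10,
     * i.e. the shifted connection listed above. */
    *out_hi = (uint8_t)(out >> 8);
}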
A shift by a variable amount may be performed in two or three stages, depending on the width of the value to be shifted. These stages are:
stage 1: shift by 0, 1, 2 or 3.
And (2) stage: shift 0, 4, 8 or 12.
And (3) stage: shifts 0, 16, 32, or 48 (limited to 32-bit or 64-bit words only).
As another example, consider an arithmetic right shift of a byte by a variable amount, where the value to be shifted is generated by slice X3Y2 and the shift amount is generated by slice X3Y3.
Arithmetic right-shifting requires an "arithmetic right-shift" type of connection. This type of connection takes the output of one slice and connects it to the input of another slice, but in the process shifts them to the right by a constant amount and copies the sign bits as needed.
For example, an "arithmetic right shift 3" connection would have:
output bit 0 is from input bit 3
Output bit 1 from input bit 4
Output bit 2 from input bit 5
Output bit 3 from input bit 6
Output bit 4 from input bit 7
Output bit 5 from input bit 7 (sign bit)
Output bit 6 from input bit 7 (sign bit)
Output bit 7 from input bit 7 (sign bit)
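A short C model of this connection, in which output bits 0 to 4 take input bits 3 to 7 and the sign bit is copied into the remaining positions, is given below purely for illustration.

#include <stdint.h>

/* Model of an "arithmetic right shift 3" connection on a byte-wide slice. */
static uint8_t arith_shift_right_3(uint8_t in)
{
    uint8_t out = (uint8_t)(in >> 3);  /* output bits 0-4 from input bits 3-7 */
    if (in & 0x80)
        out |= 0xE0;                   /* output bits 5-7 copy the sign bit */
    return out;
}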
Stage 1 may be calculated in slice X4Y2, in which case it requires the following connections:
full byte connection from slice X3Y2 to slice X4Y2 input 0
arithmetic right shift 1 connection from slice X3Y2 to slice X4Y2 input 1
arithmetic right shift 2 connection from slice X3Y2 to slice X4Y2 input 2
arithmetic right shift 3 connection from slice X3Y2 to slice X4Y2 input 3
Copy slice X3Y3 bit 0 to slice X4Y2 input 4
Copy slice X3Y3 bit 1 to slice X4Y2 input 5
Slice X4Y2 would then be configured to select one of the first four inputs based on input 4 and input 5, as follows:
input 4 is 0, input 5 is 0: select input 0
Input 4 is 1, input 5 is 0: select input 1
Input 4 is 0, input 5 is 1: select input 2
Input 4 is 1 and input 5 is 1: select input 3
The shift amount may be copied from slice X3Y3 to slice X4Y3 to provide a delayed version.
Stage 2 may be calculated in slice X5Y2, in which case it requires the following connections:
full byte connection from slice X4Y2 to slice X5Y2 input 0
arithmetic right shift 4 connection from slice X4Y2 to slice X5Y2 input 1
Copy slice X4Y3 bit 2 to slice X5Y2 input 2
Slice X5Y2 would then be configured to select either input 0 or input 1 based on input 2, as follows:
input 2 is 0: select input 0
Input 2 is 1: select input 1
The output of slice X5Y2 will be the result of a variable arithmetic right shift operation.
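The behaviour of the two stages can be modelled in C as follows. The helper reproduces an "arithmetic right shift by k" connection, and the two selections mirror the mux configurations of slices X4Y2 and X5Y2 described above; the function names are illustrative only.

#include <stdint.h>

/* Model of an "arithmetic right shift by k" connection (0 <= k <= 4):
 * the bits are shifted right and the sign bit fills the vacated bits. */
static uint8_t asr_connection(uint8_t in, unsigned k)
{
    uint8_t sign = (uint8_t)((in & 0x80) ? (0xFFu << (8 - k)) : 0u);
    return (uint8_t)((in >> k) | sign);
}

/* Two-stage variable arithmetic right shift of a byte: stage 1 selects
 * among shifts of 0-3 using shift-amount bits 0 and 1 (slice X4Y2), and
 * stage 2 selects between a further shift of 0 or 4 using bit 2 (X5Y2). */
static uint8_t variable_asr_byte(uint8_t value, uint8_t shift)
{
    uint8_t stage1;
    switch (shift & 0x3) {
    case 0:  stage1 = asr_connection(value, 0); break; /* full byte connection */
    case 1:  stage1 = asr_connection(value, 1); break;
    case 2:  stage1 = asr_connection(value, 2); break;
    default: stage1 = asr_connection(value, 3); break;
    }
    return (shift & 0x4) ? asr_connection(stage1, 4) : stage1;
}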
The bit file description for a given atom may be as follows:
the identity of the atom;
a list of the other atoms from which the given atom can receive input, and the available routes for that input; and
a list of the other atoms to which the given atom can provide output, and the available routes for that output.
It should be appreciated that since FPGAs are regular structures, there may be one common template that can be used for multiple atoms and modified for a single atom if necessary.
For example, the bit file description of slice X7Y1 may specify the following possible inputs and outputs:
input from X6Y1 via route A or route B
Input from X6Y5 via route C or route D
Input from X7Y0 via route E or route F
Output to X8Y1 via route G or route H
Output to X7Y2 via route I or route J
Output to X7Y5 via route K or route L.
For the first eBPF example described previously, the compiler will use this bit file description to provide the following inputs and outputs for slice X7Y1 in the partial bit file:
input from X6Y1 via route A
Input from X6Y5 via route C
Output to X7Y5 via route K or route L.
For example, a bit file description of a slice XnYm may specify the following possible inputs and outputs:
input from Xn-1Ym via route A or route B
Input from Xn-1Ym+4 via route C or route D
Input from XnYm-1 via route E or route F
Output to Xn+1Ym via route G or route H
Output to XnYm+1 via route I or route J
Output to XnYm+4 via route K or route L.
This bit file description may be modified to delete one or more routes that are not available to the compiler, such as described previously. This may be because the route is used by another atom or for routing across partitions.
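Purely as an illustration of how such a description might be held in memory, the following C structures sketch one possible representation. The field names, array sizes and integer route identifiers are hypothetical and are not prescribed by the embodiments.

enum { MAX_ROUTES_PER_LINK = 4, MAX_LINKS = 8 };

/* One possible input or output of an atom: the peer atom (e.g. X6Y1 is
 * column 6, row 1) and the alternative routes still available for it. */
struct atom_link {
    int peer_col;
    int peer_row;
    int routes[MAX_ROUTES_PER_LINK]; /* e.g. identifiers for routes A, B, ... */
    int num_routes;                  /* routes not already used by the master bit file */
};

/* Per-atom entry of the bit file description. */
struct atom_description {
    int col;                           /* identity of the atom, e.g. X7Y1 -> (7, 1) */
    int row;
    struct atom_link inputs[MAX_LINKS];
    int num_inputs;
    struct atom_link outputs[MAX_LINKS];
    int num_outputs;
};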
It should be understood that a compiler may be implemented by a computer program comprising computer executable instructions that may be executed by one or more computer processors. The compiler may be run on hardware such as at least one processor operating in conjunction with one or more memories.
It should be noted that although the above describes exemplifying embodiments, there are several variations and modifications which may be made to the disclosed solution without departing from the scope of the present invention.
The embodiments may thus vary within the scope of the attached claims. In general, some embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software executed by a controller, microprocessor or other computing device, although the embodiments are not limited thereto.
Embodiments may be implemented by computer software stored in a memory and executed by at least one data processor of the involved entities or may be executed by hardware or by a combination of software and hardware.
The software may be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media such as hard or floppy disks, and optical media such as DVDs and data variants thereof, CDs.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
The data processor may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), gate level circuits, and processors based on a multi-core processor architecture, as non-limiting examples.
Various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings will still fall within the scope as defined in the appended claims.

Claims (30)

1. A network interface device for connecting a host to a network, the network interface device comprising:
a first interface configured to receive a plurality of data packets;
a configurable hardware module comprising a plurality of processing units, each processing unit being associated with a predetermined type of operation that can be performed in a single step, wherein at least some of the plurality of processing units are associated with different predetermined types of operation,
wherein the hardware module is configurable to interconnect at least some of the plurality of the processing units to provide a first data processing pipeline for processing one or more of the plurality of data packets to perform a first function on the one or more of the plurality of data packets.
2. The network interface device of claim 1, wherein two or more of at least some of the plurality of processing units are configured to perform their associated at least one predetermined operation in parallel.
3. The network interface device of claim 1 or 2, wherein two or more of at least some of the plurality of processing units are configured to:
performing its associated predetermined type of operation for a predetermined length of time specified by the clock signal; and
in response to the end of the predetermined length of time, transmitting a result of the respective at least one operation to a next processing unit.
4. The network interface device of any preceding claim, wherein each of the plurality of processing units comprises an application specific integrated circuit configured to perform the at least one operation associated with the respective processing unit.
5. The network interface device of any preceding claim, wherein each of the plurality of processing units comprises digital circuitry and a memory to store a state associated with a process performed by the digital circuitry, wherein the digital circuitry is configured to perform a predetermined type of operation associated with the respective processing unit in communication with the memory.
6. The network interface device of any preceding claim, comprising a memory accessible to two or more of the plurality of processing units, wherein the memory is configured to store a state associated with a first data packet, wherein during execution of a first function by the hardware module, the two or more of the plurality of processing units are configured to access and modify the state.
7. The network interface device of claim 6, wherein a first one of at least some of the plurality of processing units is configured to stop running while a value of the state is accessed by a second one of the plurality of processing units.
8. The network interface device of any preceding claim, wherein one or more of the plurality of processing units are individually configured to perform particular operations for the respective pipelines based on their associated predetermined operation types.
9. The network interface device of any preceding claim, wherein the hardware module is configured to receive an instruction and, in response to the instruction, perform at least one of:
interconnecting at least some of the plurality of processing units to provide a data processing pipeline for processing one or more of the plurality of data packets;
for the one or more data packets, causing one or more processing units of the plurality of processing units to perform their associated predetermined type of operation;
adding one or more processing units of the plurality of processing units to a data processing pipeline; and
one or more processing units of the plurality of processing units are removed from the data processing pipeline.
10. The network interface device of any preceding claim, wherein the predetermined operation comprises at least one of:
loading at least one value of the first data packet from a memory;
storing at least one value of the data packet in a memory; and
a lookup is performed in a lookup table to determine an action to be performed on the packet.
11. The network interface device of any preceding claim, wherein one or more of at least some of the plurality of processing units is configured to pass at least one result of its associated at least one predetermined operation to a next processing unit in the first processing pipeline, the next processing unit being configured to perform a next predetermined operation in dependence on the at least one result.
12. A network interface device as claimed in any preceding claim wherein each of the different predetermined types of operation is defined by a different template.
13. The network interface device of any preceding claim, wherein the predetermined type of operation comprises at least one of:
accessing the data packet;
accessing a lookup table stored in a memory of the hardware module;
performing a logical operation on data loaded from the data packet; and
a logical operation is performed on the data loaded from the lookup table.
14. The network interface device of any preceding claim, wherein the hardware module comprises routing hardware, and wherein the hardware module is configured to interconnect at least some of the plurality of processing units to provide the first data processing pipeline by configuring the routing hardware to route data packets between the plurality of processing units in a particular order specified by the first data processing pipeline.
15. The network interface device of any preceding claim, wherein the hardware module is configurable to interconnect at least some of the plurality of processing units to provide a second data processing pipeline for processing one or more of the plurality of data packets to perform a second function different from the first function.
16. A network interface device as claimed in any preceding claim, wherein the hardware module is configurable to interconnect at least some of the plurality of processing units so as to provide a second data processing pipeline after interconnecting at least some of the plurality of processing units to provide the first data processing pipeline.
17. The network interface device of any preceding claim, comprising further circuitry separate from the hardware module and configured to perform a first function on one or more of the plurality of data packets.
18. The network interface device of claim 17, wherein the additional circuitry comprises at least one of:
a field programmable gate array; and
a plurality of central processing units.
19. The network interface device of claim 17 or 18, wherein the network interface device comprises at least one controller, wherein the further circuitry is configured to perform the first function on a data packet during a compilation process for the first function to be performed in the hardware module, and wherein the at least one controller is configured to: controlling the hardware module to start executing the first function on a packet in response to completion of the compilation process.
20. The network interface device of claim 19, wherein the at least one controller is configured to: control the further circuitry to stop performing the first function on data packets in response to determining that the compilation process for the first function to be performed in the hardware module has been completed.
21. The network interface device of claim 17 or 18, wherein the network interface device comprises at least one controller, wherein the hardware module is configured to perform the first function on a data packet during a compilation process for the first function to be performed in the further circuitry, and wherein the at least one controller is configured to: determine that the compilation process for the first function to be performed in the further circuitry has been completed, and in response to the determination, control the further circuitry to begin performing the first function on data packets.
22. The network interface device of claim 21, wherein the at least one controller is configured to: in response to determining that the compilation process for the first function to be performed in the further circuitry has been completed, controlling the hardware module to stop performing the first function on data packets.
23. The network interface device of any preceding claim, comprising at least one controller configured to perform a compilation process to provide the first function to be performed in the hardware module.
24. A data processing system comprising a network interface device according to any preceding claim and a host device, wherein the data processing system comprises at least one controller configured to perform a compilation process to provide the first function to be performed in the hardware module.
25. The data processing system of claim 24, wherein the at least one controller is provided by one or more of:
the network interface device; and
the host device.
26. The data processing system of claim 24 or 25, wherein the compilation process is performed in response to a determination, made by the at least one controller, that a computer program representing the first function is safe for execution in a kernel mode of the host device.
27. The data processing system of claim 24, 25 or 26, wherein the at least one controller is configured to perform the compilation process by: designating each processing unit of at least some of the plurality of processing units to perform at least one operation represented by a series of computer code instructions in a particular order of the first data processing pipeline, wherein the plurality of operations provide the first function for the one or more data packets of the plurality of data packets.
28. The data processing system of any of claims 24 to 27, wherein the at least one controller is configured to:
sending a first instruction to cause further circuitry of the network interface device to perform the first function on a data packet before the compilation process is complete; and
sending a second instruction to cause the hardware module to start executing the first function on a data packet after the compiling process is completed.
29. A method for implementation in a network interface device, the method comprising:
receiving a plurality of data packets at a first interface; and
configuring a hardware module to interconnect at least some of a plurality of processing units of the hardware module to provide a first data processing pipeline for processing one or more packets of the plurality of packets to perform a first function on the one or more packets of the plurality of packets,
wherein each processing unit is associated with a predetermined type of operation that can be performed in a single step,
wherein at least some of the plurality of processing units are associated with different predetermined operation types.
30. A non-transitory computer-readable medium comprising program instructions for causing a network interface device to perform a method comprising:
receiving a plurality of data packets at a first interface; and
configuring a hardware module to interconnect at least some of a plurality of processing units of the hardware module to provide a first data processing pipeline for processing one or more packets of the plurality of packets to perform a first function on the one or more packets of the plurality of packets,
wherein each processing unit is associated with a predetermined type of operation that can be performed in a single step,
wherein at least some of the plurality of processing units are associated with different predetermined operation types.
CN201980087757.XA 2018-11-05 2019-11-05 Network interface device Pending CN113272793A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US16/180,883 US11012411B2 (en) 2018-11-05 2018-11-05 Network interface device
US16/180,883 2018-11-05
US16/395,027 US11082364B2 (en) 2019-04-25 2019-04-25 Network interface device
US16/395,027 2019-04-25
PCT/EP2019/080281 WO2020094664A1 (en) 2018-11-05 2019-11-05 Network interface device

Publications (1)

Publication Number Publication Date
CN113272793A true CN113272793A (en) 2021-08-17

Family

ID=68470520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980087757.XA Pending CN113272793A (en) 2018-11-05 2019-11-05 Network interface device

Country Status (5)

Country Link
EP (1) EP3877851A1 (en)
JP (1) JP2022512879A (en)
KR (1) KR20210088652A (en)
CN (1) CN113272793A (en)
WO (1) WO2020094664A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11677650B2 (en) * 2021-09-28 2023-06-13 Cisco Technology, Inc. Network flow attribution in service mesh environments
CN115309406B (en) * 2022-09-30 2022-12-20 北京大禹智芯科技有限公司 Performance optimization method and device of P4 control branch statement

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6798239B2 (en) * 2001-09-28 2004-09-28 Xilinx, Inc. Programmable gate array having interconnecting logic to support embedded fixed logic circuitry
JP2005130165A (en) * 2003-10-23 2005-05-19 Nippon Telegr & Teleph Corp <Ntt> Packet processor using reconfigurable device
US7861067B1 (en) * 2004-10-28 2010-12-28 Nvidia Corporation Adjustable cycle pipeline system and method
JP4456552B2 (en) * 2005-03-31 2010-04-28 富士通株式会社 LOGIC INTEGRATED CIRCUIT HAVING DYNAMIC SUBSTITUTION FUNCTION, INFORMATION PROCESSING DEVICE USING SAME, AND DYNAMIC SUBSTITUTION METHOD FOR LOGIC INTEGRATED CIRCUIT
JP2007241918A (en) * 2006-03-13 2007-09-20 Fujitsu Ltd Processor device
JP4740828B2 (en) * 2006-11-24 2011-08-03 株式会社日立製作所 Information processing apparatus and information processing system
DE102007022970A1 (en) * 2007-05-16 2008-11-20 Rohde & Schwarz Gmbh & Co. Kg Method and device for the dynamic reconfiguration of a radio communication system
US20090213946A1 (en) * 2008-02-25 2009-08-27 Xilinx, Inc. Partial reconfiguration for a mimo-ofdm communication system
US8667192B2 (en) * 2011-02-28 2014-03-04 Xilinx, Inc. Integrated circuit with programmable circuitry and an embedded processor system
US8874837B2 (en) * 2011-11-08 2014-10-28 Xilinx, Inc. Embedded memory and dedicated processor structure within an integrated circuit
US9450881B2 (en) * 2013-07-09 2016-09-20 Intel Corporation Method and system for traffic metering to limit a received packet rate
US9130559B1 (en) * 2014-09-24 2015-09-08 Xilinx, Inc. Programmable IC with safety sub-system
US9940284B1 (en) * 2015-03-30 2018-04-10 Amazon Technologies, Inc. Streaming interconnect architecture

Also Published As

Publication number Publication date
EP3877851A1 (en) 2021-09-15
WO2020094664A1 (en) 2020-05-14
KR20210088652A (en) 2021-07-14
JP2022512879A (en) 2022-02-07

Similar Documents

Publication Publication Date Title
US11824830B2 (en) Network interface device
US11082364B2 (en) Network interface device
Li et al. Clicknp: Highly flexible and high performance network processing with reconfigurable hardware
US10866842B2 (en) Synthesis path for transforming concurrent programs into hardware deployable on FPGA-based cloud infrastructures
EP1766544B1 (en) Execution of hardware description language (hdl) programs
US20100037035A1 (en) Generating An Executable Version Of An Application Using A Distributed Compiler Operating On A Plurality Of Compute Nodes
US20220058005A1 (en) Dataflow graph programming environment for a heterogenous processing system
US11687327B2 (en) Control and reconfiguration of data flow graphs on heterogeneous computing platform
US9223921B1 (en) Compilation of HLL code with hardware accelerated functions
US11709664B2 (en) Anti-congestion flow control for reconfigurable processors
US11113030B1 (en) Constraints for applications in a heterogeneous programming environment
Sivaraman et al. Packet transactions: A programming model for data-plane algorithms at hardware speed
Hoskote et al. A TCP Offload Accelerator for 10 Gb/s Ethernet in 90-nm CMOS
US9824172B1 (en) Performance of circuitry generated using high-level synthesis
US9710584B1 (en) Performance of circuitry generated using high-level synthesis
CN113272793A (en) Network interface device
JP5146451B2 (en) Method and apparatus for synchronizing processors of a hardware emulation system
Contini et al. Enabling Reconfigurable HPC through MPI-based Inter-FPGA Communication
US11983141B2 (en) System for executing an application on heterogeneous reconfigurable processors
Ax et al. System-level analysis of network interfaces for hierarchical mpsocs
Greaves Layering rtl, safl, handel-c and bluespec constructs on chisel hcl
Mereu Conception, Analysis, Design and Realization of a Multi-socket Network-on-Chip Architecture and of the Binary Translation support for VLIW core targeted to Systems-on-Chip
White III Performance Portability of Programming Strategies for Nearest-Neighbor Communication with GPU-Aware MPI
US20240037061A1 (en) Sorting the Nodes of an Operation Unit Graph for Implementation in a Reconfigurable Processor
Linders Compiler Vectorization for Coarse-Grained Reconfigurable Architectures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination