EP3877864A1 - Streaming platform flow and architecture
- Publication number
- EP3877864A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- circuit
- kernel
- data
- traffic manager
- stream
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F15/7867 — Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
- G06F15/17331 — Interprocessor communication using an interconnection network; distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
- G06F15/7835 — Architectures of general purpose stored program computers comprising a single central processing unit without memory on more than one IC chip
Definitions
- This disclosure relates to integrated circuits (ICs) and, more particularly, to using data streams for communications between a host system and hardware accelerated circuitry and for communication between kernel circuits of the hardware accelerated circuitry.
- In one or more embodiments, a system includes a host system and an IC coupled to the host system through a communication interface.
- The IC is configured for hardware acceleration.
- The IC includes a direct memory access circuit coupled to the communication interface, a kernel circuit, and a stream traffic manager circuit coupled to the direct memory access circuit and the kernel circuit.
- The stream traffic manager circuit is configured to control data streams exchanged between the host system and the kernel circuit.
- The IC includes an input buffer coupled to an output port of the first interconnect and to an input port of the kernel circuit, wherein the input buffer is configured to temporarily store the packetized data, convert the packetized data into a data stream, and provide the data stream to the kernel circuit.
- The stream traffic manager circuit initiates a data transfer to the kernel circuit in response to determining that the input buffer has space available.
- The IC includes an input buffer coupled to an input port of the second kernel circuit within the second die and configured to temporarily store data streamed to the second kernel circuit and an output buffer coupled to an output port of the first kernel circuit within the first die and configured to temporarily store data output from the first kernel circuit.
- The stream traffic manager circuit is configured to initiate a data transfer from the first kernel circuit to the second kernel circuit in response to determining that the input buffer has space available and the output buffer is storing data.
- The stream traffic manager circuit and the satellite stream traffic manager circuit are configured to exchange the data stream in response to determining that an input buffer of a receiving kernel circuit has space available.
- The stream data transfer includes an in-band instruction that controls operation of the receiving kernel circuit.
- FIG. 5 illustrates an example method of exchanging data between kernel circuits using data streams.
- FIG. 7 illustrates an example architecture for an IC.
- This disclosure relates to ICs and, more particularly, to using data streams for communications between a host system and hardware accelerated circuitry and for communication between kernel circuits of the hardware accelerated circuitry.
- An IC implements hardware accelerated circuitry as one or more kernel circuits.
- Each kernel circuit represents hardware accelerated program code.
- The host system is capable of offloading one or more tasks to the kernel circuits implemented within the IC. In doing so, the host system transfers the data to be operated on by the kernel circuits using an architecture that supports data streams.
- The kernel circuits are capable of exchanging data with one another using the data stream enabled architecture.
- The kernel circuits also transfer data, e.g., results, to the host system as data streams that are packetized prior to sending to the host system.
- Streaming is performed over a data path within the IC that utilizes one or more smaller internal memory buffers.
- The memory buffers, for example, are smaller in size than the amount of data exchanged between the host system and the kernel circuits.
- A streaming architecture as described within this disclosure facilitates faster data transfers, less latency, and more efficient usage of memory compared to conventional systems.
- Kernel circuits can begin operating on data immediately upon receipt of less than the entirety of the data, rather than waiting for the entirety of the data to be first transferred to off-chip RAM and then loaded into the kernel circuit. This improves speed and latency of the overall system. Similar gains in speed and latency are obtained by streaming data from the kernel circuits to the host system.
- Commands from the host system to the kernel circuits may be included in the data streams themselves, e.g., in-banded, which further reduces system latency.
- Further, little or no off-chip RAM is required, which reduces the power requirements of the system and/or hardware accelerator.
- FIG. 1 illustrates an example architecture 100 for hardware acceleration.
- Architecture 100 includes a host system 102 and a hardware accelerator 103.
- Host system 102 is implemented as a computer system such as a server or other data processing system.
- Hardware accelerator 103 is implemented as a circuit board having an IC 104 and a memory 106 attached thereto.
- Hardware accelerator 103 may be implemented as an accelerator card having an edge connector that can be inserted into an available peripheral slot of host system 102.
- IC 104 is implemented as a programmable IC. In particular embodiments, IC 104 is implemented using an architecture the same as or similar to that described in connection with FIG. 7. In the example of FIG. 1, IC 104 includes an endpoint 108, a direct memory access circuit (DMA) 110, a kernel circuit 112, an interface 116, and a memory controller 114.
- Endpoint 108 is an interface that is capable of communicating over a communications bus with host system 102.
- The communications bus may be implemented as a Peripheral Component Interconnect Express (PCIe) bus.
- Endpoint 108 may be implemented as a PCIe endpoint. It should be appreciated, however, that other communication buses may be used and that the examples provided are not intended to be limiting. Accordingly, endpoint 108 can be implemented as any of a variety of suitable interfaces for communicating over a communication bus.
- Kernel circuit 112 is capable of transferring data to host system 102 by outputting a data stream that is packetized prior to being provided to host system 102 by way of DMA 110 and endpoint 108. Further details relating to the transfer of data are described in greater detail below.
- Interface 116 is a stream-enabled on-chip interconnect such as an Advanced Microcontroller Bus Architecture (AMBA®) Advanced eXtensible Interface (AXI) stream interconnect.
- An AXI-stream interconnect enables connection of heterogeneous master/slave AMBA® AXI-stream protocol compliant circuit blocks.
- Interface 116 is capable of routing connections conveying packetized data from one or more masters to one or more slaves.
- AXI is provided for purposes of illustration and is not intended to be limiting. It should be appreciated that interface 116 can be implemented as any of a variety of interconnects.
- Interface 116 can be implemented as a bus, a network-on-chip (NoC), a cross-bar, a switch, or other type of interconnect.
- Memory controller 114 is coupled to memory 106.
- Memory 106 is implemented as a RAM.
- Memory controller 114 may be multi-ported and is coupled to DMA 110 and to kernel circuit 112.
- Memory controller 114 is capable of accessing (e.g., reading and/or writing) memory 106 under control of DMA 110 and/or kernel circuit 112.
- DMA 110 is coupled to memory controller 114 through a memory mapped interface 118.
- Kernel circuit 112 is coupled to memory controller 114 through a memory mapped interface 120.
- DMA 110 is coupled to kernel circuit 112 via a control interface 122.
- Control interface 122 is implemented as an AXI-Lite interface that is configured to provide point-to-point bidirectional communication with a circuit block.
- AXI-Lite can be used as a control interface for kernel circuit 112. As discussed, AXI is provided for purposes of illustration and not limitation.
- The architecture illustrated in FIG. 1 is also capable of supporting data transfers between host system 102 and kernel circuit 112 through memory 106.
- Host system 102 sends data to memory 106.
- The data may be provided to DMA 110, which stores the data within memory 106 using memory controller 114.
- The data is accumulated and stored in memory 106 as previously described until the data transfer is complete.
- Host system 102 may notify kernel circuit 112 of the availability of the data in memory 106 through control interface 122.
- Kernel circuit 112 is capable of accessing memory controller 114 to read the data from memory 106.
- Kernel circuit 112 generates results and stores the results within memory 106.
- Kernel circuit 112 notifies host system 102 of the availability of the results in memory 106 through control interface 122.
- In the examples where data is transferred to kernel circuit 112 or multiple kernel circuits implemented in IC 104 using memory 106, host system 102 has the responsibility of allocating and sharing memory 106 between the various kernel circuits. Host system 102 configures and starts kernel circuits through control interface 122. Control interface 122, however, tends to be a slower interface with significant latency. Besides having to communicate with the kernel circuits through control interface 122, host system 102 also must manage and synchronize kernel circuit operation, adding significant overhead to host system 102. Host system 102, for example, must synchronize the data transfers with the control signals to start and/or stop kernel circuits at the appropriate time(s).
- Architecture 100 is implemented to support direct communication between host system 102 and kernel circuit 112 by way of packetized data and data streams.
- Memory mapped communication capability may be omitted.
- Control interface 122, memory mapped interfaces 118 and 120, and memory controller 114 may be omitted (as may be memory 106).
- In other arrangements, architecture 100 is implemented to support both memory mapped communication involving memory 106 and direct communication using packetized data and data streams.
- DMA 110 may support both types of data transfer.
- Although a single kernel circuit is illustrated in the example of FIG. 1, a plurality of kernel circuits may be implemented, where some kernel circuits utilize direct communication via data streams while others use memory 106.
- Kernel circuits may be implemented to utilize either direct communication via data streams or memory 106 for data transfers depending upon the particular application executed by host system 102 that is invoking the kernel circuit or the particular functions invoked by the application to perform the data transfers.
- Architecture 100 and other streaming architectures described herein provide a more efficient way to configure and manage kernel circuits.
- Instructions can be provided to kernel circuits in-band with the data payload of the data streams. Including the instructions with the data, e.g., "in-banding" the instructions, removes the need for control interface 122 when data streams are used and provides more efficient host system to kernel circuit communication.
- Host system 102 is capable of executing a software framework that includes one or more user applications such as memory mapped user application 124 and/or stream user application 126.
- Memory mapped user application 124 is an application executed by host system 102 that is configured to invoke kernel circuits such as kernel circuit 112 and exchange data with kernel circuit 112 using memory mapped interfaces 118 and 120, control interface 122, and memory 106.
- Stream user application 126 is an application executed by host system 102 that is configured to invoke kernel circuits such as kernel circuit 112 and exchange data with kernel circuit 112 using streaming interface 116.
- The software framework also includes a runtime 128.
- Runtime 128 provides functions, e.g., an application programming interface (API), for communicating with IC 104.
- Runtime 128 is capable of providing functions for implementing DMA transfers over PCIe.
- Driver 130 is capable of controlling an endpoint within host system 102 (not shown). In the case of a PCIe connection, for example, the endpoint within host system 102 is implemented as a root complex. Accordingly, driver 130 is capable of implementing and managing a plurality of read and write queues for storing descriptors that control the data transfers between host system 102 and IC 104.
- Driver 130 is capable of dividing a request for a large data transfer to a kernel circuit (e.g., a streamed data transfer) into multiple stream transfers of smaller chunks of data called packets. This division of data, or "packetization of data into packets", performed by driver 130 is largely hidden from kernel circuit 112. Packetization allows an interconnect fabric implemented in IC 104 to service a plurality of kernel circuits concurrently by interleaving packets destined to and/or from different kernel circuits.
- Driver 130 is capable of determining packet sizes to be large enough to efficiently amortize the packetization overhead while not being so large that the packets cause a kernel circuit to stall while waiting for a turn to send and/or receive streamed data while other kernel circuits are transferring streamed data.
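- As a concrete illustration of the packetization described above, the following sketch splits one large streamed transfer into fixed-size packets, each described by its own descriptor. The descriptor layout, field names, and the 4 KB packet size are illustrative assumptions, not the driver's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: one per packet, queued for the DMA to fetch. */
struct descriptor {
    uint64_t host_addr;   /* source address in host memory */
    uint32_t length;      /* packet payload length in bytes */
    uint32_t route_id;    /* identifies the destination kernel circuit */
    uint8_t  eot;         /* end-of-transfer marker on the last packet */
};

/* Packet size is a tuning choice: large enough to amortize per-packet
 * overhead, small enough that one kernel cannot monopolize the fabric. */
#define PACKET_SIZE 4096u

/* Split a large user buffer into packet-sized descriptors. */
size_t packetize(uint64_t buf, size_t len, uint32_t route_id,
                 struct descriptor *out, size_t max_desc)
{
    size_t n = 0;
    while (len > 0 && n < max_desc) {
        uint32_t chunk = len < PACKET_SIZE ? (uint32_t)len : PACKET_SIZE;
        out[n].host_addr = buf;
        out[n].length    = chunk;
        out[n].route_id  = route_id;
        out[n].eot       = (len == chunk);  /* last packet of the transfer */
        buf += chunk;
        len -= chunk;
        n++;
    }
    return n;  /* number of descriptors produced */
}
```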
- As noted, control interface 122 tends to be a slow interface. If control interface 122 is used for out-of-band signaling with data streams, the speed and/or latency benefits of streaming are diminished.
- Consider an example in which kernel circuit 112 implements an encryption operation.
- Different data payloads provided to kernel circuit 1 12 typically require different keys for encryption.
- Were control interface 122 to be used, data streams to kernel circuit 112 would be stopped, the keys updated via control interface 122, and then the data stream(s) resumed.
- Such operations would be coordinated by host system 102, which adds to the overhead of host system 102.
- Instead, one or more instructions to kernel circuit 112 are provided in-band. As such, new and/or updated keys can be included in the data stream in-band as provided to kernel circuit 112. The instruction can be included with the payload or separate from it.
- For example, the instructions can be specified in a custom defined header for each packet.
- Host system 102 is capable of sending the encryption key as part of a packet header for the plaintext payload(s) of one or more packets upon which kernel circuit 112 is to operate.
- Kernel circuit 112 is capable of operating efficiently, in this case switching encryption keys for different payloads, without host system 102 incurring synchronization overhead and with reduced latency compared to conventional techniques for data transfer, as kernel circuit 112 need not be stopped and/or synchronized with control interface 122.
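- The disclosure does not define the in-band header layout; the following is a minimal sketch of what a custom packet header carrying a key-update instruction might look like. All field names and the SET_KEY opcode are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical in-band header pre-pended to a packet's payload. A
 * SET_KEY instruction lets the host switch encryption keys without
 * stopping the data stream or touching control interface 122. */
enum kernel_opcode { OP_NONE = 0, OP_SET_KEY = 1 };

struct stream_packet_header {
    uint16_t opcode;      /* in-band instruction for the kernel circuit */
    uint16_t key_len;     /* valid bytes in key[]; 0 when opcode == OP_NONE */
    uint8_t  key[32];     /* replacement key when opcode == OP_SET_KEY */
    uint32_t payload_len; /* plaintext bytes following this header */
};
```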
- FIG. 2 illustrates another example implementation of architecture 100 of FIG. 1.
- FIG. 2 illustrates further aspects of architecture 100 not illustrated in the higher-level view described in connection with FIG. 1.
- Some elements shown in FIG. 1 are not illustrated in FIG. 2, such as selected elements of the software framework executed by host system 102, endpoint 108 and memory controller 114 within IC 104, and memory 106.
- Driver 130 of the software framework executed by host system 102 is shown.
- Driver 130 is capable of implementing a plurality of queues 202-1 through 202-8.
- Driver 130 is capable of creating a read queue and a write queue for each kernel circuit that is implemented within IC 104.
- Queues 202 configured as write queues are shaded, while queues 202 configured as read queues are not shaded.
- Because IC 104 implements four kernel circuits 234-1, 234-2, 234-3, and 234-4, driver 130 implements four write queues (e.g., 202-1, 202-3, 202-5, and 202-7) and four read queues (e.g., 202-2, 202-4, 202-6, and 202-8).
- Each of queues 202 is capable of storing one or more descriptors, where each descriptor describes a data transfer to be performed.
- Each descriptor stored in a write queue describes a data transfer from host system 102 to a kernel circuit 234, while each descriptor stored in a read queue describes a data transfer from a kernel circuit 234 to host system 102.
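- A sketch of how such a queue pair might be organized is shown below; the ring layout, depth, and field names are assumptions for illustration, not the actual driver structures.

```c
#include <stdint.h>

/* Hypothetical per-kernel queue pair maintained by driver 130. */
#define QUEUE_DEPTH 64u  /* power of two so free-running indices wrap cleanly */

struct descriptor {
    uint64_t host_addr;  /* packet location in host memory */
    uint32_t length;     /* packet length in bytes */
};

struct queue {
    struct descriptor ring[QUEUE_DEPTH];
    uint32_t head;       /* next slot the driver fills */
    uint32_t tail;       /* next descriptor the DMA will fetch */
};

struct kernel_queue_pair {
    struct queue wq;     /* write queue, e.g., queue 202-1 feeding buffer 218 */
    struct queue rq;     /* read queue, e.g., queue 202-2 draining buffer 220 */
};

/* Descriptors available for the DMA to fetch from a queue. */
static inline uint32_t pending(const struct queue *q)
{
    return (q->head - q->tail) & (QUEUE_DEPTH - 1u);
}
```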
- DMA 110 includes two channels.
- The write channel supports transfer of data from host system 102 to kernel circuits 234.
- The write channel includes a write circuit 204 and an arbitration circuit 206.
- Write circuit 204 is capable of storing commands and/or data received from host system 102 prior to forwarding the commands and/or data to kernel circuits 234.
- The read channel supports transfer of data from kernel circuits 234 to host system 102.
- The read channel includes a read circuit 208 and an arbitration circuit 210.
- Read circuit 208 is capable of storing data received from kernel circuits 234 prior to forwarding the data to host system 102.
- DMA 110 moves data between host memory (not shown) of host system 102 and buffers 218, 220, 222, 224, 226, 228, 230, and 232. DMA 110 fetches and maintains a list of addresses, e.g., descriptors, for every packet to be transferred, and forms the sequence of commands and addresses for endpoint 108.
- DMA 110 is highly configurable. Accordingly, traffic management and flow control for DMA 110 are performed through stream traffic manager 212. Stream traffic manager 212 effectively ensures that all kernel circuits 234 have fair access to DMA 110 for data transfer to and from host system 102.
- Stream traffic manager 212 is coupled to DMA 1 10 and to interconnects 214 and 216.
- Stream traffic manager 212 is capable of regulating the flow of data streams/packets between host system 102 and kernel circuits 234.
- Stream traffic manager 212 includes a controller 236, one or more buffers 238, one or more data mover engines 240, a flow to pipe map (map) 242, and a pipe to route map (map) 244.
- Interconnect 214 and interconnect 216 implement interface 116 of FIG. 1.
- Interconnect 214 is configured to receive packetized data from stream traffic manager 212 and route the packetized data to appropriate kernel circuits 234.
- Interconnect 216 is configured to receive packetized data from kernel circuits 234 and provide the packetized data to stream traffic manager 212.
- Kernel circuits 234 are connected to interconnect 214 and interconnect 216 through buffers.
- Each of kernel circuits 234 has an input port configured to receive data streams through a corresponding input buffer (e.g., buffers 218, 222, 226, and 230) and an output port configured to send data streams through a corresponding output buffer (e.g., buffers 220, 224, 228, and 232).
- Kernel circuit 234-1 is connected to interconnect 214 through buffer 218 and to interconnect 216 through buffer 220.
- Kernel circuit 234-2 is connected to interconnect 214 through buffer 222 and to interconnect 216 through buffer 224.
- Kernel circuit 234-3 is connected to interconnect 214 through buffer 226 and to interconnect 216 through buffer 228.
- Kernel circuit 234-4 is connected to interconnect 214 through buffer 230 and to interconnect 216 through buffer 232.
- While interconnects 214 and 216 may be implemented as AXI-stream interconnects, the inventive arrangements are not intended to be so limited. Any of a variety of circuit architectures for delivering packetized data can be used. Other example circuit architectures that may be used to implement interconnects 214 and 216 include, but are not limited to, a crossbar, a multiplexed bus, a mesh network, and/or a Network-on-Chip (NoC).
- Each of input buffers 218, 222, 226, and 230 is coupled to interconnect 214 and an input port of kernel circuits 234-1 , 234-2, 234-3, and 234-4, respectively.
- Each input buffer is capable of temporarily storing packetized data from host system 102 directed to the corresponding kernel circuit 234 in case the kernel circuit is not able to immediately absorb or process the received data.
- Each input buffer is also capable of converting packetized data received from host system 102 into a data stream that is provided to the corresponding kernel circuit 234. For example, each input buffer is capable of combining a sequence of one or more packets to generate a data stream that can be provided to the corresponding kernel circuit.
- Each of output buffers 220, 224, 228, and 232 is coupled to interconnect 216 and an output port of kernel circuits 234-1 , 234-2, 234-3, and 234-4, respectively.
- Each output buffer is capable of temporarily holding a data stream output from the corresponding kernel circuit 234, converting the data stream into packetized data, and sending the packetized data to host system 102 via interconnect 216.
- Each output buffer is capable of storing data in case the kernel circuit is unable to keep pace with the streaming infrastructure.
- Each output buffer, for example, is capable of separating the data stream output from the corresponding kernel circuit into one or more packets.
- The output buffers 220, 224, 228, and 232 are capable of providing kernel tagging information to identify the source and/or destination kernel circuits.
- An output buffer is capable of adding the tagging information as a pre-pended header. The tagging performed by the output buffer allows data within the packets to be placed or routed to the proper place in host memory or to the appropriate kernel circuit.
- Each output buffer corresponding to a kernel circuit 234 is capable of tagging each packet with a source kernel identifier and sending the packets to interconnect 216.
- Interconnect 216 delivers the packets to stream traffic manager 212 and to DMA 110.
- DMA 110 moves the packetized data to host memory.
- Operation of kernel circuit 234-1 is described below for purposes of illustration. It should be appreciated that kernel circuits 234-2, 234-3, and 234-4 may operate in the same or similar manner.
- An input port of kernel circuit 234-1 is connected to interconnect 214 through buffer 218.
- An output port of kernel circuit 234-1 is connected to interconnect 216 through buffer 220.
- Write queue 202-1 is mapped to input buffer 218, and read queue 202-2 is mapped to output buffer 220.
- Each of queues 202 is mapped to one of buffers 218-232. Buffers 218-232, however, may be mapped to more than one of queues 202.
- Queues 202-1 and 202-2 correspond to buffers 218 and 220; queues 202-3 and 202-4 correspond to buffers 222 and 224; queues 202-5 and 202-6 correspond to buffers 226 and 228; and queues 202-7 and 202-8 correspond to buffers 230 and 232.
- Host system 102 executes a user application that is configured for data streaming.
- Host system 102 creates a pair of queues 202.
- The user application may invoke a function provided by runtime 128 that causes driver 130 to create a pair of queues 202-1 and 202-2 corresponding to buffers 218 and 220, respectively.
- The host processor is capable of invoking further functions to configure control registers within DMA 110 (not shown) and maps 242 and 244 of stream traffic manager 212 so that data can be streamed between host system 102 and kernel circuit 234-1, in this example.
- Host system 102 places descriptors within queue 202-1 specifying instructions for sending (e.g., writing) data to kernel circuit 234-1 and, as appropriate, places descriptors within read queue 202-2 specifying instructions for receiving (e.g., reading) data from kernel circuit 234-1.
- Driver 130 is capable of packetizing the data to be sent to IC 104 and notifying DMA 110 of the number of descriptors available in queues 202 to be fetched. DMA 110 conveys the information to stream traffic manager 212.
- Stream traffic manager 212 maintains a mapping of queues 202 to buffers 218-232 using map 242 and map 244. Using the stored mapping, stream traffic manager 212 determines that queue 202-1 corresponds to buffer 218 and that queue 202-2 corresponds to buffer 220. Controller 236, being aware of descriptors available in queue 202-1 , is capable of accessing buffer 218 for the input port of kernel circuit 234-1. Controller 236 determines whether buffer 218 has space available to receive data and, if so, the amount of data that can be received and stored in buffer 218.
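- The following sketch models that bookkeeping: a queue is resolved through the two maps to a kernel input buffer, and a transfer is admitted only if the buffer can hold the whole packet. All structures and names are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_QUEUES 8u

struct buffer_state {
    uint32_t capacity;  /* total bytes in the kernel's input buffer */
    uint32_t occupied;  /* bytes currently held for the kernel */
};

/* Simplified stand-ins for map 242 (flow to pipe) and map 244 (pipe to route). */
static uint32_t flow_to_pipe[NUM_QUEUES];
static uint32_t pipe_to_route[NUM_QUEUES];
static struct buffer_state input_buf[NUM_QUEUES];

/* Controller 236's admission test: initiate a transfer only when the
 * target input buffer has room for the entire packet, so the packet can
 * never back-pressure the interconnect. */
static bool may_transfer(uint32_t queue_id, uint32_t packet_len)
{
    uint32_t route = pipe_to_route[flow_to_pipe[queue_id]];
    const struct buffer_state *b = &input_buf[route];
    return (b->capacity - b->occupied) >= packet_len;
}
```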
- DMA 110 is capable of determining how full each of queues 202 is and informing controller 236.
- Write circuit 204, for example, is capable of determining the number of descriptors in each of queues 202-1, 202-3, 202-5, and 202-7.
- Read circuit 208 is capable of determining the number of descriptors in each of queues 202-2, 202-4, 202-6, and 202-8.
- Write circuit 204 and read circuit 208 are capable of informing stream traffic manager 212 of the number of descriptors in the respective queues 202. Further, write circuit 204 and read circuit 208 are capable of retrieving descriptors from queues 202 under control of stream traffic manager 212.
- Buffer(s) 238 store descriptors retrieved from queues 202 by way of DMA 110.
- Controller 236 is capable of requesting that DMA 110 retrieve a particular number of descriptors depending upon the amount of space available within buffer(s) 238.
- DMA 110 provides the retrieved descriptors to stream traffic manager 212.
- Stream traffic manager 212 is capable of internally storing, within buffer(s) 238, a subset of the descriptors stored in each of queues 202.
- The format or syntax of the descriptors indicates how many descriptors are needed to form a packet and the number of bytes in the packet.
- Controller 236, in response to determining that buffer 218 has space available to receive data, evaluates the descriptors stored within buffer(s) 238 corresponding to kernel circuit 234-1 (e.g., where the descriptors were retrieved from queue 202-1) and determines, based upon the data within the descriptor(s) themselves, the number of descriptors to execute to retrieve a sufficient amount of data (e.g., packet(s)) to store in buffer 218 without overrunning the available space of buffer 218.
- Each of data mover engines 240 is capable of retrieving data from host system 102 and sending data to host system 102 via DMA 110. Data mover engines 240 are capable of operating concurrently.
- Controller 236 is capable of assigning descriptors to be executed from buffer(s) 238 to available ones of data mover engines 240.
- Each data mover engine 240 processes the assigned descriptors by fetching the data specified by each of the respective descriptors.
- A data mover engine 240 is capable of sending retrieved packetized data specified by the descriptor(s) to buffer 218 via interconnect 214.
- Input buffer 218 is capable of storing the packetized data, converting the packetized data into a data stream, and providing the data stream to kernel circuit 234-1.
- The packet handling abilities of stream traffic manager 212 allow packets that may correspond to different data streams to be retrieved in an interleaved manner. Packets can be retrieved from host system 102 (or sent to host system 102) in an interleaved manner for N different data streams.
- Stream traffic manager 212 is capable of performing the operations described for each of kernel circuits 234. As such, stream traffic manager 212 is capable of continually monitoring the input buffer for each kernel circuit 234 and initiating a data transfer to the buffer only in response to first determining that the input buffer has space to receive and store the data. In other words, controller 236 is capable of continually determining which descriptors in queues 202 have corresponding buffers in IC 104 that have sufficient space available and then executing such descriptors.
- The communication bus connecting IC 104 and host system 102 is capable of simultaneously carrying multiple descriptors and/or data being fetched.
- Each of interconnects 214 and 216 is capable of conveying a single packet at a time.
- Arbitration circuit 206 is capable of implementing a round-robin arbitration scheme to pass one packet at a time corresponding to different kernel circuits. In other embodiments, arbitration circuit 206 may use a different arbitration scheme. Because stream traffic manager 212 only executes descriptors (initiates read requests) for those kernel circuits 234 having available space in the input buffer, the packet received from arbitration circuit 206 is passed on to the intended input buffer of the target kernel circuit 234 and is guaranteed not to have any back-pressure. Space for receiving the packetized data is guaranteed since space in the input buffer was pre-allocated.
- Stream traffic manager 212 is further capable of instructing DMA 110 to fetch data in an interleaved manner.
- Controller 236 requests DMA 110 to retrieve one or more packets for kernel circuit 234-1, then one or more packets for kernel circuit 234-2, and so on, based upon which kernel circuits are busy and the available space in the input buffers.
- Stream traffic manager 212 performs arbitration among kernel circuits 234 knowing how busy each of kernel circuits 234 is and how much data storage is available within each respective input buffer of each kernel circuit 234.
- Controller 236 stores the first "N" descriptors for each of the write queues 202 locally in buffer(s) 238 and performs a round-robin arbitration scheme checking each input buffer of each kernel circuit for available space.
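- Continuing the admission-test sketch above, a round-robin sweep over the cached write-queue descriptors might look as follows; again, the structures are illustrative assumptions rather than the actual controller logic.

```c
/* Per-queue size of the next cached packet (from buffers 238); 0 = none. */
static uint32_t cached_len[NUM_QUEUES];
static uint32_t rr_next;  /* queue that gets the next turn */

/* Grant one turn per call, starting from rr_next so every kernel circuit
 * is visited before any kernel circuit is served twice. Returns the queue
 * granted this turn, or -1 if all eligible queues lack buffer space. */
int arbitrate_once(void)
{
    for (uint32_t i = 0; i < NUM_QUEUES; i++) {
        uint32_t q = (rr_next + i) % NUM_QUEUES;
        if (cached_len[q] != 0 && may_transfer(q, cached_len[q])) {
            /* Reserve input-buffer space, then hand the descriptor to a
             * free data mover engine 240 (elided here). */
            input_buf[pipe_to_route[flow_to_pipe[q]]].occupied += cached_len[q];
            cached_len[q] = 0;
            rr_next = (q + 1) % NUM_QUEUES;
            return (int)q;
        }
    }
    return -1;
}
```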
- Architecture 100 is capable of operating in a similar manner when transferring data from kernel circuits 234 to host system 102.
- Stream traffic manager 212 is capable of storing the first "N" descriptors of each of the read queues 202-2, 202-4, 202-6, and 202-8.
- Stream traffic manager 212 is capable of determining when result data is available in output buffers for kernel circuits 234.
- Controller 236 initiates a data transfer from the output buffer to host system 102 using an available data mover engine 240. Availability of the descriptor indicates that host system 102 has available space for receiving the results from the kernel circuit.
- Kernel circuit 234-1 is capable of operating on data from input buffer 218. Kernel circuit 234-1 outputs result data to output buffer 220 as a data stream.
- Stream traffic manager 212, e.g., controller 236, is capable of monitoring the output buffers to determine when data is available, e.g., when at least a complete packet of data is available in an output buffer and the corresponding read queue has sufficient space available to store the data (e.g., the at least a complete packet).
- In response to determining that output buffer 220 has data available and determining that a descriptor is available in the corresponding read queue 202-2 (which may be retrieved and cached in a buffer 238 in stream traffic manager 212), controller 236 initiates a data transfer from output buffer 220 through interconnect 216 to DMA 110 and to host system 102.
- Output buffer 220 converts the data stream to packetized data before sending the data to interconnect 216 and on to host system 102.
- Arbitration circuit 210 is capable of implementing round-robin arbitration. In other embodiments, arbitration circuit 210 is capable of implementing other arbitration techniques. The arbitration techniques, whether round-robin or otherwise, implement interleaving or rotation of data streams and/or packets from kernel circuits 234.
- Each active kernel circuit receives a portion of the IC's data transfer bandwidth.
- Concurrent operation of multiple streaming enabled kernel circuits typically means that such kernel circuits are designed to operate on fragments of data as the data fragments arrive at each respective kernel circuit, rather than operating on the entire completed data transfer before computing commences. This ability to operate on smaller fragments of data gives streaming enabled kernel circuits as described herein quicker access to data, which facilitates lower latency, higher performance, lower data storage requirements, lower overall cost, and lower power consumption.
- When interleaving (or rotating) among different kernel circuits sending data to and/or receiving data from DMA 110, stream traffic manager 212 is capable of ensuring that the interconnect fabric, e.g., interconnects 214 and 216, is not blocked by a slow kernel circuit. This is accomplished, at least in part, by using buffers 218-232.
- Each of buffers 218-232 is sized to store at least one complete packet of data. As discussed, data directed to kernel circuits is not sent unless buffer space is available in the input buffer of the kernel circuit.
- The kernel circuit is capable of emptying the buffer on its own timetable without negatively affecting traffic on interconnect 214, thereby preventing the congestion condition known as "head-of-line blocking."
- Data directed to host system 102 from kernel circuits is not sent from the kernel circuits across interconnect 216 until a full packet has been transferred to the output buffer.
- Each output buffer is capable of receiving and storing a minimum of an entire packet before attempting to send the data to interconnect 216. This feature ensures that once transmission of a packet commences, the transmission will complete as quickly as interconnect 216 and the upstream infrastructure can absorb the transfer, irrespective of kernel circuit behavior or kernel circuit output data rate.
- The kernel circuits and buffers are implemented using programmable circuitry. As such, the buffers are only created for kernel circuits that are actually implemented in IC 104. Circuit resources of IC 104 are not wasted on input and/or output buffers when a small number of kernel circuits are deployed. Resource usage scales with the number of kernel circuits implemented in IC 104.
- Data transfer across interconnects 214 and 216 is regulated through a system of buffer credits managed by stream traffic manager 212.
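- A minimal sketch of such credit accounting follows; the credit granularity (one credit per packet slot) and all names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_KERNELS 4u

/* One credit per free packet slot in each kernel circuit's input buffer. */
static uint32_t credits[NUM_KERNELS];

/* Launch a packet only when a credit is in hand, so the packet is
 * guaranteed to drain off the interconnect into the input buffer. */
bool send_packet(uint32_t kernel)
{
    if (credits[kernel] == 0)
        return false;   /* no space: leave the interconnect free for others */
    credits[kernel]--;  /* slot reserved before the packet is sent */
    /* ... packet traverses interconnect 214 into the input buffer ... */
    return true;
}

/* Signaled when the kernel circuit consumes a packet from its buffer. */
void on_packet_drained(uint32_t kernel)
{
    credits[kernel]++;  /* slot is free again */
}
```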
- Runtime 128 is capable of providing a variety of application programming interfaces (APIs) that may be invoked by the user applications to support communication directly with kernel circuits using data streams.
- The following is a list of example APIs provided by runtime 128.
- clCreateHostPipe An OpenCL API that creates a read or write type data buffer for streaming data, also referred to as a "streaming pipe".
- Runtime 128 further may provide APIs for creating, destroying, starting, stopping, and modifying read and/or write queue pairs:
- xclCreateWriteQueue Creates a write queue; a queue handle for the created write queue is returned for future access.
- xclCreateReadQueue Creates a read queue; a queue handle for the created read queue is returned for future access.
- xclModifyQueue Modifies parameters of the specified read/write queue.
- xclStartQueue Brings the specified read/write queue to a running state where the queue is able to start accepting and processing DMA requests.
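- The prototypes for these calls are not reproduced in this text; the following usage sketch assumes hypothetical signatures simply to show the intended queue lifecycle.

```c
/* Illustrative only: the signatures below are assumptions, not the
 * runtime's actual prototypes. */
typedef void *queue_handle;

queue_handle xclCreateWriteQueue(int device, int route_id);
queue_handle xclCreateReadQueue(int device, int flow_id);
int          xclStartQueue(queue_handle q);

/* One write/read queue pair per kernel circuit, mirroring queues
 * 202-1 and 202-2 for kernel circuit 234-1. */
void setup_stream_queues(int device)
{
    queue_handle wq = xclCreateWriteQueue(device, /* route_id */ 1);
    queue_handle rq = xclCreateReadQueue(device, /* flow_id */ 1);
    xclStartQueue(wq);  /* queues begin accepting DMA requests */
    xclStartQueue(rq);
}
```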
- Runtime 128 further may provide APIs for issuing writes to kernel circuits and reads from kernel circuits.
- Driver 130 further may provide APIs supporting operation of DMA 110.
- Runtime 128 provides input/output control (IOCTL) system calls for input/output operations relating to IC 104 that can be invoked to create, destroy, start, stop, and modify read and/or write requests.
- These system calls are not available to user space applications executing in host system 102.
- Runtime 128 further may provide Portable Operating System Interface (POSIX) read/write functions and asynchronous I/O (AIO) read/write functions that are available to user space applications executed within host system 102.
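- For example, a user space application might stream a buffer to a kernel circuit and read back results with ordinary POSIX calls, as sketched below. The device path is a hypothetical placeholder; the read/write calls themselves are standard POSIX.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write one buffer to a kernel circuit's write queue and read the
 * results back from its read queue via a (hypothetical) device node. */
int stream_roundtrip(const void *in, size_t in_len, void *out, size_t out_len)
{
    int fd = open("/dev/stream_queue0", O_RDWR);  /* placeholder path */
    if (fd < 0)
        return -1;
    ssize_t w = write(fd, in, in_len);   /* enqueue host-to-kernel transfer */
    ssize_t r = read(fd, out, out_len);  /* block until kernel results land */
    close(fd);
    return (w < 0 || r < 0) ? -1 : 0;
}
```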
- A system executing an electronic design automation (EDA) application that includes a hardware compiler/system linker is capable of mapping kernel arguments to queues during a design flow (e.g., high-level synthesis, synthesis, placement, routing, and/or configuration bitstream generation) implementing the kernel.
- The mapping information is generated and stored with the configuration bitstream (e.g., a partial configuration bitstream) specifying the kernel circuit within a container file.
- The container file is stored in host system 102 for use and implementation within IC 104.
- When host system 102 retrieves the container file to implement the configuration bitstream from the container file with IC 104, host system 102 further is capable of extracting the metadata including the mapping information generated during compilation.
- The mapping information is provided to runtime 128 for use in setting up communication paths to route data streams between host system 102 and the kernel circuit once implemented within IC 104.
- The EDA application is capable of generating a kernel circuit (e.g., a configuration bitstream specifying the kernel circuit) configured to use data streams in lieu of memory mapped transactions involving either off-chip RAM or internal RAM for data transfers, based upon the usage of "pipe" data constructs within the program code for the kernel.
- In response to detecting the pipe data structures, the EDA application is capable of generating the necessary hardware infrastructure and/or circuitry supporting data transfers using data streams as described in connection with FIGs. 1 and/or 2.
- An example of a kernel specified in OpenCL is provided below as Example 1.
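- The text of Example 1 is not reproduced in this excerpt. A minimal sketch of the kind of pipe-based OpenCL kernel described, using the pipes p1 and p2 named in the surrounding discussion (the computation itself is a placeholder):

```c
/* Sketch only: p1 carries host-to-kernel data, p2 carries
 * kernel-to-host results, per the surrounding description. */
__kernel void example1(__read_only pipe int p1,
                       __write_only pipe int p2,
                       int count)
{
    for (int i = 0; i < count; i++) {
        int value;
        if (read_pipe(p1, &value) == 0) {  /* 0 indicates success */
            value += 1;                    /* placeholder computation */
            write_pipe(p2, &value);
        }
    }
}
```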
- When compiling the above example kernel, the EDA application generates mapping information for p1 and p2.
- The mapping information includes register settings for configuring stream traffic manager 212 (e.g., by storing such settings in maps 242 and 244) and DMA 110 (by storing settings in control registers therein) to properly route data streams between host system 102 and a particular kernel circuit such as kernel circuit 234-1 once implemented within IC 104.
- The mapping information specifies the particular route_id and flow_id to which each pipe is bound and/or static information relating to pipe p1 and pipe p2.
- This mapping data is stored as metadata within the container file for the configuration bitstream specifying the kernel circuit generated from the kernel (e.g., program code).
- Runtime 128 and/or driver 130 assigns the operation to p1 and binds p1 to queue 202-1.
- Host system 102 looks up a route_id for kernel circuit 234-1 from internal tables. The route_id specifies the location of kernel circuit 234-1.
- Host system 102 configures the control registers of DMA 110 with pipe p1 and the associated queue 202-1.
- Host system 102 creates an entry correlating the route_id for kernel circuit 234-1 with queue 202-1 and pipe p1.
- In response to receiving data corresponding to pipe p1, stream traffic manager 212 is capable of tagging kernel circuit bound data belonging to p1 with the correct route_id. Given data tagged with this route_id, stream traffic manager 212 and interconnect 214 are able to deliver data to kernel circuit 234-1 via buffer 218.
- Runtime 128 and/or driver 130 are capable of assigning that operation to p2 and binding p2 to queue 202-2.
- Host system 102 looks up the flow_id that is used to tag host bound data from kernel circuit 234-1.
- Kernel circuit 234-1 is capable of tagging outbound data with the appropriate flow_id.
- Buffer 220 includes circuitry that is capable of tagging the outbound data with the appropriate flow_id.
- Host system 102 configures DMA 110 with pipe p2 and associates pipe p2 with queue 202-2.
- Host system 102 further creates an entry correlating the flow_id for kernel circuit 234-1 (e.g., buffer 220) with queue 202-2 and pipe p2 for the data transfer.
- Stream traffic manager 212 is further capable of binding host-bound traffic tagged with the flow_id to pipe p2 when forwarding that data to DMA 110.
- DMA 110 is commanded to begin operation according to Example 1 above.
- FIG. 3 illustrates an example method 300 of transferring data between a host system and kernel circuits of a hardware accelerator using data streams.
- Method 300 can begin in a state where the host system stores one or more container files within memory.
- Each container file includes one or more configuration bitstreams and corresponding metadata.
- Each of the configuration bitstreams, which may be partial configuration bitstreams, specifies one or more kernel circuits.
- The host system selects a container file.
- The container file includes a configuration bitstream and metadata for the configuration bitstream.
- The configuration bitstream may be a partial configuration bitstream.
- The host system selects the container file in response to the user application requesting hardware accelerated functionality implemented by kernel circuits specified by the configuration bitstream in the container file.
- The user application may specify the particular container file to be selected or retrieved from memory.
- The host system extracts the configuration bitstream from the container file.
- The host system loads the configuration bitstream into an IC, e.g., IC 104, of the hardware accelerator.
- The kernel circuitry specified by the configuration bitstream is physically implemented within the IC and available to perform tasks requested by the host system.
- The host system determines one or more pipe properties from the metadata. For example, the host system extracts metadata for the configuration bitstream from the selected container file.
- The metadata includes mapping information generated when the kernels were compiled.
- The mapping data includes one or more pipe properties that may be used to configure DMA 110 and stream traffic manager 212.
- The pipe properties may include settings, e.g., register settings, such as a route_id and/or a flow_id that may be loaded into DMA 110 and/or the stream traffic manager to establish routes for exchanging data between the host system and the kernel circuit or circuits implemented by the configuration bitstream extracted from the selected container file.
- The metadata for the configuration bitstream includes additional information generated during the design flow that allows the stream traffic manager to operate more efficiently.
- The metadata can specify information, e.g., settings, that are specific to each kernel.
- The stream traffic manager is capable of adjusting how data is streamed to the kernel circuits and/or streamed from the kernel circuits to the host system on a per-kernel circuit basis.
- The metadata can specify the size of the kernel circuit's working data set (which corresponds to packet size), the compute time required for the kernel circuit per data set, the amount of prefetching desired for the kernel circuit, and the like.
- The stream traffic manager can adjust the amount of data retrieved for the kernels and the amount of prefetching in accordance with the metadata for that particular kernel circuit during operation.
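- The shape of that per-kernel metadata might resemble the following sketch; the field names and layout are assumptions, since the container-file format is not detailed here.

```c
#include <stdint.h>

/* Hypothetical per-kernel tuning record carried in the container file
 * and consumed by the stream traffic manager. */
struct kernel_stream_metadata {
    uint32_t route_id;        /* destination of host-to-kernel packets */
    uint32_t flow_id;         /* tag on kernel-to-host packets */
    uint32_t working_set;     /* bytes per data set; sets the packet size */
    uint32_t compute_cycles;  /* compute time per data set */
    uint32_t prefetch_depth;  /* packets to prefetch ahead of the kernel */
};
```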
- The host system is capable of sending the settings (e.g., pipe properties and/or other information as described) to the stream traffic manager and/or the DMA to configure the data path for streaming data between the implemented kernel circuit and the host system.
- The host system invokes a function or functions available in the driver and/or the runtime to configure the data path.
- The function, for example, writes the settings to the control registers of the DMA and the maps of the stream traffic manager.
- The stream traffic manager may include additional control registers that may be written with the settings described herein.
- The host system implements a data transfer directly from the host system to a kernel circuit as a data stream using the settings. For example, the host system adds one or more descriptors to the write queue within the driver that corresponds to the input buffer of the target kernel circuit.
- The DMA is capable of retrieving one or more of the descriptors and providing the retrieved descriptors to the stream traffic manager.
- The stream traffic manager stores the descriptors temporarily within internal buffers.
- The stream traffic manager is capable of monitoring the state of the input buffer for the target kernel circuit and, when space is available within the input buffer, executing one or more of the descriptors corresponding to the input buffer of the target kernel circuit using an available data mover engine contained therein.
- DMA 110 retrieves data from host memory in packetized form.
- The stream traffic manager streams the data to the input buffer of the target kernel circuit.
- The input buffer is capable of converting the packetized data into streamed data.
- The data that is transferred to the target kernel circuit includes one or more instructions embedded therein for the kernel circuit.
- The commands are said to be "in-band" with or relative to the data.
- The kernel circuits and/or the host system are capable of exchanging continuous data streams or, optionally, data streams interspersed with instructions (e.g., command or status information).
- The host system is capable of determining that the data transfer is to be implemented as a data stream based on a data type used by the user application requesting the data transfer and/or the particular API invoked by the user application.
- The host system implements a further data transfer from the kernel circuit directly to the host system as a data stream using the pipe properties. For example, the host system adds one or more descriptors to the read queue of the driver that corresponds to the output buffer of the target kernel circuit.
- The DMA is capable of retrieving one or more of the descriptors and providing the retrieved descriptors to the stream traffic manager.
- The stream traffic manager stores the descriptors temporarily within internal buffers.
- The stream traffic manager is capable of monitoring the state of the output buffer for the kernel circuit and, when a data stream is available within the output buffer, executing one or more of the descriptors corresponding to the output buffer of the target kernel circuit using an available data mover engine contained therein.
- The data mover engine of the stream traffic manager retrieves packetized data from the output buffer of the target kernel circuit and provides the packetized data to the DMA.
- The output buffer converts the data stream to packetized data.
- The DMA provides the packetized data to the host memory over the communication bus.
- FIG. 4 illustrates an example architecture 400 for exchanging data between kernel circuits using data streams.
- Architecture 400 supports use cases where applications require a plurality of large and complex kernel circuits and additional ICs are used to augment the programmable circuitry provided by a primary IC.
- The primary IC is configured to support communication with the host system via an endpoint and a DMA.
- The primary IC also includes a stream traffic manager.
- The stream traffic manager is capable of routing packetized data for kernel circuits to one of several different ports, each connected to an independent interconnect. Partitioning kernel circuits to different interconnects allows the kernel circuits to be located in different physical regions of an IC, e.g., different dies in the case of a multi-die IC. Further, the different interconnects isolate kernel circuits of different regions from interfering with one another. This partitioning allows multi-die ICs to be used and also secondary ICs to be used.
- Architecture 400 includes IC 104 and an IC 402.
- ICs 104 and 402 are coupled to a same circuit board, e.g., a hardware accelerator, that may also include RAM (not shown).
- Each of ICs 104 and 402 is implemented as a multi-die IC.
- IC 104 includes dies 404 and 406.
- IC 402 includes dies 408 and 410.
- Each of dies 404, 406, 408, and 410 is implemented to include programmable circuitry as described in greater detail herein in connection with FIG. 7.
- One or more of dies 404, 406, 408, and 410 includes one or more hardwired circuit blocks.
- Each of dies 404, 406, 408, and 410 is implemented as a field programmable gate array (FPGA).
- Dies 404 and 406 are included within a same package, while dies 408 and 410 are included in a different package.
- IC 104 and IC 402 can be implemented using any of a variety of available multi-die technologies.
- Dies 404 and 406 are mounted on an interposer that includes wires capable of conveying signals between dies 404 and 406.
- Dies 408 and 410 are mounted on an interposer that includes wires capable of conveying signals between dies 408 and 410.
- The dies may be mounted using a plurality of solder bumps or another connection technology.
- The interposer includes a plurality of through vias that allow selected signals to pass external to the multi-die IC package to a substrate, for example.
- Dies 404 and 408 are shaded to better illustrate the different circuit blocks included in each respective die.
- Dies 404 and 408 include additional circuit blocks not included in dies 406 and 410, respectively.
- Die 404 includes endpoint 108, DMA 110, stream traffic manager 212, and transceiver 442, whereas die 406 does not.
- One or more of endpoint 108, DMA 110, and/or transceiver 442 are implemented as hardwired circuit blocks.
- Alternatively, endpoint 108, DMA 110, and/or transceiver 442 are implemented in programmable circuitry. These circuit structures are not repeated within die 406.
- Die 408 includes transceiver 444 and satellite stream traffic manager 412, whereas die 410 does not. These structures are not repeated in die 410.
- endpoint 108, DMA 110, and stream traffic manager 212 are implemented substantially as described in connection with FIGs. 1 and 2.
- stream traffic manager 212 includes additional I/O ports; one or more of these additional ports connect to transceiver 442.
- one or more I/O ports of stream traffic manager 212 couple to die 406 and, in particular, to interconnect 416.
- interconnect 416 represents an instance of interconnect 214 and an instance of interconnect 216.
- each of dies 404 and 406 includes an instance of interconnect 214 and interconnect 216.
- kernel circuits 234 and the corresponding buffers are spread across dies 404 and 406.
- IC 104 is capable of operating as a master in that die 404 includes endpoint 108 to communicate with host system 102.
- host system 102 is capable of configuring DMA 110, stream traffic manager 212, and satellite stream traffic manager 412 to route packetized data.
- stream traffic manager 212 is capable of passing any necessary mapping data and/or settings on to satellite stream traffic manager 412.
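The configuration hand-off could be pictured as below. This is a hypothetical host-side sketch, not the patent's interface: the register helpers (stm_write_map, stm_forward_map, dma_enable_channel) and the assumption that kernel IDs split evenly across the two ICs are invented for illustration.

```c
#include <stdint.h>

#define KERNELS_PER_IC 8   /* assumed split of kernel IDs between IC 104 and IC 402 */

/* Invented register helpers standing in for writes to DMA 110,
 * stream traffic manager 212, and (via 212) satellite 412. */
extern void stm_write_map(uint16_t kernel_id, uint8_t port);    /* program 212 directly   */
extern void stm_forward_map(uint16_t kernel_id, uint8_t port);  /* 212 passes entry to 412 */
extern void dma_enable_channel(unsigned channel);               /* program DMA 110         */

void host_configure(const uint8_t *port_map, unsigned n_kernels)
{
    for (unsigned k = 0; k < n_kernels; k++) {
        if (k < KERNELS_PER_IC)
            stm_write_map((uint16_t)k, port_map[k]);    /* kernel located on IC 104 */
        else
            stm_forward_map((uint16_t)k, port_map[k]);  /* kernel located on IC 402 */
    }
    dma_enable_channel(0);  /* open the host <-> IC 104 stream channel */
}
```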
- host system 102 is capable of offloading tasks to IC 104 and/or IC 402. Further, host system 102 is capable of directing tasks to one or more of kernel circuits 234 and/or one or more of kernel circuits 440.
- when data is exchanged between kernel circuits located in a same die, the data may flow from a sending kernel circuit to an interconnect and from the interconnect to the receiving kernel circuit, bypassing, but under control of, stream traffic manager 212 and/or satellite stream traffic manager 412, as the case may be.
- the output buffer of the sending kernel circuit converts the data stream output from the sending kernel circuit into packetized data.
- the input buffer of the receiving kernel circuit converts the packetized data into a data stream for consumption by the receiving kernel circuit.
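The two buffer-level conversions can be sketched as a pair of C helpers. The header layout (destination kernel plus word count) is an assumption made for illustration, not the packet format defined by the patent:

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t dst_kernel;   /* routing target (assumed header field) */
    uint16_t n_words;      /* payload length in 32-bit stream words */
} pkt_hdr_t;

/* Output-buffer side: stream words -> packetized data. Returns packet size
 * in bytes; 'pkt' must hold sizeof(pkt_hdr_t) + 4*n_words bytes. */
size_t packetize(uint16_t dst, const uint32_t *words, uint16_t n_words, uint8_t *pkt)
{
    pkt_hdr_t h = { .dst_kernel = dst, .n_words = n_words };
    memcpy(pkt, &h, sizeof h);
    memcpy(pkt + sizeof h, words, 4u * n_words);
    return sizeof h + 4u * n_words;
}

/* Input-buffer side: packetized data -> stream words. Returns word count. */
uint16_t depacketize(const uint8_t *pkt, uint32_t *words_out)
{
    pkt_hdr_t h;
    memcpy(&h, pkt, sizeof h);
    memcpy(words_out, pkt + sizeof h, 4u * h.n_words);
    return h.n_words;
}
```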
- kernel circuits can be implemented to communicate directly with one another.
- the kernel circuits are created and implemented within programmable circuitry with this capability built in.
- Such connections are illustrated in FIG. 4 where kernel circuit 234-3 is capable of communicating directly with kernel circuit 234-4 to provide data results thereto without using stream traffic manager 212.
- in other cases, involvement of stream traffic manager 212 and/or satellite stream traffic manager 412 is needed.
- the streaming architecture described within this disclosure, which uses in-band instructions within the data streams passed from kernel circuit to kernel circuit, allows one kernel circuit to pass data directly to another kernel circuit with the instruction included in the data stream, thereby implementing chained processing of data through multiple kernel circuits without involvement of host system 102.
- the streaming architecture reduces the overhead imposed on the host system and makes more efficient use of the hardware resources.
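One purely illustrative way to model such an in-band instruction is a "next destination" field in the packet header, so each hop already knows where its result goes without a host round trip. The header layout, sentinel value, and routing helpers below are invented, not the patent's format:

```c
#include <stdint.h>

enum { DEST_HOST = 0xFFFF };   /* assumed sentinel: return results to the host */

typedef struct {
    uint16_t this_kernel;   /* kernel that should process this packet        */
    uint16_t next_dest;     /* in-band instruction: next kernel, or DEST_HOST */
    uint16_t n_words;       /* payload length in 32-bit words                 */
} chain_hdr_t;

/* Invented routing primitives of the traffic manager. */
extern void route_to_kernel(uint16_t kernel, const void *pkt);
extern void route_to_host(const void *pkt);

/* After a kernel finishes, its output packet already names the next
 * destination, so forwarding is a header lookup, not a host call. */
void stm_forward(const chain_hdr_t *h, const void *pkt)
{
    if (h->next_dest == DEST_HOST)
        route_to_host(pkt);
    else
        route_to_kernel(h->next_dest, pkt);
}
```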
- an upstream kernel circuit, e.g., a sending kernel circuit, performs compression while a downstream kernel circuit performs encryption.
- the upstream kernel circuit sends the resulting compressed data, which has been packetized by its output buffer, to the stream traffic manager circuitry, which routes the packetized data to the downstream kernel circuit, e.g., the receiving kernel circuit.
- the input buffer of the receiving kernel circuit converts the packetized data into a data stream.
- the downstream kernel circuit may provide the resulting encrypted data back to the stream traffic manager circuitry, which may then route the encrypted data to yet another kernel circuit or provide the encrypted data to host system 102.
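Continuing the header sketch introduced above, the compression-then-encryption chain might be tagged as follows; the kernel IDs and the 256-word payload size are arbitrary example values, not taken from the patent:

```c
#include <stdint.h>

typedef struct {
    uint16_t this_kernel;   /* kernel that should process this packet */
    uint16_t next_dest;     /* in-band instruction: next hop          */
    uint16_t n_words;       /* payload length in 32-bit words         */
} chain_hdr_t;

enum { COMPRESS_KERNEL = 3, ENCRYPT_KERNEL = 4, DEST_HOST = 0xFFFF };

/* Hop 1, built by the host: compress, then forward to the encryption kernel. */
static const chain_hdr_t hop1 = { COMPRESS_KERNEL, ENCRYPT_KERNEL, 256 };

/* Hop 2, emitted by the compression kernel's output buffer: encrypt, then
 * return the result to the host, completing the chain without host involvement
 * in the intermediate step. */
static const chain_hdr_t hop2 = { ENCRYPT_KERNEL, DEST_HOST, 256 };
```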
- I/O devices 620 include, but are not limited to, a keyboard, a display device, a pointing device, one or more communication ports, and a network adapter.
- a network adapter refers to circuitry that enables system 600 to become coupled to other systems, computer systems, remote printers, and/or remote storage devices through intervening private or public networks. Modems, cable modems, Ethernet cards, and wireless transceivers are examples of different types of network adapters that may be used with system 600.
- System 600 may include fewer components than shown or additional components not illustrated in FIG. 6 depending upon the particular type of device and/or system that is implemented.
- the particular operating system, application(s), and/or I/O devices included may vary based upon system type.
- one or more of the illustrative components may be incorporated into, or otherwise form a portion of, another component.
- a processor may include at least some memory.
- System 600 may be used to implement a single computer or a plurality of networked or interconnected computers each implemented using the architecture of FIG. 6 or an architecture similar thereto.
- the programmable interconnect circuitry typically includes a large number of interconnect lines of varying lengths interconnected by programmable interconnect points (PIPs).
- the programmable logic circuitry implements the logic of a user design using programmable elements that may include, for example, function generators, registers, arithmetic logic, and so forth.
- An IOB 704 may include, for example, two instances of an I/O logic element (IOL) 715 in addition to one instance of an INT 711.
- the actual I/O pads connected to IOL 715 may not be confined to the area of IOL 715.
- PROC 710 may be implemented as dedicated circuitry, e.g., as a hardwired processor, that is fabricated as part of the die that implements the programmable circuitry of the IC.
- PROC 710 may represent any of a variety of different processor types and/or systems ranging in complexity from an individual processor, e.g., a single core capable of executing program code, to an entire processor system having one or more cores, modules, co-processors, interfaces, or the like.
- PROC 710 may be omitted from architecture 700 and replaced with one or more of the other varieties of the programmable blocks described. Further, such blocks may be utilized to form a "soft processor" in that the various blocks of programmable circuitry may be used to form a processor that can execute program code as is the case with PROC 710.
- programmable circuitry refers to programmable circuit elements within an IC, e.g., the various programmable or configurable circuit blocks or tiles described herein, as well as the interconnect circuitry that selectively couples the various circuit blocks, tiles, and/or elements according to configuration data that is loaded into the IC. For example, circuit blocks shown in FIG. 7 that are external to PROC 710 such as CLBs 702 are considered programmable circuitry of the IC.
- In general, the functionality of programmable circuitry is not established until configuration data is loaded into the IC.
- a set of configuration bits may be used to program programmable circuitry of an IC such as an FPGA.
- the configuration bit(s) typically are referred to as a "configuration bitstream."
- programmable circuitry is not operational or functional without first loading a configuration bitstream into the IC.
- the configuration bitstream effectively implements a particular circuit design within the programmable circuitry.
- the circuit design specifies, for example, functional aspects of the programmable circuit blocks and physical connectivity among the various programmable circuit blocks.
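As a loose illustration of this load-then-operate model, the sketch below reads a bitstream file and hands it to a hypothetical configuration port; config_port_write() is invented and no real vendor API is implied:

```c
#include <stdio.h>
#include <stdlib.h>

/* Invented configuration-port interface; not a real vendor API. */
extern int config_port_write(const unsigned char *bits, size_t len);

int load_bitstream(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    fseek(f, 0, SEEK_END);          /* determine bitstream size */
    long len = ftell(f);
    rewind(f);
    if (len <= 0) {
        fclose(f);
        return -1;
    }

    unsigned char *bits = malloc((size_t)len);
    if (!bits || fread(bits, 1, (size_t)len, f) != (size_t)len) {
        free(bits);
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Until this write completes, the programmable circuitry implements no
     * user function; afterwards it realizes the encoded circuit design. */
    int rc = config_port_write(bits, (size_t)len);
    free(bits);
    return rc;
}
```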
- These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the operations specified in the flowchart and/or block diagram block or blocks.
- the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operations to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
Applications Claiming Priority (3)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/186,055 US10725942B2 (en) | 2018-11-09 | 2018-11-09 | Streaming platform architecture for inter-kernel circuit communication for an integrated circuit |
| US16/186,102 US10924430B2 (en) | 2018-11-09 | 2018-11-09 | Streaming platform flow and architecture for an integrated circuit |
| PCT/US2019/059771 WO2020097013A1 (en) | 2018-11-09 | 2019-11-05 | Streaming platform flow and architecture |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| EP3877864A1 (en) | 2021-09-15 |
Family
ID=69159961
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| EP19835920.0A Pending EP3877864A1 (en) | Streaming platform flow and architecture | | 2019-11-05 |
Country Status (5)

| Country | Link |
|---|---|
| EP (1) | EP3877864A1 (en) |
| JP (1) | JP2022506592A (en) |
| KR (1) | KR20210088653A (en) |
| CN (1) | CN112970010A (en) |
| WO (1) | WO2020097013A1 (en) |
Families Citing this family (6)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10725942B2 (en) | 2018-11-09 | 2020-07-28 | Xilinx, Inc. | Streaming platform architecture for inter-kernel circuit communication for an integrated circuit |
| US10990547B2 (en) | 2019-08-11 | 2021-04-27 | Xilinx, Inc. | Dynamically reconfigurable networking using a programmable integrated circuit |
| US11232053B1 (en) | 2020-06-09 | 2022-01-25 | Xilinx, Inc. | Multi-host direct memory access system for integrated circuits |
| US11539770B1 (en) | 2021-03-15 | 2022-12-27 | Xilinx, Inc. | Host-to-kernel streaming support for disparate platforms |
| US11456951B1 (en) | 2021-04-08 | 2022-09-27 | Xilinx, Inc. | Flow table modification for network accelerators |
| US11606317B1 (en) | 2021-04-14 | 2023-03-14 | Xilinx, Inc. | Table based multi-function virtualization |
Family Cites Families (7)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP4908017B2 (en) * | 2006-02-28 | 2012-04-04 | 富士通株式会社 | DMA data transfer apparatus and DMA data transfer method |
| US20100030927A1 (en) * | 2008-07-29 | 2010-02-04 | Telefonaktiebolaget Lm Ericsson (Publ) | General purpose hardware acceleration via deirect memory access |
| CN103714026B (en) * | 2014-01-14 | 2016-09-28 | 中国人民解放军国防科学技术大学 | A kind of memory access method supporting former address data exchange and device |
| CN104503948B (en) * | 2015-01-19 | 2017-08-11 | 中国人民解放军国防科学技术大学 | The close coupling of multi-core network processing framework is supported adaptively to assist processing system |
| CN104679689B (en) * | 2015-01-22 | 2017-12-12 | 中国人民解放军国防科学技术大学 | A kind of multinuclear DMA segment data transmission methods counted using slave for GPDSP |
| CN104679691B (en) * | 2015-01-22 | 2017-12-12 | 中国人民解放军国防科学技术大学 | A kind of multinuclear DMA segment data transmission methods using host count for GPDSP |
| CN105389277B (en) * | 2015-10-29 | 2018-04-13 | 中国人民解放军国防科学技术大学 | Towards the high-performance DMA components of scientific algorithm in GPDSP |
2019
- 2019-11-05 KR KR1020217017275A patent/KR20210088653A/en unknown
- 2019-11-05 EP EP19835920.0A patent/EP3877864A1/en active Pending
- 2019-11-05 WO PCT/US2019/059771 patent/WO2020097013A1/en active Search and Examination
- 2019-11-05 CN CN201980073849.2A patent/CN112970010A/en active Pending
- 2019-11-05 JP JP2021524028A patent/JP2022506592A/en active Pending
Also Published As

| Publication number | Publication date |
|---|---|
| WO2020097013A1 (en) | 2020-05-14 |
| JP2022506592A (en) | 2022-01-17 |
| CN112970010A (en) | 2021-06-15 |
| KR20210088653A (en) | 2021-07-14 |
Similar Documents

| Publication | Publication Date | Title |
|---|---|---|
| US10924430B2 (en) | | Streaming platform flow and architecture for an integrated circuit |
| US10725942B2 (en) | | Streaming platform architecture for inter-kernel circuit communication for an integrated circuit |
| EP3877864A1 (en) | | Streaming platform flow and architecture |
| US11677662B2 (en) | | FPGA-efficient directional two-dimensional router |
| US10437764B2 (en) | | Multi protocol communication switch apparatus |
| EP3400688B1 (en) | | Massively parallel computer, accelerated computing clusters, and two dimensional router and interconnection network for field programmable gate arrays, and applications |
| CN112740190A (en) | | Host proxy on gateway |
| US9934175B2 (en) | | Direct memory access for programmable logic device configuration |
| KR102654610B1 (en) | | Multistage boot image loading and configuration of programmable logic devices |
| US11726928B2 (en) | | Network interface device with bus segment width matching |
| CN112639738A (en) | | Data passing through gateway |
| WO2022212056A1 (en) | | Transporting request types with different latencies |
| US11496418B1 (en) | | Packet-based and time-multiplexed network-on-chip |
| US11789790B2 (en) | | Mechanism to trigger early termination of cooperating processes |
| CN112673351A (en) | | Streaming engine |
| CN117581200A (en) | | Loading data from memory during dispatch |
| US20230224261A1 (en) | | Network interface device |
| Bajpai et al. | | FPGA cluster based high performance cryptanalysis framework |
| CN117632256A (en) | | Apparatus and method for handling breakpoints in a multi-element processor |
| Salapura et al. | | A Multiprocessor System-on-a-Chip Design Methodology for Networking Applications |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: UNKNOWN |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| | PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| | 17P | Request for examination filed | Effective date: 20210520 |
| | AK | Designated contracting states | Kind code of ref document: A1. Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| | DAV | Request for validation of the european patent (deleted) | |
| | DAX | Request for extension of the european patent (deleted) | |
| | STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: EXAMINATION IS IN PROGRESS |