WO2022133047A1 - Dataflow function offload to reconfigurable processors

Dataflow function offload to reconfigurable processors

Info

Publication number
WO2022133047A1
Authority
WO
WIPO (PCT)
Prior art keywords
reconfigurable
data
buffers
dataflow
reconfigurable unit
Prior art date
Application number
PCT/US2021/063733
Other languages
English (en)
Inventor
Martin Russell Raumann
Qi ZHENG
Bandish B. Shah
Ravinder Kumar
Kin Hing LEUNG
Sumti Jairath
Gregory Frederick Grohoski
Original Assignee
SambaNova Systems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/127,929 (US 11,182,221 B1)
Priority claimed from US 17/379,924 (US 11,237,880 B1)
Priority claimed from US 17/379,921 (US 11,392,740 B2)
Application filed by SambaNova Systems, Inc. filed Critical SambaNova Systems, Inc.
Publication of WO2022133047A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7867 - Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/76 - Architectures of general purpose stored program computers
    • G06F15/82 - Architectures of general purpose stored program computers data or demand driven
    • G06F15/825 - Dataflow computers

Definitions

  • Prabhakar et al., “Plasticine: A Reconfigurable Architecture for Parallel Patterns,” ISCA ’17, June 24-28, 2017, Toronto, ON, Canada.
  • the technology disclosed relates to throughput improvements in machine learning applications that use processors like Central Processing Units (CPUs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Coarse-Grained Reconfigurable Architectures (CGRAs), Application-Specific Integrated Circuits (ASICs), Application Specific Instruction-set Processors (ASIPs), and Digital Signal Processors (DSPs).
  • Conventional stored-program computer architectures are based on an instruction-based control flow paradigm, in which a primitive set of operations is performed sequentially on data stored in some storage device in response to software instructions being sequentially provided to a central processor. A program counter controls when each instruction is to execute.
  • Dataflow architectures, by contrast, are based on the idea of disconnected computational actors organized into stages that can be pipelined. Data is moved down a pipeline from one processing stage to the next, rather than being stored in a memory while processing instructions are brought to a central processor to perform each subsequent step.
  • Whereas control-flow instructions are centrally controlled, dataflow stages execute primarily in response to the availability of all the required operands. Each processing element therefore has some way of knowing when all the operands are available before it can execute (or complete the execution of) the function of that stage.
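  • The firing rule just described can be pictured with a minimal sketch, shown below. The sketch is illustrative only and is not taken from the application; the Stage class, its per-operand queues, and the two-stage pipeline are hypothetical names chosen for the example. Each stage executes only once every one of its input operands has arrived, and no program counter decides when that happens.

```python
# Illustrative sketch only: a dataflow stage that fires when all of its operands are present.
# Names such as Stage, receive(), and try_fire() are hypothetical, not part of the disclosure.
from collections import deque

class Stage:
    def __init__(self, name, num_inputs, fn, downstream=None):
        self.name = name
        self.inputs = [deque() for _ in range(num_inputs)]  # one queue per operand
        self.fn = fn
        self.downstream = downstream                         # (stage, port) to forward results to

    def receive(self, port, value):
        self.inputs[port].append(value)
        self.try_fire()

    def try_fire(self):
        # The stage executes only when every input queue holds at least one operand;
        # there is no central program counter deciding when this runs.
        if all(q for q in self.inputs):
            operands = [q.popleft() for q in self.inputs]
            result = self.fn(*operands)
            if self.downstream:
                stage, port = self.downstream
                stage.receive(port, result)
            else:
                print(f"{self.name} produced {result}")

# Two-stage pipeline: multiply, then add a bias.
add = Stage("add", 2, lambda x, b: x + b)
mul = Stage("mul", 2, lambda a, b: a * b, downstream=(add, 0))
add.receive(1, 10)   # the bias arrives first; nothing fires yet
mul.receive(0, 3)
mul.receive(1, 4)    # all operands now available: mul fires, then add fires and prints 22
```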
  • Many kinds of algorithms can be implemented with dataflow processing, such as certain aspects of natural-language processing, recommendation engines, database analytics, scientific applications, SQL data processing and deep learning. The present application focuses on deep learning algorithms as an example, but the concepts discussed herein apply just as well to other types of problems.
  • Deep learning is a subset of machine learning algorithms that are inspired by the structure and function of the human brain.
  • Most deep learning algorithms involve artificial neural network architectures, in which multiple layers of neurons each receive input from neurons in a prior layer or layers, and in turn influence the neurons in the subsequent layer or layers. Training these neural network models can be computationally extremely demanding.
  • significant advances have been made by offloading many common computations to specialized GPU co-processor hardware, but even so, network training still can take an impractically long time on a single machine.
  • the computations involved in network training often include lengthy sequences that are highly repetitive, and that do not depend on the internal results from other instances of the sequence.
  • Such computations often can be parallelized by running different instances of the sequence on different machines.
  • the algorithms still require partial results to be shared periodically among the instances, so periodic sync-ups are still required as the algorithm proceeds.
  • Mechanisms for parallelizing neural network training can be divided roughly into two groups: model parallelism and data parallelism.
  • In model parallelism, the network model is divided up and parts of it are allocated to different machines.
  • the model is divided longitudinally, such that upstream portions of the model are executed by one machine, which passes its results to another machine that executes downstream portions of the model. In the meantime, the upstream machine can begin processing the next batch of training data through the upstream portions of the model.
  • the model may include branches which are later merged downstream. In such versions the different branches could be processed on different machines.
  • GPUs are now available which can share updates with GPUs in the same or different machines by way of specialized high bandwidth communication links directly interconnecting the GPUs, often separate from the channel that connects the GPU to its local CPU. But the sharing process is still orchestrated centrally during runtime. For example, in GPU implementations, when each processing node has completed its work prior to a synchronization step, it so notifies the host via a control communication of some kind. The host typically awaits receipt of such notifications from all the required processing nodes. Then the host sends instructions back to all the processing nodes to go ahead and exchange their partial results data with each other.
  • a further control signal round-trip occurs in case the aggregation of the partial results data from all the processing nodes is itself to be distributed among the processing nodes.
  • centralized control might be required to orchestrate the distribution of partial gradients among the workers, the combining of the partial gradients by one or more workers, and/or the distribution of the final gradients back to all the workers.
  • the control overhead arising from such centralized control can stifle scaling of the system beyond just a few nodes. Very few architectures with centralized synchronization scale adequately beyond single digit numbers of processing nodes.
  • CPUs and GPUs operate on a stream of instructions, where instructions perform stateful operations.
  • Instruction processors of this type are programmed using instructions, encoded into bits.
  • a task is specified as an ordered list of processor instructions in software.
  • These units have hardware architectures with mechanisms to track "program state".
  • Program state would include, among other things, a form of a global "program counter" register, to track the next instruction to be fetched from memory.
  • the hardware of such instruction processors would also have a pipeline to decode and execute these instructions that have been fetched (an instruction pipeline).
  • these architectures contain a pipeline through which a stream of instructions flows during execution, where each instruction performs operations and updates the hardware state.
  • a GPU can consist of an array of distributed computational units in cores, which generally rely on a shared pool of memory.
  • the distributed computational units are stored-program processors which are programmable by writing instructions that are fetched, decoded, and executed like a normal processor. Synchronization and communication are achieved by executing sequences of instructions that operate on the shared memory.
  • Reconfigurable processors include Field Programmable Gate Arrays (FPGAs).
  • So-called Coarse-Grained Reconfigurable Architectures (CGRAs) are being developed in which the configurable units in the array are more complex than those used in typical, more fine-grained FPGAs, and may enable faster or more efficient execution of various classes of functions.
  • CGRAs have been proposed that can enable implementation of energy-efficient accelerators for machine learning and artificial intelligence workloads. See Prabhakar et al., “Plasticine: A Reconfigurable Architecture for Parallel Patterns,” ISCA ’17, June 24-28, 2017, Toronto, ON, Canada.
  • Various aspects of some such CGRAs are described in the above-incorporated patent applications.
  • a CGRA as described in the above patent applications includes an array of reconfigurable units, sometimes referred to herein as Coarse Grained Reconfigurable Units (CGRUs).
  • the units can comprise somewhat specialized computational and memory units. These units are connected by a fabric to enable inter-unit communication and synchronization.
  • the components may be reconfigured in several ways, but often rely on direct hardware reconfiguration by altering their behavior under control of configuration bits loaded from a bit file into registers prior to runtime. No instructions are fetched, decoded, or executed during runtime; instead, state machines are configured by the bit file contents to implement sequences of operations.
  • Units of such CGRAs operate on streams of data and control messages (as opposed to instructions) that flow through a sea of configurable units, where the configurable units are programmed using configuration data, such as a bit file.
  • Configurable units have architectures that look and operate differently than stored-program instruction-based processors, as they have to manage execution in different ways.
  • Arrays of configurable units as in CGRAs and FPGAs have a different programming contract: configuration bits. These architectures do not have the hardware to fetch and process instructions; they do not have a global "program counter" register in the sense of instruction processors; and they do not have a pipeline that is built to fetch, decode, and execute an instruction stream.
  • configurable execution units and stateful elements are physically distributed on chip, and connected together using a programmable interconnect.
  • a program is not a stream of instructions; instead, configuration bits program the configurable execution units to construct a custom control and data path for an application.
  • the configurable units are programmed to operate on streams of data and control messages, to produce other data and control messages. This makes such architectures inherently distributed, without a single global program state.
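  • The "configuration as data" contrast with an instruction stream can be sketched as follows. This is a minimal, illustrative sketch only; the CONFIG dictionary, the unit and wire names, and the scale operation are hypothetical and are not the disclosed bit file format. The program is a description that wires units together, and running it only moves data through the constructed path; nothing fetches or decodes instructions at runtime.

```python
# Illustrative sketch only: configuration bits modeled as plain data that wires units together.
# The CONFIG dict plays the role of a bit file; all names here are hypothetical.
CONFIG = {
    "units": {
        "pmu0": {"kind": "memory", "init": [1.0, 2.0, 3.0]},        # a memory unit holding a stream
        "pcu0": {"kind": "compute", "op": "scale", "factor": 2.0},  # a compute unit configured to scale
    },
    "wires": [("pmu0", "pcu0")],  # programmable interconnect: pmu0 streams into pcu0
}

def run(config):
    # "Loading" the configuration constructs the custom data path; execution then just
    # streams data along the wires, with no instruction fetch, decode, or program counter.
    outputs = {}
    for src, dst in config["wires"]:
        stream = config["units"][src]["init"]
        unit = config["units"][dst]
        if unit["op"] == "scale":
            outputs[dst] = [x * unit["factor"] for x in stream]
    return outputs

print(run(CONFIG))  # {'pcu0': [2.0, 4.0, 6.0]}
```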
  • CGRAs are optimized for performing certain kinds of complex computations that are common in specific types of applications such as deep learning. This can limit their effectiveness when performing more general or simpler kinds of computations. Also, since CGRAs are optimized for processing flowing data, the orchestration of inter-node communications and data syncing can be a sub-optimal use of their power. It is desirable, therefore, to find a way to perform such functions without impacting the ability of the CGRA units to continue performing functions more in line with the capabilities of their powerful hardware components. It is particularly desirable to avoid or minimize any requirement for a separate host to perform such orchestration of inter-node communications and data syncing, in order to avoid the control overhead that tends to stifle scaling.
  • parallelizing deep learning applications requires periodic sharing of intermediate results among the various nodes operating in parallel.
  • intermediate results can include both partially aggregated gradients being shared with those of other worker nodes in order to enable calculation of the fully aggregated gradients, and fully aggregated gradients or updated neural network parameters being returned to the worker nodes.
  • partially aggregated gradients can be passed among the GPUs over a direct, high bandwidth link, but the GPUs themselves are then occupied in calculating the full aggregations and the updated parameters.
  • each GPU calculates the partial aggregations for one segment of the parameters, and these intermediate results are shared among the GPUs on the direct interconnect while each GPU works on calculating the partial aggregations for the next segment of parameters.
  • Though each GPU may receive the partial aggregations from other GPUs early enough, it cannot begin its full aggregation work yet because it is still working on calculating the partial aggregations for the next segment of parameters. It would be desirable to improve scaling of data parallelization in neural network training applications, so that larger and more powerful neural networks can be trained effectively.
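  • The segmented sharing described above can be sketched as follows. The sketch is illustrative only; the worker count, segment size, and gradient values are invented for the example. Because each segment's partial aggregations are combined independently, the exchange of one segment can in principle be overlapped with computation of the next segment on each worker.

```python
# Illustrative sketch only: segmenting ("bucketing") gradients so the partial aggregations
# for one segment can be exchanged while the next segment is still being computed.
def segmented_allreduce(per_worker_grads, segment_size):
    num_params = len(per_worker_grads[0])
    reduced = [0.0] * num_params
    for start in range(0, num_params, segment_size):
        end = min(start + segment_size, num_params)
        # In a real system, exchanging segment [start:end) over the interconnect would
        # overlap with computing the partial aggregations for the next segment.
        for i in range(start, end):
            reduced[i] = sum(grads[i] for grads in per_worker_grads)
    return reduced

per_worker = [[float(w)] * 8 for w in range(4)]         # 4 workers, 8 parameters each
print(segmented_allreduce(per_worker, segment_size=3))  # every entry is 0+1+2+3 = 6.0
```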
  • the invention involves a system including a plurality of functional units that execute different segments of a dataflow, and share intermediate results via a peer-to-peer messaging protocol.
  • the functional units are reconfigurable, with different units being reconfigurable at different levels of granularity.
  • the peer-to-peer messaging protocol includes control tokens or other mechanisms by which the consumer of the intermediate results learns that data have been transferred, and in response thereto triggers its next dataflow segment.
  • a host or configuration controller configures the data units with their respective dataflow segments, but once execution of the configured dataflow begins, no host need be involved in orchestrating data synchronization, the transfer of intermediate results, or the triggering of processing after the data are received. Control overhead is therefore minimized.
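  • A minimal sketch of such peer-to-peer triggering is shown below; it is illustrative only, with in-process queues standing in for the fabric and all names hypothetical. The producer transfers its intermediate results and then sends a control token; the consumer's next dataflow segment is triggered by the arrival of the token, with no host orchestrating the exchange.

```python
# Illustrative sketch only: a consumer triggered by a peer-to-peer control token rather than
# by a host. Queue-based "links" stand in for the fabric; names are hypothetical.
import threading, queue

data_link = queue.Queue()     # carries intermediate results
control_link = queue.Queue()  # carries control tokens

def producer_segment():
    partial = [i * i for i in range(5)]   # first dataflow segment produces intermediate results
    data_link.put(partial)                # transfer the intermediate results
    control_link.put("DATA_READY")        # control token: peer to peer, no host involved

def consumer_segment():
    token = control_link.get()            # block until the control token arrives
    assert token == "DATA_READY"
    partial = data_link.get()
    print("next segment consumes:", sum(partial))  # triggered by the token, not by a host

t_consumer = threading.Thread(target=consumer_segment)
t_producer = threading.Thread(target=producer_segment)
t_consumer.start(); t_producer.start()
t_producer.join(); t_consumer.join()
```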
  • Figure 1 shows an architectural level schematic of a data center in accordance with an implementation.
  • Figure 2A shows host sender buffers and host receiver buffers located in a host memory of a first host processor of a first processing node in the data center of Figure 1.
  • Figure 2B shows host sender buffers and host receiver buffers located in a host memory of a second host processor of a second processing node in the data center of Figure 1.
  • Figure 3A shows interface sender buffers and interface receiver buffers located at a first Network Interface Controller operatively coupled to the first processing node.
  • Figure 3B shows interface sender buffers and interface receiver buffers located at a second Network Interface Controller operatively coupled to the second processing node.
  • Figure 4A shows reconfigurable processor (RP) sender buffers and reconfigurable processor receiver buffers located in a processor memory of a first reconfigurable processor operatively coupled to the first processing node.
  • Figure 4B shows reconfigurable processor sender buffers and reconfigurable processor receiver buffers located in a processor memory of a second reconfigurable processor operatively coupled to the second processing node.
  • Figure 5A is a heuristics diagram of a runtime logic running at the first host processor.
  • Figure 5B is a heuristics diagram of a runtime logic running at the second host processor.
  • Figure 6 is a message sequence chart illustrating one implementation of a debugging logic running at the first host processor and detecting errors in execution of configuration files on one or more of the reconfigurable processors operatively coupled to the first processing node.
  • Figure 7 is a message sequence chart illustrating one implementation of the debugging logic of Figure 6 detecting errors in the execution of configuration files on one or more of the reconfigurable processors operatively coupled to the second processing node.
  • Figure 8 is a message sequence chart illustrating one implementation of one or more of the reconfigurable processors operatively coupled to the first processing node issuing remote procedure calls to the first host processor.
  • Figure 9 is a message sequence chart illustrating one implementation of one or more of the reconfigurable processors operatively coupled to the second processing node issuing remote procedure calls to the first host processor.
  • Figure 10 is a message sequence chart illustrating one implementation of a testing logic running at the first host processor and determining and reporting test statistics for execution of test configuration files on one or more of the reconfigurable processors operatively coupled to the first processing node.
  • Figure 11 is a message sequence chart illustrating one implementation of the testing logic of Figure 10 determining and reporting test statistics for execution of test configuration files on one or more of the reconfigurable processors operatively coupled to the second processing node.
  • Figure 12 is a message sequence chart illustrating one implementation of executing a first set of functions in configuration files on one or more of the reconfigurable processors operatively coupled to the first processing node and executing a second set of functions in the configuration files on the first host processor.
  • Figure 13 is a message sequence chart illustrating one implementation of executing a first set of functions in configuration files on one or more of the reconfigurable processors operatively coupled to the first processing node and executing a second set of functions in the configuration files on the second host processor.
  • Figure 14A shows sender and receiver buffers used by individual reconfigurable processors in the reconfigurable processors operatively coupled to the first processing node for data streaming.
  • Figure 14B shows sender and receiver buffers used by individual reconfigurable processors in the reconfigurable processors operatively coupled to the second processing node for data streaming.
  • Figure 15 is a message sequence chart illustrating one implementation of executing a first set of functions in configuration files on a first reconfigurable processor operatively coupled to the first processing node and executing a second set of functions in the configuration files on a second reconfigurable processor operatively coupled to the first processing node.
  • Figure 16 is a message sequence chart illustrating one implementation of executing a first set of functions in configuration files on a first reconfigurable processor operatively coupled to the first processing node and executing a second set of functions in the configuration files on a first reconfigurable processor operatively coupled to the second processing node.
  • Figure 17A is a message sequence chart illustrating one implementation of asynchronous tensor streaming in which a next tensor is buffered while a reconfigurable processor is processing a current tensor.
  • Figure 17B is a message sequence chart illustrating one implementation of asynchronous tensor streaming in which a next tensor is buffered before a reconfigurable processor processes a current tensor.
  • Figure 17C is a message sequence chart illustrating one implementation of asynchronous tensor streaming in which a next tensor is buffered after a reconfigurable processor has processed a current tensor.
  • Figure 18 is a message sequence chart illustrating one implementation of executing configuration files on reconfigurable processors that are on different processing nodes in the data center.
  • Figure 19 shows one implementation of memory mapping and allocating virtual buffers to physical buffers located in memories of different network components in the data center.
  • Figure 20 shows an architectural level schematic of one implementation of the data center in which the processing nodes of the data center do not include host processors.
  • Figure 21 is a message sequence chart illustrating one implementation of buffer-based inter-node streaming of configuration data over the network fabric.
  • Figure 22 is a message sequence chart illustrating another implementation of bufferbased inter-node streaming of configuration data over the network fabric.
  • Figure 23 illustrates one implementation of executing a model/application in parallel using the disclosed buffer-based inter-node streaming of configuration data over the network fabric 136. This is referred to herein as “model parallelism.”
  • Figure 24 illustrates one implementation of executing multiple instances of a model/application in parallel using the disclosed buffer-based inter-node streaming of configuration data over the network fabric 136. This is referred to herein as “data parallelism.”
  • Figure 25 illustrates one implementation of executing configuration files on heterogeneous reconfigurable processors.
  • Figure 26 illustrates one implementation of executing configuration files using NIC or SmartNIC devices that are embedded on the reconfigurable processors.
  • Figure 27 is a system diagram illustrating a system including a host, a memory, and an example reconfigurable data processor on which the technology disclosed can be applied.
  • Figure 28 is a simplified block diagram of a top-level network and components of a CGRA (Coarse-Grained Reconfigurable Architecture).
  • Figure 29 is a simplified diagram of a tile and an array level network usable in the configuration of Figure 27, where the configurable units are nodes on the array level network and are configurable to implement a Look-Up Table with input offsetting.
  • Figure 29B illustrates an example switch unit connecting elements in an array level network.
  • Figure 30 is a block diagram illustrating an example configurable unit, such as a Pattern Compute Unit (PCU).
  • Figure 31 is a block diagram illustrating an example configurable unit, such as a Pattern Memory Unit (PMU).
  • Figure 32 illustrates an example processing node which includes a host, eight reconfigurable processors, and a SmartNIC.
  • Figure 33 is a block diagram of a SmartNIC that can be used in the processing nodes of Figure 32 or Figure 48.
  • Figure 34 depicts conceptual examples of specific common parallel patterns of Map, FlatMap, Fold and HashReduce.
  • Figure 35 illustrates a section from an example processing graph.
  • Figure 36 illustrates an overall structure of an example compiler that can be used to generate dataflow graphs from high-level programs of applications.
  • Figure 37 illustrates one example organization of a configuration file.
  • Figure 38 illustrates a simple deep learning application implemented with data parallelism across multiple reconfigurable processors in a compute node.
  • Figure 39 illustrates the temporal progress resulting from a conventional implementation of data parallelism.
  • Figures 40 and 41 illustrate the temporal progress resulting from improved implementations using an FPGA on the SmartNIC.
  • Figure 42 illustrates an example data center incorporating multiple processing nodes.
  • Figure 43 illustrates a stochastic gradient descent deep learning application implemented with data parallelism across multiple processing nodes.
  • Figure 44 illustrates a dataflow graph fragment that is configured into each of the reconfigurable processors in Figure 42.
  • Figure 45 illustrates a dataflow graph fragment that is configured into the SmartNICs of each of the processing nodes in the system of Figure 42.
  • Figures 46A and 46B (collectively Figure 46) show a detail of a uni-directional ring all-reduce collective for each of the steps 4514, 4518, and 4522 in Figure 45.
  • Figures 47A, 47B, 47C, 47D, 47E, 47F, 47G, and 47H illustrate the uni-directional ring all-reduce algorithm of Figure 46.
  • Figure 48 illustrates another example processing node which includes a host, eight reconfigurable processors, and a SmartNIC for each of the reconfigurable processors.
  • FIG. 1 shows an architectural level schematic of a data center 100 in accordance with an implementation. Because Figure 1 is an architectural diagram, certain details of the data center 100 are intentionally omitted to improve the clarity of the description. It may be noted that data center 100 can include the same, more, or fewer elements configured in the same or different manner in other implementations.
  • the discussion of Figure 1 will be organized as follows. First, the elements of the figure will be described, followed by their interconnections. Then, the use of the elements in the system will be described in greater detail.
  • Figure 1 shows first and second processing nodes in the data center 100.
  • the first processing node is identified as “processing node 1,” and the second processing node is identified as “processing node n.”
  • the first and second processing nodes are configured to collaboratively execute configuration files for applications in a distributed fashion.
  • the data center 100 can have any number of processing nodes operatively coupled for data communications through a network 136 (also called herein “network fabric 136”).
  • Examples of the network 136 include a Storage Area Network (SAN) and a Local Area Network (LAN).
  • the SAN can be implemented with a variety of data communications fabrics, devices, and protocols.
  • the fabrics for the SAN can include Fibre Channel, Ethernet, InfiniBand, Serial Attached Small Computer System Interface (‘SAS’), or the like.
  • Data communication protocols for use with the SAN can include Advanced Technology Attachment (‘ATA’), Fibre Channel Protocol, Small Computer System Interface (‘SCSI’), Internet Small Computer System Interface (‘iSCSI’), HyperSCSI, Non-Volatile Memory Express (‘NVMe’) over Fabrics, or the like.
  • the LAN can also be implemented with a variety of fabrics, devices, and protocols.
  • the fabrics for the LAN can include Ethernet (802.3), wireless (802.11), or the like.
  • Data communication protocols for use in the LAN can include Transmission Control Protocol (‘TCP’), User Datagram Protocol (‘UDP’), Internet Protocol (IP), Hypertext Transfer Protocol (‘HTTP’), Wireless Access Protocol (‘WAP’), Handheld Device Transport Protocol (‘HDTP’), Session Initiation Protocol (‘SIP’), Real-time Transport Protocol (‘RTP’), or the like.
  • the network 136 also connects other network components in the data center 100.
  • Examples of other network components include buses, switches, routers, load balancers, hypervisors, and Application Programming Interfaces (APIs).
  • the switches for example, can receive packets via a plurality of input ports and can transmit packets via a plurality of output ports.
  • the processing nodes in the data center 100 can communicate with each other through the network 136 using a variety of networking paths established by the switches.
  • Another example of the network 136 is a Wide Area Network (WAN).
  • a processing node is an addressable application running on a hardware device or virtual device that attaches to a network, and is capable of sending, receiving, or forwarding information over a communications channel to or from other processing nodes.
  • Examples of electronic devices which can be deployed as hardware processing nodes include all varieties of computers, workstations, laptop computers, handheld computers, and smartphones.
  • Processing nodes can be implemented in a cloud-based server system. More than one virtual device configured as a processing node can be implemented using a single physical device.
  • the data center 100 comprises a pool of reconfigurable dataflow resources.
  • the pool of reconfigurable dataflow resources can have a variety of compute scales and hierarchies.
  • the pool of reconfigurable dataflow resources can be a single processing node operatively coupled to a plurality of reconfigurable processors, which in turn is supported by different bus and memory resources.
  • the processing node can have a host processor (e.g., a CPU) that exchanges data with the reconfigurable processors, for example, over a local bus such as a Peripheral Component Interconnect Express (PCIe) interface or another interconnect fabric.
  • the host processor can have a runtime processor (or a runtime logic) that manages resource allocation, memory mapping, and execution of configuration files for applications requesting execution from the host processor.
  • PCIe is described in formal PCI Express specifications available from PCI-SIG Administration, Beaverton, OR, all of which are incorporated herein by reference to the extent they are available at the filing date of this patent application.
  • “PCIe bus” and “PCIe fabric” refer to a bus or fabric that satisfies the requirements of Revision 1.0 of the PCI Express specification or any subsequent revision thereof.
  • PCIe is described also for example in Jackson and Budruk, PCI Express Technology 3.0, available from MindShare, Inc., Cedar Park, TX, also incorporated by reference herein.
  • “PCIe bus” and “PCIe fabric” are used interchangeably herein.
  • the pool of reconfigurable dataflow resources can be a rack (or cluster) of processing nodes connected through the network 136.
  • Each processing node in the rack can run a respective plurality of reconfigurable processors and include a respective host processor configured with a respective runtime processor.
  • the runtime processors, distributed across the processing nodes, communicate with each other to provide unified access to reconfigurable processors attached not only to their own processing node but also to reconfigurable processors attached to every other processing node in the data center 100.
  • the pool of reconfigurable dataflow resources can be a pod that comprises a plurality of racks connected through the network 136.
  • the pool of reconfigurable dataflow resources can be a superpod that comprises a plurality of pods connected through the network 136.
  • the pool of reconfigurable dataflow resources can be a zone that comprises a plurality of superpods connected through the network 136.
  • the pool of reconfigurable dataflow resources can be the data center 100 that comprises a plurality of zones connected through the network 136.
  • the pool of reconfigurable dataflow resources can include bus (or transfer) resources.
  • bus resources include PCIe channels, Direct Memory Access (DMA) channels, and Double Data Rate (DDR) channels.
  • the pool of reconfigurable dataflow resources can include memory (or storage) resources. Examples of the memory resources include main memory (e.g., off-chip/external Dynamic Random Access Memory (DRAM), NAND flash), local secondary storage (e.g., local disks (e.g., HDD, SSD)), and remote secondary storage (e.g., distributed file systems, web servers). Other examples of the memory resources include latches, registers, flops, bypass networks, and caches (e.g., ones explicitly addressed by RAMs/DRAMs/SRAMs).
  • the pool of reconfigurable dataflow resources is dynamically scalable to meet the performance requirements of applications requesting execution. The applications access the pool of reconfigurable dataflow resources over one or more networks (e.g., the Internet).
  • the first processing node comprises a first host processor 102a. Examples of the first host processor 102a include x86 and x64 processors.
  • the first host processor 102a interfaces with a host memory 134a (e.g., RAM).
  • the first host processor 102a has a compiler 112a to compile applications and a runtime logic 122a to execute the compiled applications on a plurality of reconfigurable processors 142a.
  • the runtime logic 122a is configured to provide on-demand access to the pool of reconfigurable dataflow resources, which can be rapidly provisioned and released with minimal management effort or service provider interaction.
  • Examples of the reconfigurable processors 142a include Field Programmable Gate Arrays (FPGAs), Coarse-Grained Reconfigurable Architectures (CGRAs), Application-Specific Integrated Circuits (ASICs), and Application Specific Instruction-set Processors (ASIPs).
  • the reconfigurable processors 142a interface with a reconfigurable processor memory 162a (e.g., DRAM).
  • Each of the reconfigurable processors 142a includes an array of configurable units (e.g., compute units and memory units) in a programmable interconnect fabric.
  • the array of configurable units in a reconfigurable processor is partitionable into a plurality of subarrays (or tiles) of configurable units.
  • the processing nodes in the data center 100 include processors instead of, or in addition to, the reconfigurable processors 142a.
  • processors include Graphics Processing Units (GPUs) and Digital Signal Processors (DSPs).
  • a Network Interface Controller 132a (e.g., NIC, SmartNIC) connects the first host processor 102a and the reconfigurable processors 142a to the network 136.
  • a bus switch 124a uses local buses 125a, 126a, and 127a to operatively couple the first host processor 102a, the reconfigurable processors 142a, and the Network Interface Controller 132a. Examples of the local buses 125a, 126a, and 127a include Peripheral Component Interconnect Express (PCIe), Cache Coherent Interconnect for Accelerators (CCIX), Compute Express Link (CXL), and Open Coherent Accelerator Processor Interface (OpenCAPI).
  • the second processing node comprises a second host processor 102n.
  • Examples of the second host processor 102n include x86 and x64 processors.
  • the second host processor 102n interfaces with a host memory 134n (e.g., RAM).
  • the second host processor 102n has a compiler 112n to compile applications and a runtime logic 122n to execute the compiled applications on a plurality of reconfigurable processors 142n.
  • the runtime logic 122n is configured to provide on-demand access to the pool of reconfigurable dataflow resources, which can be rapidly provisioned and released with minimal management effort or service provider interaction.
  • Examples of the reconfigurable processors 142n include Field Programmable Gate Arrays (FPGAs), Coarse-Grained Reconfigurable Architectures (CGRAs), Application-Specific Integrated Circuits (ASICs), and Application Specific Instruction-set Processors (ASIPs).
  • the reconfigurable processors 142n interface with a reconfigurable processor memory 162n (e.g., DRAM).
  • Each of the reconfigurable processors 142n includes an array of configurable units (e.g., compute units and memory units) in a programmable interconnect fabric.
  • the array of configurable units in a reconfigurable processor is partitionable into a plurality of subarrays (or tiles) of configurable units.
  • the processing nodes in the data center 100 include processors instead of, or in addition to, the reconfigurable processors 142n.
  • processors include Graphics Processing Units (GPUs) and Digital Signal Processors (DSPs).
  • a Network Interface Controller 132n (e.g., NIC, SmartNIC) connects the second host processor 102n and the reconfigurable processors 142n to the network 136.
  • a bus switch 124n uses local buses 125n, 126n, and 127n to operatively couple the second host processor 102n, the reconfigurable processors 142n, and the Network Interface Controller 132n. Examples of the local buses 125n, 126n, and 127n include Peripheral Component Interconnect Express (PCIe), Cache Coherent Interconnect for Accelerators (CCIX), Compute Express Link (CXL), and Open Coherent Accelerator Processor Interface (OpenCAPI).
  • FIG. 2A shows host sender buffers 212a and host receiver buffers 202a located in the host memory 134a.
  • the host sender buffers 212a are reconfigurable processors-to-host processor buffers that are configured to receive data from the reconfigurable processors 142a and provide the data to the first host processor 102a.
  • the host receiver buffers 202a are host processor-to-reconfigurable processors buffers that are configured to receive data from the first host processor 102a and provide the data to the reconfigurable processors 142a. Examples of the data include scalar data (e.g., control bits) and vector data (e.g., vectors, tensors, arguments, commands).
  • the host memory 134a and therefore the host sender buffers 212a and the host receiver buffers 202a, are accessible to each of the host processors (e.g., first and second host processors 102a, 102n), each of the reconfigurable processors (e.g., reconfigurable processors 142a, 142n), and each of the Network Interface Controllers (e.g., Network Interface Controllers 132a, 132n) in the data center 100.
  • the host sender buffers 212a and the host receiver buffers 202a can be First-In, First-Out (FIFO) buffers, First-In, Last-Out (FILO) buffers, Last-In, First-Out (LIFO) buffers, Last-In, Last-Out (LILO) buffers, or circular buffers.
  • the host sender buffers 212a and the host receiver buffers 202a can be of size 8 bytes, 16 bytes, 32 bytes, 64 bytes, 128 bytes, 256 bytes, and so on, or any convenient size appropriate for the transfer of data between the host processor, the network interface controllers, and the reconfigurable processors.
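  • As an illustrative sketch (not the disclosed implementation), one of the circular buffers mentioned above can be modeled as follows; the RingBuffer class and its eight-entry capacity are hypothetical choices for the example.

```python
# Illustrative sketch only: a fixed-capacity circular (ring) buffer of the kind that could back
# a sender or receiver buffer, with explicit read and write pointers.
class RingBuffer:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.read = 0    # read pointer
        self.write = 0   # write pointer

    def put(self, item):
        if (self.write + 1) % self.capacity == self.read:
            raise BufferError("buffer full")   # the sender must wait (back-pressure)
        self.slots[self.write] = item
        self.write = (self.write + 1) % self.capacity

    def get(self):
        if self.read == self.write:
            raise BufferError("buffer empty")  # the receiver must wait
        item = self.slots[self.read]
        self.read = (self.read + 1) % self.capacity
        return item

buf = RingBuffer(capacity=8)
for tensor_id in range(3):
    buf.put({"tensor": tensor_id})             # e.g., vector data written by a producer
print([buf.get() for _ in range(3)])           # drained in FIFO order by the consumer
```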
  • FIG. 2B shows host sender buffers 212n and host receiver buffers 202n located in the host memory 134n.
  • the host sender buffers 212n are reconfigurable processors-to-host processor buffers that are configured to receive data from the reconfigurable processors 142n and provide the data to the second host processor 102n.
  • the host receiver buffers 202n are host processor-to-reconfigurable processors buffers that are configured to receive data from the second host processor 102n and provide the data to the reconfigurable processors 142n. Examples of the data include scalar data (e.g., control bits) and vector data (e.g., vectors, tensors, arguments, commands).
  • the host memory 134n and therefore the host sender buffers 212n and the host receiver buffers 202n, are accessible to each of the host processors (e.g., first and second host processors 102a, 102n), each of the reconfigurable processors (e.g., reconfigurable processors 142a, 142n), and each of the Network Interface Controllers (e.g., Network Interface Controllers 132a, 132n) in the data center 100.
  • the host sender buffers 212n and the host receiver buffers 202n can be First-In, First-Out (FIFO) buffers, First-In, Last-Out (FILO) buffers, Last-In, First-Out (LIFO) buffers, Last-In, Last-Out (LILO) buffers, or circular buffers.
  • the host sender buffers 212n and the host receiver buffers 202n can be of size 8 bytes, 16 bytes, 32 bytes, 64 bytes, 128 bytes, 256 bytes, and so on, or any convenient size appropriate for the transfer of data between the host processor, the network interface controllers, and the reconfigurable processors.
  • Figure 3A shows interface sender buffers 312a and interface receiver buffers 302a located at the Network Interface Controller 132a.
  • the interface sender buffers 312a are reconfigurable processors-to-host processor buffers that are configured to receive data from the reconfigurable processors 142a and provide the data to the first host processor 102a.
  • the interface receiver buffers 302a are host processor-to-reconfigurable processors buffers that are configured to receive data from the first host processor 102a and provide the data to the reconfigurable processors 142a. Examples of the data include scalar data (e.g., control bits) and vector data (e.g., vectors, tensors, arguments, commands).
  • the Network Interface Controller 132a and therefore the interface sender buffers 312a and the interface receiver buffers 302a, are accessible to each of the host processors (e.g., first and second host processors 102a, 102n), each of the reconfigurable processors (e.g., reconfigurable processors 142a, 142n), and each of the Network Interface Controllers (e.g., Network Interface Controllers 132a, 132n) in the data center 100.
  • the interface sender buffers 312a and the interface receiver buffers 302a can be First-In, First-Out (FIFO) buffers, First-In, Last-Out (FILO) buffers, Last-In, First-Out (LIFO) buffers, Last-In, Last-Out (LILO) buffers, or circular buffers.
  • the interface sender buffers 312a and the interface receiver buffers 302a can be of size 8 bytes, 16 bytes, 32 bytes, 64 bytes, 128 bytes, 256 bytes, and so on, or any convenient size appropriate for the transfer of data between the host processor, the network interface controllers, and the reconfigurable processors.
  • FIG. 3B shows interface sender buffers 312n and interface receiver buffers 302n located at the Network Interface Controller 132n.
  • the interface sender buffers 312n are reconfigurable processors-to-host processor buffers that are configured to receive data from the reconfigurable processors 142n and provide the data to the second host processor 102n.
  • the interface receiver buffers 302n are host processor-to-reconfigurable processors buffers that are configured to receive data from the second host processor 102n and provide the data to the reconfigurable processors 142n. Examples of the data include scalar data (e.g., control bits) and vector data (e.g., vectors, tensors, arguments, commands).
  • the Network Interface Controller 132n and therefore the interface sender buffers 312n and the interface receiver buffers 302n, are accessible to each of the host processors (e.g., first and second host processors 102a, 102n), each of the reconfigurable processors (e.g., reconfigurable processors 142a, 142n), and each of the Network Interface Controllers (e.g., Network Interface Controllers 132a, 132n) in the data center 100.
  • the interface sender buffers 312n and the interface receiver buffers 302n can be First-In, First-Out (FIFO) buffers, First-In, Last-Out (FILO) buffers, Last-In, First-Out (LIFO) buffers, Last-In, Last-Out (LILO) buffers, or circular buffers.
  • the interface sender buffers 312n and the interface receiver buffers 302n can be of size 8 bytes, 16 bytes, 32 bytes, 64 bytes, 128 bytes, 256 bytes, and so on, or any convenient size appropriate for the transfer of data between the host processor, the network interface controllers, and the reconfigurable processors.
  • FIG. 4A shows reconfigurable processor (RP) sender buffers 412a and reconfigurable processor (RP) receiver buffers 402a located in the reconfigurable processor memory 162a of the reconfigurable processors 142a.
  • the reconfigurable processor sender buffers 412a are reconfigurable processors-to-host processor buffers that are configured to receive data from the reconfigurable processors 142a and provide the data to the first host processor 102a.
  • the reconfigurable processor receiver buffers 402a are host processor-to- reconfigurable processors buffers that are configured to receive data from the first host processor 102a and provide the data to the reconfigurable processors 142a.
  • the reconfigurable processor memory 162a and therefore the reconfigurable processor sender buffers 412a and the reconfigurable processor receiver buffers 402a, are accessible to each of the host processors (e.g., first and second host processors 102a, 102n), each of the reconfigurable processors (e.g., reconfigurable processors 142a, 142n), and each of the Network Interface Controllers (e.g., Network Interface Controllers 132a, 132n) in the data center 100.
  • the reconfigurable processor sender buffers 412a and the reconfigurable processor receiver buffers 402a can be First-In, First-Out (FIFO) buffers, First-In, Last-Out (FILO) buffers, Last-In, First-Out (LIFO) buffers, Last-In, Last-Out (LILO) buffers, or circular buffers.
  • the reconfigurable processor sender buffers 412a and the reconfigurable processor receiver buffers 402a can be of size 8 bytes, 16 bytes, 32 bytes, 64 bytes, 128 bytes, 256 bytes, and so on, or any convenient size appropriate for the transfer of data between the host processor, the network interface controllers, and the reconfigurable processors.
  • FIG. 4B shows reconfigurable processor (RP) sender buffers 412n and reconfigurable processor (RP) receiver buffers 402n located in the reconfigurable processor memory 162n of the reconfigurable processors 142n.
  • the reconfigurable processor sender buffers 412n are reconfigurable processors-to-host processor buffers that are configured to receive data from the reconfigurable processors 142n and provide the data to the second host processor 102n.
  • the reconfigurable processor receiver buffers 402n are host processor-to- reconfigurable processors buffers that are configured to receive data from the second host processor 102n and provide the data to the reconfigurable processors 142n.
  • the reconfigurable processor memory 162n and therefore the reconfigurable processor sender buffers 412n and the reconfigurable processor receiver buffers 402n, are accessible to each of the host processors (e.g., first and second host processors 102a, 102n), each of the reconfigurable processors (e.g., reconfigurable processors 142a, 142n), and each of the Network Interface Controllers (e.g., Network Interface Controllers 132a, 132n) in the data center 100.
  • the reconfigurable processor sender buffers 412n and the reconfigurable processor receiver buffers 402n can be First-In, First-Out (FIFO) buffers, First-In, Last-Out (FILO) buffers, Last-In, First-Out (LIFO) buffers, Last-In, Last-Out (LILO) buffers, or circular buffers.
  • the reconfigurable processor sender buffers 412n and the reconfigurable processor receiver buffers 402n can be of size 8 bytes, 16 bytes, 32 bytes, 64 bytes, 128 bytes, 256 bytes, and so on, or any convenient size appropriate for the transfer of data between the host processor, the network interface controllers, and the reconfigurable processors.
  • the buffers can be defined by a virtual address space that maps to a physical range of memory addresses (which may be contiguous or discontiguous) in the memory.
  • the virtual buffers are read from and written to at locations in the memory indicated using a read pointer and write pointer, respectively.
  • the pointers are held in a memory (which may be the same as, or separate from, the memory holding the virtual buffer).
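  • A minimal sketch of such a virtual buffer follows; it is illustrative only, and the page size, page table, and physical addresses are invented. Contiguous virtual offsets map onto possibly discontiguous physical pages, and the read and write pointers are held separately from the buffer storage itself.

```python
# Illustrative sketch only: a "virtual buffer" whose contiguous virtual offsets map onto
# possibly discontiguous physical pages, with read/write pointers held outside the buffer.
PAGE_SIZE = 4  # invented page size for the example

class VirtualBuffer:
    def __init__(self, physical_memory, page_table):
        self.mem = physical_memory   # flat "physical" memory (a bytearray here)
        self.pages = page_table      # virtual page index -> physical page base address

    def _phys(self, virt_offset):
        page, offset = divmod(virt_offset, PAGE_SIZE)
        return self.pages[page] + offset

    def write(self, pointers, value):
        self.mem[self._phys(pointers["write"])] = value
        pointers["write"] += 1       # the pointers live in separate (here, dict) storage

    def read(self, pointers):
        value = self.mem[self._phys(pointers["read"])]
        pointers["read"] += 1
        return value

mem = bytearray(64)
vbuf = VirtualBuffer(mem, page_table={0: 40, 1: 8})  # discontiguous physical pages
ptrs = {"read": 0, "write": 0}
for v in (7, 9, 11, 13, 42):                         # the fifth write crosses into page 1
    vbuf.write(ptrs, v)
print([vbuf.read(ptrs) for _ in range(5)])           # [7, 9, 11, 13, 42]
```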
  • Figure 5A is a heuristics diagram of the runtime logic 122a.
  • the runtime logic 122a comprises debugging logic 502a and testing logic 512a.
  • the runtime logic 122a is configured to load and execute one or more configuration files for applications on one or more of the reconfigurable processors 142a.
  • the reconfigurable processors 142a are configured to process the configuration files and generate outputs, and to send the outputs to the first host processor 102a using at least one of the reconfigurable processors-to-host processor buffers (e.g., host sender buffers 212a, host sender buffers 212n, interface sender buffers 312a, interface sender buffers 312n, reconfigurable processor sender buffers 412a, reconfigurable processor sender buffers 412n).
  • the debugging logic 502a, running on the first host processor 102a, is configured to detect errors (e.g., in the execution of the configuration files). In one implementation, the debugging logic 502a is further configured to report the errors to a debugging console on the first host processor 102a based on comparison of the outputs to expected outputs. In another implementation, the debugging logic 502a is further configured to report the errors to a debug output file on the first host processor 102a based on the comparison of the outputs to the expected outputs.
  • debugging logic running on a particular host processor or reconfigurable processor in the data center 100 can report errors to any other host processor or reconfigurable processor in the data center 100.
  • the debugging logic 502a, running on the first host processor 102a, can report errors to a debugging console on the second host processor 102n based on comparison of outputs to expected outputs.
  • the debugging logic 502a can report errors to a debug output file on the second host processor 102n based on comparison of the outputs to the expected outputs.
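  • The debugging behavior described above can be sketched as follows. The sketch is illustrative only; the function name, the error record format, and the file handling are hypothetical. Outputs drained from a sender buffer are compared against expected outputs, and any mismatches are reported either to a console or to a debug output file.

```python
# Illustrative sketch only: debugging logic that compares outputs received from a
# reconfigurable-processor sender buffer against expected outputs and reports errors.
def detect_and_report_errors(outputs, expected_outputs, debug_file=None):
    errors = [
        {"index": i, "got": got, "expected": exp}
        for i, (got, exp) in enumerate(zip(outputs, expected_outputs))
        if got != exp
    ]
    for err in errors:
        line = f"mismatch at output {err['index']}: got {err['got']}, expected {err['expected']}"
        if debug_file is None:
            print(line)                        # report to a debugging console
        else:
            with open(debug_file, "a") as f:   # report to a debug output file
                f.write(line + "\n")
    return errors

detect_and_report_errors(outputs=[1, 2, 99], expected_outputs=[1, 2, 3])
```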
  • the runtime logic 122a is further configured to execute, on the reconfigurable processors 142a, one or more test configuration files for test applications.
  • the reconfigurable processors 142a are further configured to process the test configuration files and generate test outputs, and to send the test outputs to the first host processor 102a using at least one of the reconfigurable processors-to-host processor buffers (e.g., host sender buffers 212a, host sender buffers 212n, interface sender buffers 312a, interface sender buffers 312n, reconfigurable processor sender buffers 412a, reconfigurable processor sender buffers 412n).
  • the testing logic 512a, running on the first host processor 102a, is configured to determine test statistics based on the test outputs, and to report the test statistics to a test output file on the first host processor 102a.
  • testing logic running on a particular host processor or reconfigurable processor in the data center 100 can report test statistics to a test output file on any other host processor or reconfigurable processor in the data center 100.
  • the testing logic 512a, running on the first host processor 102a, can report test statistics to a test output file on the second host processor 102n.
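  • A corresponding sketch of the testing logic is shown below; it is illustrative only, and the statistics chosen (pass count, fail count, pass rate) and the JSON report format are examples rather than the application's definition of test statistics.

```python
# Illustrative sketch only: testing logic that derives simple statistics from test outputs
# and writes them to a test output file on the reporting host.
import json

def report_test_statistics(test_outputs, expected_outputs, report_path="test_stats.json"):
    passed = sum(1 for got, exp in zip(test_outputs, expected_outputs) if got == exp)
    failed = len(expected_outputs) - passed
    stats = {"passed": passed,
             "failed": failed,
             "pass_rate": passed / max(len(expected_outputs), 1)}
    with open(report_path, "w") as f:   # the test output file
        json.dump(stats, f, indent=2)
    return stats

print(report_test_statistics(test_outputs=[1, 2, 3, 5], expected_outputs=[1, 2, 3, 4]))
```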
  • Figure 5B is a heuristics diagram of the runtime logic 122n.
  • the runtime logic 122n comprises debugging logic 502n and testing logic 512n.
  • the runtime logic 122n is configured to load and execute one or more configuration files for applications on one or more of the reconfigurable processors 142n.
  • the reconfigurable processors 142n are configured to process the configuration files and generate outputs, and to send the outputs to the second host processor 102n using at least one of the reconfigurable processors-to-host processor buffers (e.g., host sender buffers 212a, host sender buffers 212n, interface sender buffers 312a, interface sender buffers 312n, reconfigurable processor sender buffers 412a, reconfigurable processor sender buffers 412n).
  • the debugging logic 502n, running on the second host processor 102n, is configured to detect errors (e.g., in execution of the configuration files). In one implementation, the debugging logic 502n is further configured to report errors to a debugging console on the second host processor 102n based on comparison of the outputs to expected outputs. In another implementation, the debugging logic 502n is further configured to report the errors to a debug output file on the second host processor 102n based on the comparison of the outputs to the expected outputs.
  • debugging logic running on a particular host processor or reconfigurable processor in the data center 100 can report errors to any other host processor or reconfigurable processor in the data center 100.
  • the debugging logic 502n, running on the second host processor 102n, can report errors to a debugging console on the first host processor 102a based on comparison of outputs to expected outputs.
  • the debugging logic 502n can report errors to a debug output file on the first host processor 102a based on comparison of outputs to expected outputs.
  • testing logic running on a particular host processor or reconfigurable processor in the data center 100 can report test statistics to a test output file on any other host processor or reconfigurable processor in the data center 100.
  • the testing logic 512n, running on the second host processor 102n, can report test statistics to a test output file on the first host processor 102a.
  • Figure 6 is a message sequence chart 600 illustrating one implementation of the debugging logic 502a detecting errors in execution of configuration files on one or more of the reconfigurable processors (RP) 142a.
  • the compiler 112a compiles an application 602 to generate a graph that includes one or more configuration files for the application 602.
  • the compiler 112a sends the graph to the runtime logic 122a for execution.
  • the runtime logic 122a loads the configuration files on one or more of the reconfigurable processors 142a.
  • the runtime logic 122a triggers execution of the bit file by sending an ‘execution’ (or start) command to the reconfigurable processors by writing to a particular CSR (control and status register) that exists for this purpose.
  • the reconfigurable processors 142a process the configuration files and generate outputs (e.g., vectors, tensors).
  • the reconfigurable processors 142a send the outputs to sender buffers 632 (or reconfigurable processors-to-host processor buffers).
  • Examples of the sender buffers 632 include host sender buffers 212a, host sender buffers 212n, interface sender buffers 312a, interface sender buffers 312n, reconfigurable processor sender buffers 412a, and reconfigurable processor sender buffers 412n.
  • the sender buffers 632 provide the outputs to the debugging logic 502a.
  • the debugging logic 502a detects errors in the execution of the configuration files based on comparison of the outputs to expected outputs.
  • the debugging logic 502a reports the errors to a debugging console or a debug output file on the first host processor 102a.
  • Other implementations may perform the operations in different orders and/or with different, fewer, or additional operations than the ones illustrated in Figure 6. Multiple operations can be combined in some implementations.
  • operations three and six comprise streaming network packets between reconfigurable processors (e.g., RPs 142a) and a host processor (e.g., host 102a) on a same processing node 1 over local buses (e.g., PCIe buses) using a protocol like Transmission Control Protocol (TCP).
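  • The ‘execution’ (or start) command at operation two of Figure 6 is issued by writing to a CSR. A minimal memory-mapped register write could look like the Python sketch below; the PCIe resource path, register offset, and command encoding are hypothetical placeholders, not values taken from this disclosure:

    import mmap, os, struct

    # Hypothetical values: the actual BAR resource, register offset, and start
    # command encoding are device-specific and are not specified here.
    RP_BAR_RESOURCE = "/sys/bus/pci/devices/0000:3b:00.0/resource0"
    EXEC_CSR_OFFSET = 0x0040   # offset of the 'execution' (start) CSR
    EXEC_START_CMD = 0x1       # value that launches the loaded bit file

    def trigger_execution():
        """Send the start command by writing the CSR through a memory mapping."""
        fd = os.open(RP_BAR_RESOURCE, os.O_RDWR | os.O_SYNC)
        try:
            with mmap.mmap(fd, mmap.PAGESIZE, mmap.MAP_SHARED,
                           mmap.PROT_READ | mmap.PROT_WRITE) as regs:
                regs[EXEC_CSR_OFFSET:EXEC_CSR_OFFSET + 4] = struct.pack("<I", EXEC_START_CMD)
        finally:
            os.close(fd)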
  • Figure 7 is a message sequence chart 700 illustrating one implementation of the debugging logic 502a detecting errors in execution of configuration files on one or more of the reconfigurable processors (RP) 142n.
  • the compiler 112a compiles an application 702 to generate a graph that includes one or more configuration files for the application 702.
  • the compiler 112a sends the graph to the runtime logic 122a for execution.
  • the runtime logic 122a loads the configuration files on one or more of the reconfigurable processors 142n. Once the configuration files have been loaded, the runtime logic 122a triggers execution of the bit file by sending an ‘execution’ (or start) command to the reconfigurable processors by writing to a particular CSR that exists for this purpose.
  • the reconfigurable processors 142n process the configuration files and generate outputs (e.g., vectors, tensors).
  • the reconfigurable processors 142n send the outputs to sender buffers 732 (or reconfigurable processors-to-host processor buffers).
  • Examples of the sender buffers 732 include host sender buffers 212a, host sender buffers 212n, interface sender buffers 312a, interface sender buffers 312n, reconfigurable processor sender buffers 412a, and reconfigurable processor sender buffers 412n.
  • the sender buffers 732 provide the outputs to the debugging logic 502a.
  • the debugging logic 502a detects errors in the execution of the configuration files based on comparison of the outputs to expected outputs. At operation eight, the debugging logic 502a reports the errors to a debugging console or a debug output file on the first host processor 102a.
  • Other implementations may perform the operations in different orders and/or with different, fewer, or additional operations than the ones illustrated in Figure 7. Multiple operations can be combined in some implementations.
  • operations three and six comprise streaming network packets between one or more reconfigurable processors (e.g., RPs 142n) on the second processing node and a host processor (e.g., host 102a) on the first processing node over the network fabric 136 (e.g., Ethernet, InfiniBand (IB)) using protocols like RDMA over Converged Ethernet (RoCE), TCP, User Datagram Protocol (UDP), and Quick UDP Internet Connections (QUIC).
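  • As a rough software model of operations three and six (streaming outputs as network packets over the network fabric 136), the sketch below frames a tensor over a TCP connection; the node address, port, and length-prefixed framing are assumptions, and RoCE, UDP, or QUIC transports would follow the same buffer-to-socket pattern:

    import socket
    import numpy as np

    HOST_ADDR = ("host-node-1", 9000)   # assumed address of the host-side receiver

    def stream_output(tensor):
        """Send one output tensor as length-prefixed network packets over TCP."""
        payload = np.asarray(tensor, dtype=np.float32).tobytes()
        with socket.create_connection(HOST_ADDR) as s:
            s.sendall(len(payload).to_bytes(8, "little"))
            s.sendall(payload)

    def receive_output(conn):
        """Host-side counterpart: drain the connection into a flat float32 tensor."""
        size = int.from_bytes(conn.recv(8), "little")
        buf = bytearray()
        while len(buf) < size:
            buf.extend(conn.recv(min(65536, size - len(buf))))
        return np.frombuffer(bytes(buf), dtype=np.float32)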
  • Figure 8 is a message sequence chart 800 illustrating one implementation of one or more of the reconfigurable processors (RP) 142a issuing remote procedure calls to the first host processor 102a.
  • the compiler 112a compiles an application 802 to generate a graph that includes one or more configuration files for the application 802.
  • the compiler 112a sends the graph to the runtime logic 122a for execution.
  • the runtime logic 122a loads the configuration files on one or more of the reconfigurable processors 142a. Once the configuration files have been loaded, the runtime logic 122a triggers execution of the bit file by sending an ‘execution’ (or start) command to the reconfigurable processors by writing to a particular CSR that exists for this purpose.
  • the reconfigurable processors 142a process the configuration files and generate outputs (e.g., vectors, tensors).
  • the reconfigurable processors 142a issue one or more remote procedure calls to the first host processor 102a using sender buffers 832 (or reconfigurable processors-to-host processor buffers).
  • Examples of the sender buffers 832 include host sender buffers 212a, host sender buffers 212n, interface sender buffers 312a, interface sender buffers 312n, reconfigurable processor sender buffers 412a, and reconfigurable processor sender buffers 412n.
  • the reconfigurable processors 142a notify the first host processor 102a of error reporting using the remote procedure calls.
  • the reconfigurable processors 142a use at least one of the sender buffers 832 to send one or more argument values to the first host processor 102a for execution of the remote procedure calls.
  • the sender buffers 832 provide the remote procedure calls and the argument values to the runtime logic 122a.
  • one or more responses to the remote procedure calls are sent to the reconfigurable processors 142a via the buffers (e.g., sender buffers of the first host processor 102a and receiver buffers of the reconfigurable processors 142a).
  • Other implementations may perform the operations in different orders and/or with different, fewer, or additional operations than the ones illustrated in Figure 8. Multiple operations can be combined in some implementations.
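  • One way to picture the remote procedure call flow of Figure 8 (sender buffers carry the call and its argument values to the runtime logic, which returns responses through the reverse buffers) is the queue-based Python sketch below; the Queue objects, handler table, and example call are assumptions standing in for the hardware buffers:

    import threading
    from queue import Queue

    rp_sender_buffers = Queue()    # RP -> host: remote procedure calls + argument values
    host_sender_buffers = Queue()  # host -> RP: responses to the calls

    def host_runtime_service(handlers):
        """Host side: drain the buffer, execute the named procedure with the
        supplied argument values, and return the result through the reverse buffers."""
        procedure, args = rp_sender_buffers.get()
        host_sender_buffers.put(handlers[procedure](*args))

    # Usage: the host service runs concurrently with the RP-side caller.
    handlers = {"report_error": lambda code, msg: f"logged {code}: {msg}"}
    threading.Thread(target=host_runtime_service, args=(handlers,), daemon=True).start()

    rp_sender_buffers.put(("report_error", (17, "checksum mismatch")))  # RP issues the call
    print(host_sender_buffers.get())                                    # RP receives the response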
  • Figure 9 is a message sequence chart 900 illustrating one implementation of one or more of the reconfigurable processors (RP) 142n issuing remote procedure calls to the first host processor 102a.
  • the compiler 112a compiles an application 902 to generate a graph that includes one or more configuration files for the application 902.
  • the compiler 112a sends the graph to the runtime logic 122a for execution.
  • the runtime logic 122a loads the configuration files on one or more of the reconfigurable processors 142n. Once the configuration files have been loaded, the runtime logic 122a triggers execution of the bit file by sending an ‘execution’ (or start) command to the reconfigurable processors by writing to a particular CSR that exists for this purpose.
  • the reconfigurable processors 142n process the configuration files and generate outputs (e.g., vectors, tensors).
  • the reconfigurable processors 142n issue one or more remote procedure calls to the first host processor 102a using sender buffers 932 (or reconfigurable processors-to-host processor buffers).
  • Examples of the sender buffers 932 include host sender buffers 212a, host sender buffers 212n, interface sender buffers 312a, interface sender buffers 312n, reconfigurable processor sender buffers 412a, and reconfigurable processor sender buffers 412n.
  • the reconfigurable processors 142n notify the first host processor 102a of error reporting using the remote procedure calls.
  • the reconfigurable processors 142n use at least one of the sender buffers 932 to send one or more argument values to the first host processor 102a for execution of the remote procedure calls.
  • the sender buffers 932 provide the remote procedure calls and the argument values to the runtime logic 122a.
  • one or more responses to the remote procedure calls are sent to the reconfigurable processors 142n via the buffers (e.g., sender buffers of the first host processor 102a and receiver buffers of the reconfigurable processors 142n).
  • Other implementations may perform the operations in different orders and/or with different, fewer, or additional operations than the ones illustrated in Figure 9. Multiple operations can be combined in some implementations.
  • operations three and seven comprise streaming network packets between one or more reconfigurable processors (e.g., RPs 142n) on the second processing node and a host processor (e.g., host 102a) on the first processing node over the network fabric 136 (e.g., Ethernet, InfiniBand (IB)) using protocols like RDMA over Converged Ethernet (RoCE), TCP, User Datagram Protocol (UDP), and Quick UDP Internet Connections (QUIC).
  • Figure 10 is a message sequence chart 1000 illustrating one implementation of the testing logic 512a reporting test statistics for execution of test configuration files on one or more of the reconfigurable processors (RP) 142a.
  • the compiler 112a compiles a test application 1002 to generate a test graph that includes one or more test configuration files for the test application 1002.
  • the compiler 112a sends the test graph to the runtime logic 122a for execution.
  • the runtime logic 122a loads the test configuration files on one or more of the reconfigurable processors 142a.
  • the runtime logic 122a triggers execution of the bit file by sending an ‘execution’ (or start) command to the reconfigurable processors by writing to a particular CSR that exists for this purpose.
  • the reconfigurable processors 142a process the test configuration files and generate test outputs (e.g., vectors, tensors).
  • the reconfigurable processors 142a send the test outputs to sender buffers 1032 (or reconfigurable processors-to-host processor buffers).
  • Examples of the sender buffers 1032 include host sender buffers 212a, host sender buffers 212n, interface sender buffers 312a, interface sender buffers 312n, reconfigurable processor sender buffers 412a, and reconfigurable processor sender buffers 412n.
  • the sender buffers 1032 provide the test outputs to the testing logic 512a.
  • the testing logic 512a determines test statistics based on the test outputs.
  • the testing logic 512a reports the test statistics to a test output file on the first host processor 102a.
  • Other implementations may perform the operations in different orders and/or with different, fewer, or additional operations than the ones illustrated in Figure 10. Multiple operations can be combined in some implementations.
  • operations three and six comprise streaming network packets between reconfigurable processors (e.g., RPs 142a) and a host processor (e.g., host 102a) on a same processing node 1 over local buses (e.g., PCIe buses) using a protocol like Transmission Control Protocol (TCP).
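  • The test-statistics step of Figure 10 (operations seven and eight) can be sketched as follows; the specific statistics, tolerance, and JSON output file are illustrative assumptions rather than details of the disclosed testing logic:

    import json
    import numpy as np

    def report_test_statistics(test_outputs, expected_outputs,
                               out_path="test_output.json", atol=1e-5):
        """Determine simple pass/fail statistics from the test outputs and
        report them to a test output file on the host processor."""
        results = [np.allclose(got, want, atol=atol)
                   for got, want in zip(test_outputs, expected_outputs)]
        stats = {
            "total": len(results),
            "passed": int(sum(results)),
            "failed": len(results) - int(sum(results)),
            "pass_rate": float(sum(results)) / max(len(results), 1),
        }
        with open(out_path, "w") as f:
            json.dump(stats, f, indent=2)
        return stats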
  • Figure 11 is a message sequence chart 1100 illustrating one implementation of the testing logic 512a reporting test statistics for execution of test configuration files on one or more of the reconfigurable processors (RP) 142n.
  • the compiler 112a compiles a test application 1102 to generate a test graph that includes one or more test configuration files for the test application 1102.
  • the compiler 112a sends the test graph to the runtime logic 122a for execution.
  • the runtime logic 122a loads the test configuration files on one or more of the reconfigurable processors 142n.
  • the runtime logic 122a triggers execution of the bit file by sending an ‘execution’ (or start) command to the reconfigurable processors by writing to a particular CSR that exists for this purpose.
  • the reconfigurable processors 142n process the test configuration files and generate test outputs (e.g., vectors, tensors).
  • the reconfigurable processors 142n send the test outputs to sender buffers 1132 (or reconfigurable processors-to-host processor buffers).
  • Examples of the sender buffers 1132 include host sender buffers 212a, host sender buffers 212n, interface sender buffers 312a, interface sender buffers 312n, reconfigurable processor sender buffers 412a, and reconfigurable processor sender buffers 412n.
  • the sender buffers 1132 provide the test outputs to the testing logic 512a.
  • the testing logic 512a determines test statistics based on the test outputs.
  • the testing logic 512a reports the test statistics to a test output file on the first host processor 102a.
  • Other implementations may perform the operations in different orders and/or with different, fewer, or additional operations than the ones illustrated in Figure 11. Multiple operations can be combined in some implementations.
  • operations three and six comprise streaming network packets between one or more reconfigurable processors (e.g., RPs 142n) on the second processing node and a host processor (e.g., host 102a) on the first processing node over the network fabric 136 (e.g., Ethernet, InfiniBand (IB)) using protocols like RDMA over Converged Ethernet (RoCE), TCP, User Datagram Protocol (UDP), and Quick UDP Internet Connections (QUIC).
  • Figure 12 is a message sequence chart 1200 illustrating one implementation of executing a first set of functions in configuration files and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on one or more of the reconfigurable processors (RP) 142a and executing a second set of functions and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) in the configuration files on the first host processor 102a.
  • the compiler 112a receives an application 1202 for compilation.
  • the compiler 112a compiles the application 1202 to generate one or more configuration files 1212.
  • the configuration files 1212 include a plurality of functions.
  • the plurality of functions includes a first set of functions 1214 and a second set of functions 1224.
  • Examples of functions in the plurality of functions include non-linearities like Rectified Linear Unit (ReLU) and its variants (e.g., leaky ReLU), hyperbolic tangent, sigmoid, and softmax, element-wise addition, matrix multiplication (e.g., General Matrix Multiply (GeMM)), layer normalization (e.g., batch normalization), loss functions like cross-entropy, and tensor shape modifiers like transpose.
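  • For reference, a few of the listed functions are reproduced below as plain NumPy definitions; in practice the configuration files implement them as dataflow pipelines on the reconfigurable processors rather than as Python code:

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def leaky_relu(x, slope=0.01):
        return np.where(x > 0, x, slope * x)

    def softmax(x):
        e = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return e / np.sum(e, axis=-1, keepdims=True)

    def gemm(a, b, c=None):                      # General Matrix Multiply
        out = a @ b
        return out if c is None else out + c

    def cross_entropy(probs, labels):            # probs: (batch, classes), labels: (batch,)
        return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

    def transpose(x, axes=None):                 # tensor shape modifier
        return np.transpose(x, axes)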
  • the compiler 112a sends the configuration files 1212 to the runtime logic 122a for execution.
  • the runtime logic 122a loads the first set of functions 1214 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) and the second set of functions 1224 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on one or more of the reconfigurable processors 142a.
  • the reconfigurable processors 142a process the first set of functions 1214 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) and generate a first set of outputs (e.g., vectors, tensors).
  • the reconfigurable processors 142a transmit functions in the second set of functions 1224 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the first host processor 102a using one or more reconfigurable processors-to-host processor buffers.
  • data on which the functions in the second set of functions 1224 are executed is transmitted to the first host processor 102a using the reconfigurable processors-to-host processor buffers.
  • respective ones of the reconfigurable processors-to-host processor buffers are used to transmit respective functions in the second set of functions 1224 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the first host processor 102a.
  • One example workload sharing flow includes using one or more of the reconfigurable processor sender buffers 412a and one or more of the host receiver buffers 202a.
  • the reconfigurable processors 142a transmit the functions in the second set of functions 1224 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the reconfigurable processor sender buffers 412a.
  • the reconfigurable processor sender buffers 412a transmit the functions in the second set of functions 1224 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the host receiver buffers 202a.
  • the host receiver buffers 202a transmit the functions in the second set of functions 1224 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the first host processor 102a.
  • the first host processor 102a executes the functions in the second set of functions 1224 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to generate a second set of outputs (or results 1234) (e.g., vectors, tensors).
  • the first host processor 102a transmits the results 1234 to one or more of the reconfigurable processors 142a using one or more host processor-to-reconfigurable processors buffers.
  • respective ones of the host processor-to-reconfigurable processors buffers are used to transmit respective results of executing respective functions in the second set of functions 1224 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the reconfigurable processors 142a.
  • One workload sharing flow includes using one or more of the host sender buffers 212a and one or more of the reconfigurable processor receiver buffers 402a.
  • the first host processor 102a transmits the results 1234 to the host sender buffers 212a.
  • the host sender buffers 212a transmit the results 1234 to the reconfigurable processor receiver buffers 402a.
  • the reconfigurable processor receiver buffers 402a transmit the results 1234 to the reconfigurable processors 142a.
  • one or more functions in the first set of functions 1214 waits for results of execution of one or more functions in the second set of functions 1224 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on the first host processor 102a to combine the results with results of execution of one or more functions in the first set of functions 1214 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on the reconfigurable processors 142a.
  • the first set of functions 1214 and the second set of functions 1224 operate separately and in parallel.
  • one or more functions in the second set of functions 1224 daisy chain the results to one or more functions in the first set of functions 1214, and vice-versa.
  • one or more functions in the second set of functions 1224 execute for a certain number of iterations before returning the results to the reconfigurable processors 142a.
  • Other implementations may perform the operations in different orders and/or with different, fewer, or additional operations than the ones illustrated in Figure 12. Multiple operations can be combined in some implementations.
  • operations six, seven, eight, ten, eleven, and twelve comprise streaming network packets between reconfigurable processors (e.g., RPs 142a) and a host processor (e.g., host 102a) on a same processing node 1 over local buses (e.g., PCIe buses) using a protocol like Transmission Control Protocol (TCP).
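  • The workload sharing flow of Figure 12 (the reconfigurable processors execute the first set of functions locally and offload the second set, with its data, to the host processor through the buffers) can be modeled with the Python sketch below; the Queue objects stand in for the sender/receiver buffers, and the host worker is assumed to run on its own thread or process:

    from queue import Queue

    rp_to_host = Queue()   # reconfigurable-processor sender buffers -> host receiver buffers
    host_to_rp = Queue()   # host sender buffers -> reconfigurable-processor receiver buffers

    def host_processor_worker(host_fns):
        """Host side: execute each offloaded function on the transmitted data and
        return the result through the host-to-RP buffers; None ends the stream."""
        while True:
            item = rp_to_host.get()
            if item is None:
                break
            name, data = item
            host_to_rp.put(host_fns[name](data))

    def rp_execute(first_set, second_set_names, x):
        """RP side: run the first set of functions locally, offload each function
        in the second set to the host, and fold the returned results back in."""
        local = x
        for fn in first_set:
            local = fn(local)
        for name in second_set_names:
            rp_to_host.put((name, local))   # transmit function id and its data
            local = host_to_rp.get()        # wait for the host result before continuing
        rp_to_host.put(None)
        return local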
  • Figure 13 is a message sequence chart 1300 illustrating one implementation of executing a first set of functions in configuration files and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on one or more of the reconfigurable processors (RP) 142a and executing a second set of functions and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) in the configuration files on the second host processor 102n.
  • the compiler 112a receives an application 1302 for compilation.
  • the compiler 112a compiles the application 1302 to generate one or more configuration files 1312.
  • the configuration files 1312 include a plurality of functions.
  • the plurality of functions includes a first set of functions 1314 and a second set of functions 1324.
  • Examples of functions in the plurality of functions include non-linearities like Rectified Linear Unit (ReLU) and its variants (e.g., leaky ReLU), hyperbolic tangent, sigmoid, and softmax, element-wise addition, matrix multiplication (e.g., General Matrix Multiply (GeMM)), layer normalization (e.g., batch normalization), loss functions like cross-entropy, and tensor shape modifiers like transpose.
  • the compiler 112a sends the configuration files 1312 to the runtime logic 122a for execution.
  • the runtime logic 122a loads the first set of functions 1314 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) and the second set of functions 1324 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on one or more of the reconfigurable processors 142a.
  • the reconfigurable processors 142a process the first set of functions 1314 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) and generate a first set of outputs (e.g., vectors, tensors).
  • the reconfigurable processors 142a transmit functions in the second set of functions 1324 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the second host processor 102n using one or more reconfigurable processors-to-host processor buffers.
  • data on which the functions in the second set of functions 1324 are executed is transmitted to the second host processor 102n using the reconfigurable processors-to-host processor buffers.
  • respective ones of the reconfigurable processors-to-host processor buffers are used to transmit respective functions in the second set of functions 1324 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the second host processor 102n.
  • One example workload sharing flow includes using one or more of the reconfigurable processor sender buffers 412a and one or more of the host receiver buffers 202n.
  • the reconfigurable processors 142a transmit the functions in the second set of functions 1324 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the reconfigurable processor sender buffers 412a.
  • the reconfigurable processor sender buffers 412a transmit the functions in the second set of functions 1324 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the host receiver buffers 202n.
  • the host receiver buffers 202n transmit the functions in the second set of functions 1324 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the second host processor 102n.
  • the second host processor 102n executes the functions in the second set of functions 1324 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to generate a second set of outputs (or results 1334) (e.g., vectors, tensors).
  • the second host processor 102n transmits the results 1334 to one or more of the reconfigurable processors 142a using one or more host processor-to-reconfigurable processors buffers.
  • respective ones of the host processor-to-reconfigurable processors buffers are used to transmit respective results of executing respective functions in the second set of functions 1324 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the reconfigurable processors 142a.
  • One workload sharing flow includes using one or more of the host sender buffers 212n and one or more of the reconfigurable processor receiver buffers 402a.
  • the second host processor 102n transmits the results 1334 to the host sender buffers 212n.
  • the host sender buffers 212n transmit the results 1334 to the reconfigurable processor receiver buffers 402a.
  • the reconfigurable processor receiver buffers 402a transmit the results 1334 to the reconfigurable processors 142a.
  • one or more functions in the first set of functions 1314 wait for results of execution of one or more functions in the second set of functions 1324 on the second host processor 102n to combine the results with results of execution of one or more functions in the first set of functions 1314 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on the reconfigurable processors 142a.
  • the first set of functions 1314 and the second set of functions 1324 operate separately and in parallel.
  • one or more functions in the second set of functions 1324 daisy chain the results to one or more functions in the first set of functions 1314, and vice-versa.
  • one or more functions in the second set of functions 1324 executes for a certain number of iterations before returning the results to the reconfigurable processors 142a.
  • Other implementations may perform the operations in different orders and/or with different, fewer, or additional operations than the ones illustrated in Figure 13. Multiple operations can be combined in some implementations.
  • operations six, seven, eight, ten, eleven, and twelve comprise streaming network packets between one or more reconfigurable processors (e.g., RPs 142a) on the first processing node and a host processor (e.g., host 102n) on the second processing node over the network fabric 136 (e.g., Ethernet, InfiniBand (IB)) using protocols like RDMA over Converged Ethernet (RoCE), TCP, User Datagram Protocol (UDP), and Quick UDP Internet Connections (QUIC).
  • Figure 14A shows sender and receiver buffers used by individual reconfigurable processors in the reconfigurable processors 142a.
  • Reconfigurable processor 1 (RP 1) receiver buffers 1402a and reconfigurable processor 1 (RP 1) sender buffers 1412a are used by a first reconfigurable processor in the reconfigurable processors 142a to receive data from and send data to another host processor or reconfigurable processor in the data center 100.
  • Reconfigurable processor n (RP n) receiver buffers 1422a and reconfigurable processor n (RP n) sender buffers 1432a are used by a second reconfigurable processor in the reconfigurable processors 142a to receive data from and send data to another host processor or reconfigurable processor in the data center 100.
  • the reconfigurable processor 1 receiver buffers 1402a, the reconfigurable processor 1 sender buffers 1412a, the reconfigurable processor n receiver buffers 1422a, and the reconfigurable processor n sender buffers 1432a are located in the reconfigurable processor memory 162a.
  • Figure 14B shows sender and receiver buffers used by individual reconfigurable processors in the reconfigurable processors 142n.
  • Reconfigurable processor 1 (RP 1) receiver buffers 1402n and reconfigurable processor 1 (RP 1) sender buffers 1412n are used by a first reconfigurable processor in the reconfigurable processors 142n to receive data from and send data to another host processor or reconfigurable processor in the data center 100.
  • Reconfigurable processor n (RP n) receiver buffers 1422n and reconfigurable processor n (RP n) sender buffers 1432n are used by a second reconfigurable processor in the reconfigurable processors 142n to receive data from and send data to another host processor or reconfigurable processor in the data center 100.
  • the reconfigurable processor 1 receiver buffers 1402n, the reconfigurable processor 1 sender buffers 1412n, the reconfigurable processor n receiver buffers 1422n, and the reconfigurable processor n sender buffers 1432n are located in the reconfigurable processor memory 162n.
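  • A simple way to model the per-processor buffer organization of Figures 14A and 14B is one sender/receiver buffer set per reconfigurable processor, allocated out of reconfigurable processor memory; the sketch below uses software queues and assumed names purely for illustration:

    from dataclasses import dataclass, field
    from queue import Queue

    @dataclass
    class RPBufferSet:
        """Per-reconfigurable-processor sender and receiver buffers, analogous to
        the RP 1 / RP n buffers of Figures 14A and 14B; queues stand in for
        buffers carved out of reconfigurable processor memory 162a/162n."""
        receiver_buffers: Queue = field(default_factory=Queue)
        sender_buffers: Queue = field(default_factory=Queue)

    # One buffer set per reconfigurable processor on each processing node.
    node_1_buffers = {rp: RPBufferSet() for rp in ("RP 1", "RP n")}
    node_n_buffers = {rp: RPBufferSet() for rp in ("RP 1'", "RP n'")}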
  • Figure 15 is a message sequence chart 1500 illustrating one implementation of executing a first set of functions in configuration files and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on a first reconfigurable processor in the reconfigurable processors 142a and executing a second set of functions in the configuration files and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on a second reconfigurable processor in the reconfigurable processors 142a.
  • the first reconfigurable processor is identified as “RP 1” and the second reconfigurable processor is identified as “RP N.”
  • the first reconfigurable processor and the second reconfigurable processor are operatively coupled to a same processing node, i.e., the first processing node. This is referred to herein as “intra-node processing.”
  • the compiler 112a receives an application 1502 for compilation.
  • the compiler 112a compiles the application 1502 to generate one or more configuration files 1512.
  • the configuration files 1512 include a plurality of functions.
  • the plurality of functions includes a first set of functions 1514 and a second set of functions 1524.
  • Examples of functions in the plurality of functions include non-linearities like Rectified Linear Unit (ReLU) and its variants (e.g., leaky ReLU), hyperbolic tangent, sigmoid, and softmax, element-wise addition, matrix multiplication (e.g., General Matrix Multiply (GeMM)), layer normalization (e.g., batch normalization), loss functions like cross-entropy, and tensor shape modifiers like transpose.
  • the compiler 112a sends the configuration files 1512 to the runtime logic 122a for execution.
  • the runtime logic 122a loads the first set of functions 1514 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) and the second set of functions 1524 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on the first reconfigurable processor.
  • the first reconfigurable processor processes the first set of functions 1514 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) and generates a first set of outputs (e.g., vectors, tensors).
  • the first reconfigurable processor transmits functions in the second set of functions 1524 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the second reconfigurable processor using one or more reconfigurable processors-to-reconfigurable processors buffers.
  • data on which the functions in the second set of functions 1524 are executed is transmitted to the second reconfigurable processor using the reconfigurable processors-to-reconfigurable processors buffers.
  • respective ones of the reconfigurable processors-to-reconfigurable processors buffers are used to transmit respective functions in the second set of functions 1524 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the second reconfigurable processor.
  • One example workload sharing flow includes using one or more of the reconfigurable processor 1 (RP 1) sender buffers 1412a and one or more of the reconfigurable processor N (RP N) receiver buffers 1422a.
  • the first reconfigurable processor transmits the functions in the second set of functions 1524 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the reconfigurable processor 1 sender buffers 1412a.
  • the reconfigurable processor 1 sender buffers 1412a transmit the functions in the second set of functions 1524 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the reconfigurable processor N receiver buffers 1422a.
  • the reconfigurable processor N receiver buffers 1422a transmit the functions in the second set of functions 1524 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the second reconfigurable processor.
  • the second reconfigurable processor executes the functions in the second set of functions 1524 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to generate a second set of outputs (or results 1534) (e.g., vectors, tensors).
  • the second reconfigurable processor transmits the results 1534 to the first reconfigurable processor using one or more of the reconfigurable processors-to-reconfigurable processors buffers.
  • respective ones of the reconfigurable processors-to-reconfigurable processors buffers are used to transmit respective results of executing respective functions in the second set of functions 1524 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the first reconfigurable processor.
  • One workload sharing flow includes using one or more of the reconfigurable processor N (RP N) sender buffers 1432a and one or more of the reconfigurable processor 1 (RP 1) receiver buffers 1402a.
  • the second reconfigurable processor transmits the results 1534 to the reconfigurable processor N sender buffers 1432a.
  • the reconfigurable processor N sender buffers 1432a transmit the results 1534 to the reconfigurable processor 1 receiver buffers 1402a.
  • the reconfigurable processor 1 receiver buffers 1402a transmit the results 1534 to the first reconfigurable processor.
  • one or more functions in the first set of functions 1514 waits for results of execution of one or more functions in the second set of functions 1524 on the second reconfigurable processor to combine the results with results of execution of one or more functions in the first set of functions 1514 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on the first reconfigurable processor.
  • the first set of functions 1514 and the second set of functions 1524 operate separately and in parallel.
  • one or more functions in the second set of functions 1524 daisy chain the results to one or more functions in the first set of functions 1514, and vice-versa.
  • one or more functions in the second set of functions 1524 executes for a certain number of iterations before returning the results to the first reconfigurable processor.
  • Other implementations may perform the operations in different orders and/or with different, fewer, or additional operations than the ones illustrated in Figure 15. Multiple operations can be combined in some implementations.
  • operations six, seven, eight, ten, eleven, and twelve comprise streaming network packets between reconfigurable processors on a same processing node 1 over local buses (e.g., PCIe buses) using a protocol like Transmission Control Protocol (TCP).
  • Figure 16 is a message sequence chart 1600 illustrating one implementation of executing a first set of functions in configuration files and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on a first reconfigurable processor in the reconfigurable processors 142a and executing a second set of functions in the configuration files and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on a first reconfigurable processor in the reconfigurable processors 142n.
  • the first reconfigurable processor in the reconfigurable processors 142a is identified as “RP 1” and the first reconfigurable processor in the reconfigurable processors 142n is identified as “RP 1'.”
  • the first reconfigurable processor in the reconfigurable processors 142a and the first reconfigurable processor in the reconfigurable processors 142n are operatively coupled to different processing nodes, i.e., the first processing node and the second processing node. This is referred to herein as “inter-node processing.”
  • the compiler 112a receives an application 1602 for compilation.
  • the compiler 112a compiles the application 1602 to generate one or more configuration files 1612.
  • the configuration files 1612 include a plurality of functions.
  • the plurality of functions includes a first set of functions 1614 and a second set of functions 1624.
  • functions in the plurality of functions include non-linearities like Rectified Linear Unit (ReLU) and its variants (e.g., leaky ReLU), hyperbolic tangent, sigmoid, and softmax, element-wise addition, matrix multiplication (e.g., General Matrix Multiply (GeMM)), layer normalization (e.g., batch normalization), loss functions like cross-entropy, and tensor shape modifiers like transpose.
  • the compiler 112a sends the configuration files 1612 to the runtime logic 122a for execution.
  • the runtime logic 122a loads the first set of functions 1614 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) and the second set of functions 1624 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on the first reconfigurable processor in the reconfigurable processors 142a.
  • the first reconfigurable processor in the reconfigurable processors 142a processes the first set of functions 1614 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) and generates a first set of outputs (e.g., vectors, tensors).
  • the first reconfigurable processor in the reconfigurable processors 142a transmits functions in the second set of functions 1624 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the first reconfigurable processor in the reconfigurable processors 142n using one or more reconfigurable processors-to-reconfigurable processors buffers.
  • data on which the functions in the second set of functions 1624 are executed is transmitted to the first reconfigurable processor in the reconfigurable processors 142n using the reconfigurable processors-to-reconfigurable processors buffers.
  • respective ones of the reconfigurable processors-to-reconfigurable processors buffers are used to transmit respective functions in the second set of functions 1624 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the first reconfigurable processor in the reconfigurable processors 142n.
  • One example workload sharing flow includes using one or more of the reconfigurable processor 1 (RP 1) sender buffers 1412a and one or more of the reconfigurable processor 1' (RP 1') receiver buffers 1402n.
  • the first reconfigurable processor in the reconfigurable processors 142a transmits the functions in the second set of functions 1624 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the reconfigurable processor 1 sender buffers 1412a.
  • the reconfigurable processor 1 sender buffers 1412a transmit the functions in the second set of functions 1624 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the reconfigurable processor 1' receiver buffers 1402n.
  • the reconfigurable processor 1' receiver buffers 1402n transmit the functions in the second set of functions 1624 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the first reconfigurable processor in the reconfigurable processors 142n.
  • the first reconfigurable processor in the reconfigurable processors 142n executes the functions in the second set of functions 1624 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to generate a second set of outputs (or results 1634) (e.g., vectors, tensors).
  • the first reconfigurable processor in the reconfigurable processors 142n transmits the results 1634 to the first reconfigurable processor in the reconfigurable processors 142a using one or more of the reconfigurable processors-to-reconfigurable processors buffers.
  • respective ones of the reconfigurable processors-to-reconfigurable processors buffers are used to transmit respective results of executing respective functions in the second set of functions 1624 and/or data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) to the first reconfigurable processor in the reconfigurable processors 142a.
  • One workload sharing flow includes using one or more of the reconfigurable processor 1' (RP 1') sender buffers 1412n and one or more of the reconfigurable processor 1 (RP 1) receiver buffers 1402a.
  • the first reconfigurable processor in the reconfigurable processors 142n transmits the results 1634 to the reconfigurable processor 1' sender buffers 1412n.
  • the reconfigurable processor 1' sender buffers 1412n transmit the results 1634 to the reconfigurable processor 1 receiver buffers 1402a.
  • the reconfigurable processor 1 receiver buffers 1402a transmit the results 1634 to the first reconfigurable processor in the reconfigurable processors 142a.
  • one or more functions in the first set of functions 1614 waits for results of execution of one or more functions in the second set of functions 1624 on the first reconfigurable processor in the reconfigurable processors 142n to combine the results with results of execution of one or more functions in the first set of functions 1614 and/or the data therefor (e.g., weights, coefficients, vectors, tensors (image data, audio data, natural language processing (NLP data)), control data (e.g., control tokens)) on the first reconfigurable processor in the reconfigurable processors 142a.
  • the first set of functions 1614 and the second set of functions 1624 operate separately and in parallel.
  • one or more functions in the second set of functions 1624 daisy chain the results to one or more functions in the first set of functions 1614, and vice-versa.
  • one or more functions in the second set of functions 1624 executes for a certain number of iterations before returning the results to the first reconfigurable processor in the reconfigurable processors 142a.
  • Other implementations may perform the operations in different orders and/or with different, fewer, or additional operations than the ones illustrated in Figure 16. Multiple operations can be combined in some implementations.
  • operations six, seven, eight, ten, eleven, and twelve comprise streaming network packets between reconfigurable processors on different processing nodes 1 and n over the network fabric 136 (e.g., Ethernet, InfiniBand (IB)) using protocols like RDMA over Converged Ethernet (RoCE), TCP, User Datagram Protocol (UDP), and Quick UDP Internet Connections (QUIC).
  • Figure 17A is a message sequence chart 1700A illustrating one implementation of asynchronous tensor streaming in which a next tensor is buffered while a reconfigurable processor is processing a current tensor.
  • a reconfigurable processor in the data center 100 (e.g., one or more of the reconfigurable processors 142a) is configured to execute one or more configuration files using a series of data units 1712.
  • the series of data units 1712 includes a sequence of tensors 1 to N.
  • a first plurality of buffers 1704 is configured to receive data units in the series of data units 1712 from a source memory 1702 (e.g., host memory 134a, host memory 134n), and to stream the data units to the reconfigurable processor for processing.
  • buffers in the first plurality of buffers 1704 include First-In, First-Out (FIFO) buffers, First-In, Last-Out (FILO) buffers, Last-In, First-Out (LIFO) buffers, Last-In, Last-Out (LILO) buffers, and circular buffers.
  • the buffers in the first plurality of buffers 1704 can be of size 8 bytes, 16 bytes, 32 bytes, 64 bytes, 128 bytes, 256 bytes, and so on, or any convenient size appropriate for the transfer of data between the host processor, the network interface controllers, and the reconfigurable processors.
  • a second plurality of buffers 1706 is configured to stream results of processing the data units from the reconfigurable processor, and to send the results to a destination memory 1708 (e.g., reconfigurable processor memory 162a, reconfigurable processor memory 162n) for storage.
  • Examples of buffers in the second plurality of buffers 1706 include First-In, First-Out (FIFO) buffers, First-In, Last-Out (FILO) buffers, Last-In, First-Out (LIFO) buffers, Last-In, Last-Out (LILO) buffers, and circular buffers.
  • the buffers in the second plurality of buffers 1706 can be of size 8 bytes, 16 bytes, 32 bytes, 64 bytes, 128 bytes, 256 bytes, and so on, or any convenient size appropriate for the transfer of data between the host processor, the network interface controllers, and the reconfigurable processors.
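  • As one concrete example of the buffer types listed above, a minimal fixed-size circular (ring) buffer is sketched below; the 128-byte default and the back-pressure behavior when the buffer is full are assumptions made for illustration:

    class CircularBuffer:
        """Fixed-size circular (ring) buffer over a byte array."""
        def __init__(self, size=128):
            self.data = bytearray(size)
            self.size = size
            self.head = 0    # next write position
            self.tail = 0    # next read position
            self.count = 0   # bytes currently stored

        def write(self, payload):
            written = 0
            for b in payload:
                if self.count == self.size:
                    break                  # full: producer must wait (back-pressure)
                self.data[self.head] = b
                self.head = (self.head + 1) % self.size
                self.count += 1
                written += 1
            return written

        def read(self, n):
            out = bytearray()
            while n > 0 and self.count > 0:
                out.append(self.data[self.tail])
                self.tail = (self.tail + 1) % self.size
                self.count -= 1
                n -= 1
            return bytes(out)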
  • a runtime logic (e.g., runtime logic 122a, runtime logic 122n) is configured to cause the buffers in the first plurality of buffers 1704 to receive a next data unit in the series of data units 1712 from the source memory 1702 while the reconfigurable processor processes a current data unit in the series of data units 1712.
  • the runtime logic is further configured to stream the next data unit to the reconfigurable processor for processing after the buffers in the second plurality of buffers 1706 stream results of processing the current data unit from the reconfigurable processor.
  • tensor 1 is the current data unit and tensors 2 and 3 are next data units.
  • the buffers in the first plurality of buffers 1704 receive tensor 1 from the source memory 1702.
  • the buffers in the first plurality of buffers 1704 stream tensor 1 to the reconfigurable processor.
  • the reconfigurable processor starts processing tensor 1. While the reconfigurable processor is processing tensor 1, the buffers in the first plurality of buffers 1704 receive tensors 2 and 3 from the source memory 1702 at timesteps four and five, respectively.
  • the reconfigurable processor streams results of processing tensor 1 (result 1) to the buffers in the second plurality of buffers 1706.
  • the buffers in the second plurality of buffers 1706 stream the results of processing tensor 1 to the destination memory 1708 for storage.
  • the buffers in the first plurality of buffers 1704 stream tensor 2 to the reconfigurable processor.
  • streaming of tensor 2 from the buffers in the first plurality of buffers 1704 to the reconfigurable processor precedes the streaming of the results of processing tensor 1 from the buffers in the second plurality of buffers 1706 to the destination memory 1708.
  • processing of tensors in one or more previous timesteps/iterations (e.g., tensors 2 and 3) by the reconfigurable processors 142a overlaps with the processing of a tensor in a current timestep/iteration (e.g., tensor 1) by the reconfigurable processors 142a. This is referred to herein as “meta-pipelining.” Other implementations may perform the steps in different orders and/or with different, fewer, or additional steps than the ones illustrated in Figure 17A. Multiple steps can be combined in some implementations.
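  • The overlap in Figure 17A (the first plurality of buffers receives tensors 2 and 3 while the reconfigurable processor is still processing tensor 1) can be modeled in software with bounded queues and a prefetching producer, as in the sketch below; the queue depth of two and the process/store callables are assumptions standing in for the reconfigurable processor and the destination memory:

    import threading
    from queue import Queue

    def run_asynchronous_streaming(source_tensors, process, store, depth=2):
        """Prefetch upcoming tensors into the first plurality of buffers while the
        current tensor is being processed, and drain results through the second
        plurality of buffers to the destination memory (modeled by `store`)."""
        first_buffers = Queue(maxsize=depth)    # source memory -> reconfigurable processor
        second_buffers = Queue(maxsize=depth)   # reconfigurable processor -> destination memory

        def producer():                         # keeps up to `depth` tensors buffered ahead
            for t in source_tensors:
                first_buffers.put(t)
            first_buffers.put(None)             # end-of-stream marker

        def consumer():                         # stores results as they become available
            while True:
                r = second_buffers.get()
                if r is None:
                    break
                store(r)

        threading.Thread(target=producer, daemon=True).start()
        drain = threading.Thread(target=consumer, daemon=True)
        drain.start()

        while True:                             # stands in for the reconfigurable processor
            t = first_buffers.get()
            if t is None:
                second_buffers.put(None)
                break
            second_buffers.put(process(t))
        drain.join()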
  • Figure 17B is a message sequence chart 1700B illustrating one implementation of asynchronous tensor streaming in which a next tensor is buffered before a reconfigurable processor processes a current tensor.
  • the runtime logic is further configured to cause the buffers in the first plurality of buffers 1704 to receive the next data unit from the source memory 1702 before the reconfigurable processor starts processing the current data unit.
  • the buffers in the first plurality of buffers 1704 receive tensor 1 from the source memory 1702.
  • the buffers in the first plurality of buffers 1704 stream tensor 1 to the reconfigurable processor.
  • the buffers in the first plurality of buffers 1704 receive tensors 2 and 3 from the source memory 1702 at timesteps three and four, respectively.
  • the reconfigurable processor starts processing tensor 1.
  • the reconfigurable processor streams results of processing tensor 1 (result 1) to the buffers in the second plurality of buffers 1706.
  • the buffers in the second plurality of buffers 1706 stream the results of processing tensor 1 to the destination memory 1708 for storage.
  • the buffers in the first plurality of buffers 1704 stream tensor 2 to the reconfigurable processor.
  • streaming of tensor 2 from the buffers in the first plurality of buffers 1704 to the reconfigurable processor precedes the streaming of the results of processing tensor 1 from the buffers in the second plurality of buffers 1706 to the destination memory 1708.
  • Other implementations may perform the steps in different orders and/or with different, fewer, or additional steps than the ones illustrated in Figure 17B. Multiple steps can be combined in some implementations.
  • Figure 17C is a message sequence chart 1700C illustrating one implementation of asynchronous tensor streaming in which a next tensor is buffered after a reconfigurable processor has processed a current tensor.
  • the runtime logic is further configured to cause the buffers in the first plurality of buffers 1704 to receive the next data unit from the source memory 1702 after the buffers in the second plurality of buffers 1706 stream the results of processing the current data unit from the reconfigurable processor.
  • the buffers in the first plurality of buffers 1704 receive tensor 1 from the source memory 1702.
  • the buffers in the first plurality of buffers 1704 stream tensor 1 to the reconfigurable processor.
  • the reconfigurable processor starts processing tensor 1.
  • the reconfigurable processor streams results of processing tensor 1 (result 1) to the buffers in the second plurality of buffers 1706.
  • the buffers in the first plurality of buffers 1704 receive tensors 2 and 3 from the source memory 1702 at timesteps five and six, respectively.
  • the buffers in the second plurality of buffers 1706 stream the results of processing tensor 1 to the destination memory 1708 for storage.
  • the buffers in the first plurality of buffers 1704 stream tensor 2 to the reconfigurable processor.
  • Other implementations may perform the steps in different orders and/or with different, fewer, or additional steps than the ones illustrated in Figure 17C. Multiple steps can be combined in some implementations.
• Figure 18 is a message sequence chart 1800 illustrating one implementation of executing configuration files on reconfigurable processors that are on different processing nodes in the data center 100. This is referred to herein as "inter-node execution of configuration files."
  • the data center 100 comprises a pool of reconfigurable dataflow resources.
  • the pool of reconfigurable dataflow resources includes a plurality of processing nodes (e.g., processing nodes 1 to n). Respective processing nodes in the plurality of processing nodes are operatively coupled to respective pluralities of reconfigurable processors (RPs) and respective pluralities of buffers.
  • the respective processing nodes are also operatively coupled to respective host processors.
  • the respective processing nodes are also operatively coupled to respective pluralities of Network Interface Controllers (NICs) or Smart Network Interface Controllers (SmartNICs).
  • buffers in the respective pluralities of buffers are located in respective memories of the respective pluralities of reconfigurable processors.
• the respective memories of the respective pluralities of reconfigurable processors include off-chip and/or on-chip memories like DRAM, NAND flash, SRAM, latches, flops, bypass networks, and registers.
  • the buffers are located in respective memories of NICs or SmartNICs in the respective pluralities of NICs or SmartNICs.
  • the buffers are located in respective memories of host processors (e.g., RAM/ROM, caches) in the respective host processors.
  • the buffers can be located in or attached to any network component of the data center 100 such as PCIe buses, Double Data Rate (DDR) channels, Dual In-Line Memory Modules (DIMMs), routers, and switches.
  • the buffers can be First-In, First-Out (FIFO) buffers, First-In, Last-Out (FILO) buffers, Last-In, First-Out (LIFO) buffers, Last-In, Last-Out (LILO) buffers, or circular buffers.
  • the buffers can be of size 8 bytes, 16 bytes, 32 bytes, 64 bytes, 128 bytes, 256 bytes, and so on, or any convenient size appropriate for the transfer of data between the host processor, the network interface controllers, and the reconfigurable processors.
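• As a rough sketch under assumptions of our own (the document does not prescribe a concrete data structure), one of the circular buffers mentioned above, sized at the 128-byte example, could be modeled as follows; the class and method names are hypothetical.

```python
# Illustrative circular buffer with head/tail pointers; the 128-byte
# capacity matches one of the example sizes listed above.
class CircularBuffer:
    def __init__(self, capacity=128):
        self.data = bytearray(capacity)
        self.capacity = capacity
        self.head = 0   # next byte to read
        self.tail = 0   # next byte to write
        self.count = 0  # bytes currently stored

    def write(self, payload: bytes) -> int:
        written = 0
        for b in payload:
            if self.count == self.capacity:
                break                        # buffer full; producer must wait
            self.data[self.tail] = b
            self.tail = (self.tail + 1) % self.capacity
            self.count += 1
            written += 1
        return written

    def read(self, n: int) -> bytes:
        out = bytearray()
        while n > 0 and self.count > 0:
            out.append(self.data[self.head])
            self.head = (self.head + 1) % self.capacity
            self.count -= 1
            n -= 1
        return bytes(out)

buf = CircularBuffer()
buf.write(b"tensor bytes")
print(buf.read(6))   # b'tensor'
```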
  • a compiler 1812 compiles applications 1802 (operation one) and generates configuration files 1822 (operation two).
  • the configuration files 1822 specify configurations of virtual dataflow resources 1824 required to execute the configuration files 1822.
• the virtual dataflow resources 1824 include a first virtual reconfigurable processor 1824a1 in a first virtual processing node 1824a, a second virtual reconfigurable processor 1824b1 in a second virtual processing node 1824b, and virtual buffers 1824c that stream data between the first virtual reconfigurable processor 1824a1 and the second virtual reconfigurable processor 1824b1.
• the virtual buffers 1824c comprise first virtual SmartNIC buffers 1824c1 and second virtual SmartNIC buffers 1824c2.
  • a runtime processor 1832 is operatively coupled to the pool of reconfigurable dataflow resources and configured to receive the configuration files 1822 (operation three).
  • the runtime processor 1832 comprises a runtime logic 1842 and an allocation logic 1844.
  • the allocation logic 1844 is configured to allocate reconfigurable dataflow resources in the pool of reconfigurable dataflow resources to the virtual dataflow resources 1824 (operation four).
• the allocated reconfigurable dataflow resources include a first processing node in the respective processing nodes allocated to the first virtual processing node 1824a, a second processing node in the respective processing nodes allocated to the second virtual processing node 1824b, a first reconfigurable processor, operatively coupled to the first processing node, allocated to the first virtual reconfigurable processor 1824a1, a second reconfigurable processor, operatively coupled to the second processing node, allocated to the second virtual reconfigurable processor 1824b1, and a first plurality of buffers, operatively coupled to the first processing node, and a second plurality of buffers, operatively coupled to the second processing node, allocated to the virtual buffers 1824c.
  • the runtime logic 1842 is configured to execute the configuration files 1822 using the allocated reconfigurable dataflow resources (operation five). The runtime logic 1842 also writes starting weights to the RPs in an embodiment.
  • buffers can be allocated for inter-node streaming of configuration data (e.g., bit stream) by mapping physical memory addresses of the buffers to memories of different network components in the data center 100 (e.g., host memories, reconfigurable processor memories, NIC memories, SmartNIC memories, PCIe bus memories, DDR channel memories, DIMM memories, etc.).
  • the buffers are programmable and can be allocated by specifying physical memory addresses.
  • the physical memory addresses of the buffers specify memory locations of the buffers.
  • the physical memory addresses of the buffers can be designated by the host processors and/or by the reconfigurable processors.
  • the configurations of the virtual buffers 1824c specify virtual memory segments of the buffers allocated for execution of the applications 1802 (e.g., the first and second plurality of buffers), including virtual address spaces (e.g., starting or base addresses) of the virtual memory segments and sizes of the virtual address spaces (e.g., sizes of the memory blocks in bytes).
  • the runtime processor 1832 maps the virtual address spaces of the virtual memory segments to physical address spaces of physical memory segments in memory where the allocated buffers are located.
• the memory can be host processor memory, reconfigurable processor memory (off-chip or on-chip), NIC memory, SmartNIC memory, PCIe memory, DMA memory, DIMM memory, or any other network component memory in the data center 100.
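• The mapping described above can be pictured with a small, hedged sketch: a table of virtual buffer segments translated to physical segments in a chosen memory. The Segment fields, segment names, and addresses below are illustrative assumptions, not the runtime processor's actual records.

```python
# Hypothetical sketch of mapping virtual buffer segments to physical
# segments; field names (memory, base, size) are assumptions for exposition.
from dataclasses import dataclass

@dataclass
class Segment:
    memory: str   # e.g., "SmartNIC 1 memory", "RP 1 memory", "host 1 memory"
    base: int     # starting address of the segment
    size: int     # size of the address space in bytes

virtual_segments = {
    "virtual SmartNIC buffers 1824c1": Segment("virtual", 0x0000, 4096),
    "virtual SmartNIC buffers 1824c2": Segment("virtual", 0x1000, 4096),
}

physical_segments = {
    "virtual SmartNIC buffers 1824c1": Segment("SmartNIC 1 memory", 0x8000_0000, 4096),
    "virtual SmartNIC buffers 1824c2": Segment("SmartNIC n memory", 0x9000_0000, 4096),
}

def translate(buffer_name: str, virtual_addr: int) -> tuple[str, int]:
    """Translate a virtual buffer address to (memory, physical address)."""
    v = virtual_segments[buffer_name]
    p = physical_segments[buffer_name]
    offset = virtual_addr - v.base
    assert 0 <= offset < v.size, "address outside the virtual segment"
    return p.memory, p.base + offset

print(translate("virtual SmartNIC buffers 1824c1", 0x0010))
```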
  • Figure 19 shows one implementation of memory mapping 1900 the virtual buffers 1824c to allocated buffers 1902/physical buffers 1902 located in respective physical memories of example reconfigurable dataflow resources such as SmartNIC one (SmartNIC 1) memory, SmartNIC two (SmartNIC 2) memory, reconfigurable processor one (RP 1) memory, reconfigurable processor two (RP 2) memory, PCIe 1 memory, DMA 1 memory, and host processor 1 memory.
• CSRs 1913, 1923, 1933, 1943, 1953, 1963, and 1973 in the allocated physical elements are programmed by the runtime logic to map the application virtual buffer addresses (e.g., SmartNIC 1 buffers 1912, SmartNIC 2 buffers 1922, RP 1 buffers 1932, RP 2 buffers 1942, PCIe 1 buffers 1952, DMA 1 buffers 1962, host 1 buffers 1972) to the allocated buffers 1902 in a contiguous physical memory space (e.g., SmartNIC 1 buffers 1914 (first range of physical memory addresses), SmartNIC 2 buffers 1924 (second range of physical memory addresses), RP 1 buffers 1934 (third range of physical memory addresses), RP 2 buffers 1944 (fourth range of physical memory addresses), PCIe 1 buffers 1954 (fifth range of physical memory addresses), DMA 1 buffers 1964 (sixth range of physical memory addresses), host 1 buffers 1974 (seventh range of physical memory addresses)).
  • the runtime processor 1832 configures control and status registers of the reconfigurable dataflow resources with configuration data (e.g., bit stream) identifying the mapping between the virtual address spaces and the physical address spaces for the configuration files 1822 to access the physical memory segments during execution of the applications 1802.
  • a first set of the physical memory segments mapped to buffers allocated to a first one of the applications 1802 are different from a second set of the physical memory segments mapped to buffers allocated to a second one of the applications 1802.
  • access of the buffers allocated to the first one of the applications 1802 is confined to the first set of the physical memory segments
  • access of the buffers allocated to the second one of the applications 1802 is confined to the second set of the physical memory segments.
  • the reconfigurable processors have respective pluralities of buffers for respective applications such that a first plurality of buffers can be used to stream configuration data (e.g., bit stream) to execute configuration files for a first application, a second plurality of buffers can be used to stream configuration data (e.g., bit stream) to execute configuration files for a second application, a third plurality of buffers can be used to stream configuration data (e.g., bit stream) to execute configuration files for a third application, and so on.
  • the configuration files for the first, second, and third applications can be executed in parallel or sequence using the first, second, and third plurality of buffers, respectively.
  • the configuration files for the first, second, and third applications can be executed, in parallel or in sequence, on a single reconfigurable processor using the first, second, and third plurality of buffers, respectively.
  • the configuration files for the first, second, and third applications can be executed, in parallel or in sequence, across reconfigurable processors on a same processing node using the first, second, and third plurality of buffers, respectively, such that, in some implementations, each of the first, second, and third plurality of buffers includes a set of sender (TX) buffers and receiver (RX) buffers for each reconfigurable processor or NIC or SmartNIC on the same processing node used to execute the configuration files.
  • the configuration files for the first, second, and third applications can be executed, in parallel or in sequence, across reconfigurable processors on different processing nodes using the first, second, and third plurality of buffers, respectively, such that, in some implementations, each of the first, second, and third plurality of buffers includes a set of sender (TX) buffers and receiver (RX) buffers for each reconfigurable processor or NIC or SmartNIC on the different processing nodes used to execute the configuration files.
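• A minimal sketch of the per-application sender (TX) and receiver (RX) buffer sets described above is given below, assuming a simple dictionary layout; the application names, device names, and two-buffers-per-direction choice are assumptions for illustration only.

```python
# Illustrative allocation of per-application TX and RX buffer sets, one
# pair per reconfigurable processor / NIC / SmartNIC used for execution.
from collections import defaultdict

def allocate_buffer_sets(applications, devices):
    """Return {application: {device: {"TX": [...], "RX": [...]}}}."""
    allocation = defaultdict(dict)
    for app in applications:
        for device in devices:
            allocation[app][device] = {
                "TX": [f"{app}/{device}/tx{i}" for i in range(2)],
                "RX": [f"{app}/{device}/rx{i}" for i in range(2)],
            }
    return allocation

alloc = allocate_buffer_sets(
    applications=["application 1", "application 2", "application 3"],
    devices=["RP N (node 1)", "SmartNIC 132a", "RP N (node n)", "SmartNIC 132n"],
)
print(alloc["application 1"]["SmartNIC 132a"]["TX"])
```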
  • the runtime processor 1832 runs on each host processor in the data center 100 and provides unified access to the pool of reconfigurable dataflow resources in the data center 100. Additional details about how the allocation logic 1844 spans the userspace and kernel space of a host processor on which a runtime processor or runtime logic runs can be found in US Nonprovisional Patent Application No. 16/922,975, filed July 7, 2020, entitled, “RUNTIME VIRTUALIZATION OF RECONFIGURABLE DATAFLOW RESOURCES,” (Attorney Docket No. SBNV 1026-1), which is incorporated herein by reference (specific reference is made to the runtime library 312, the kernel module 322, the resource manager 471, the device driver 474, and other allocation logic and components in the application incorporated by reference).
  • the runtime processor 1832 runs on each reconfigurable processor in the data center 100 and provides unified access to the pool of reconfigurable dataflow resources in the data center 100. In yet another implementation, the runtime processor 1832 runs as a hypervisor only on a subset of the host processors in the data center 100 (e.g., only on one host processor). In yet another implementation, the runtime processor 1832 runs as a hypervisor only on a subset of the reconfigurable processors in the data center 100 (e.g., only on one reconfigurable processor).
  • Figure 20 shows an architectural level schematic 2000 of one implementation of the data center 100 in which the processing nodes of the data center 100 do not include host processors.
  • the implementation shown in the architectural level schematic 2000 is configured to execute other implementations discussed in this application (e.g., intra-node processing, inter-node execution of configuration files), except that the other implementations are executed without using the host processors.
  • functionalities that are otherwise performed by host processors are instead performed by the reconfigurable processors in the data center 100.
• Some examples of functionalities performed by the reconfigurable processors in host-less implementations include hosting the compiler 1812, compiling the applications 1802, generating the configuration files 1822, generating configurations of the virtual dataflow resources 1824, hosting the runtime processor 1832, memory mapping, resource allocation (e.g., designating and allocating physical memory addresses of the buffers and other reconfigurable dataflow resources), execution of the configuration files 1822, parsing incoming network packets and running anomaly detection in ultra-low and deterministic latency, etc.
  • the functionalities that are otherwise performed by the host processors are obviated by other network components in the data center 100, for example, by the SmartNICs that comprise microcontrollers to locally trigger host-like commands without requiring an external host.
  • the runtime processor 1832 can be considered a distributed runtime processor, a distributed runtime logic, a distributed resource manager, and/or a distributed resource allocator that provides unified access to the pool of reconfigurable dataflow resources in the data center 100.
• The discussion now turns to the use of buffers to stream, over a network fabric, configuration data (e.g., bit stream) between reconfigurable processors that are on different processing nodes in the data center 100. This is referred to herein as "buffer-based inter-node streaming of configuration data (e.g., bit stream)."
  • Figure 21 is a message sequence chart 2100 illustrating one implementation of bufferbased inter-node streaming of configuration data (e.g., bit stream) over the network fabric 136.
• the buffers used for the inter-node streaming, i.e., the sender buffers 2176a, receiver buffers 2178a, sender buffers 2176n, and receiver buffers 2178n, are located in respective memories of the SmartNIC devices 132a and 132n.
  • these buffers can be located in any network component of the data center 100 (e.g., memories of host processors, memories of reconfigurable processors, memories of NIC devices, memories on PCIe buses, memories on DDR channels, memories of DIMMs, etc.).
  • the local buses 125a, 126a, 127a, 125n, 126n, and 127n and bus switches 124a and 124n that operatively couple reconfigurable processors on a same processing node to a host processor of the same processing node and to a NIC device or a SmartNIC device attached to the same processing node are PCIe buses 2132a, 2136a, 2132n, and 2136n and PCIe switches (PEX) 2112a, 2134a, 2112n, and 2134n, respectively.
  • PCIe protocol can be replaced by or supplemented with other bus protocols such as Cache Coherent Interconnect for Accelerators (CCIX), Compute Express Link (CXL), and Open Coherent Accelerator Processor Interface (OpenCAPI).
  • some preceding operations are omitted for the sake of clarity.
• some examples of the omitted operations include the applications 1802 requesting execution, the compiler 1812 compiling the applications 1802 and generating the configuration files 1822, the runtime processor 1832 allocating physical resources, i.e., reconfigurable dataflow resources, for execution of the configuration files 1822, and the runtime processor 1832 loading the configuration files 1822 on the allocated reconfigurable dataflow resources.
  • These omitted operations can be executed on any host processor or any reconfigurable processor in the data center 100.
  • the virtual dataflow resources 1824 and the virtual buffers 1824c are allocated reconfigurable dataflow resources of the processing node 1 and the processing node n in the data center 100.
  • the first virtual processing node 1824a is allocated the processing node 1 (hereinafter “a first processing node”).
• the first virtual reconfigurable processor 1824a1 is allocated reconfigurable processor N (RP N) on the processing node 1 (hereinafter "a first reconfigurable processor").
• the second virtual processing node 1824b is allocated the processing node n (hereinafter "a second processing node").
• the second virtual reconfigurable processor 1824b1 is allocated reconfigurable processor N (RP N) on the processing node n (hereinafter "a second reconfigurable processor").
• the first virtual SmartNIC buffers 1824c1 are allocated the sender buffers 2176a and the receiver buffers 2178a (hereinafter "a first plurality of buffers").
  • the second virtual SmartNIC buffers 1824c2 are allocated the sender buffers 2176n and the receiver buffers 2178n (hereinafter “a second plurality of buffers”).
  • the first plurality of buffers includes a first set of sender buffers 2176a configured to receive data from the first reconfigurable processor and provide the data to a second set of receiver buffers 2178n in the second plurality of buffers.
  • the second set of receiver buffers 2178n are configured to provide the data to the second reconfigurable processor.
  • the second plurality of buffers includes a second set of sender buffers 2176n configured to receive data from the second reconfigurable processor and provide the data to a first set of receiver buffers 2178a in the first plurality of buffers.
  • the first set of receiver buffers 2178a are configured to provide the data to the first reconfigurable processor.
  • the runtime processor 1832 is configured to configure the first SmartNIC 132a with a routing table that specifies the first reconfigurable processor as a local reconfigurable processor, and the second reconfigurable processor as a destination reconfigurable processor.
  • the runtime processor 1832 is configured to configure the second SmartNIC 132n with a routing table that specifies the second reconfigurable processor as a local reconfigurable processor, and the first reconfigurable processor as a destination reconfigurable processor.
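• A hedged sketch of this routing-table configuration follows; the table fields (local, destination, destination_mac) are assumptions, since the text above only states that each SmartNIC is told its local and destination reconfigurable processors.

```python
# Sketch of the runtime processor programming each SmartNIC with a routing
# table naming its local RP and the destination RP; the format is assumed.
smartnic_routing_tables = {}

def configure_routing(smartnic, local_rp, destination_rp, destination_mac):
    smartnic_routing_tables[smartnic] = {
        "local": local_rp,
        "destination": destination_rp,
        "destination_mac": destination_mac,   # where to forward over the fabric
    }

configure_routing("SmartNIC 132a",
                  local_rp="RP N on processing node 1",
                  destination_rp="RP N on processing node n",
                  destination_mac="aa:bb:cc:00:00:02")
configure_routing("SmartNIC 132n",
                  local_rp="RP N on processing node n",
                  destination_rp="RP N on processing node 1",
                  destination_mac="aa:bb:cc:00:00:01")

print(smartnic_routing_tables["SmartNIC 132a"]["destination"])
```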
  • Figure 21 shows one implementation of how the runtime processor 1832 executes the configuration files 1822 on the first processing node (processing node 1) and the second processing node (processing node n).
• the execution includes streaming data (e.g., configuration data (e.g., bit stream) and application data (weights, coefficients, vectors, tensors, control data (e.g., control tokens), etc.)) for the configuration files 1822 that define the applications 1802 between the first reconfigurable processor and the second reconfigurable processor using one or more buffers in the first plurality of buffers and one or more buffers in the second plurality of buffers, such that the streaming bypasses the first host processor 102a and the second host processor 102n (as indicated by the dotted lines in Figure 21).
  • the message sequence chart 2100 can be executed without using host processors (e.g., as the host-less implementations discussed with respect to Figure 20). This saves latency and improves throughput, and also does not require any processing time needed on the first and second host processors 102a and 102n (e.g., for processing by their respective operating systems).
  • the execution includes streaming input data for the applications 1802 from the first reconfigurable processor to the second reconfigurable processor.
  • one or more of the sender buffers in the first set of sender buffers 2176a are configured to receive the input data from the first reconfigurable processor (operation one) and provide the input data to one or more receiver buffers in the second set of receiver buffers 2178n (operation two).
  • the first reconfigurable processor is configured to push the input data to the first SmartNIC 132a (e.g., via the PCIe Endpoint Port (EP) 2146a) (operation one).
• operation one is accomplished by an address generator of the first reconfigurable processor (e.g., an Address Generation and Coalescing Unit (AGCU)) writing the input data to physical memory addresses mapped to the sender buffers in the first set of sender buffers 2176a (e.g., via a hardware write (HWRITE) command).
  • the first SmartNIC 132a is configured to write the input data, after encapsulation, into the sender buffers in the first set of sender buffers 2176a.
  • the first SmartNIC 132a is configured to update tail pointers of the sender buffers in the first set of sender buffers 2176a in response to the writing of the input data. In one implementation, the first SmartNIC 132a is configured to process the input data as payload 2156a, apply encapsulation, store it in caches 2186a, and stream it to the second SmartNIC 132n over the network fabric 136 (e.g., via the MAC port 2196a).
  • operations one and six comprise streaming network packets between the first reconfigurable processor and the first SmartNIC 132a over the local buses PCIe 2132a and 2136a using a protocol like Transaction Layer Packet (TLP) (e.g., 2120a, 2128a).
  • operation two comprises streaming network packets from the first SmartNIC 132a to the second SmartNIC 132n over the network fabric 136 (e.g., Ethernet, InfiniBand (IB)) using protocols like RDMA over Converged Ethernet (RoCE), TCP, User Datagram Protocol (UDP), and Quick UDP Internet Connections (QUIC) (e.g., 2198a, 2198n).
  • the receiver buffers in the second set of receiver buffers 2178n are configured to provide the input data to the second reconfigurable processor (operation three).
• operation three is accomplished by an address generator of the second reconfigurable processor (e.g., an Address Generation and Coalescing Unit (AGCU)) reading the input data from physical memory addresses mapped to the receiver buffers in the second set of receiver buffers 2178n (e.g., via a hardware read (HWREAD) command).
  • the first SmartNIC 132a is configured to send the input data to the second SmartNIC 132n in response to the updated tail pointers.
  • the second SmartNIC 132n is configured to write the input data, after decapsulation, into the receiver buffers in the second set of receiver buffers 2178n. In one implementation, the second SmartNIC 132n is configured to update tail pointers of the receiver buffers in the second set of receiver buffers 2178n in response to the writing of the input data.
  • the second reconfigurable processor is configured to pull the input data from the second SmartNIC 132n (e.g., via the PCIe Endpoint Port (EP) 2146n) by reading the input data from the receiver buffers in the second set of receiver buffers 2178n in response to the updated tail pointers.
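• The tail-pointer handshake used in operations one through three can be sketched, under simplifying assumptions, as a ring of slots whose tail pointer is advanced by the writer and whose head pointer is advanced by the reader; the pointer semantics below are illustrative, not the devices' actual register behavior.

```python
# Rough sketch of the tail-pointer handshake: the sender side writes data
# and advances a tail pointer; the forwarding and receive sides act
# "in response to the updated tail pointers".
class PointerBuffer:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.capacity = capacity
        self.head = 0   # next slot the consumer reads
        self.tail = 0   # next slot the producer writes

    def push(self, item):
        assert (self.tail + 1) % self.capacity != self.head, "buffer full"
        self.slots[self.tail] = item
        self.tail = (self.tail + 1) % self.capacity   # updated tail signals new data

    def has_data(self):
        return self.head != self.tail

    def pull(self):
        assert self.has_data(), "buffer empty"
        item = self.slots[self.head]
        self.head = (self.head + 1) % self.capacity
        return item

sender_buffers_2176a = PointerBuffer()
receiver_buffers_2178n = PointerBuffer()

sender_buffers_2176a.push("input data chunk")                 # operation one (RP -> SmartNIC 132a)
while sender_buffers_2176a.has_data():                        # SmartNIC reacts to the updated tail
    receiver_buffers_2178n.push(sender_buffers_2176a.pull())  # operation two (over the fabric)
print(receiver_buffers_2178n.pull())                          # operation three (RP on node n reads)
```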
  • the execution includes streaming output data for the applications 1802 from the second reconfigurable processor to the first reconfigurable processor.
  • the output data is generated as a result of processing the input data (e.g., processing of the input data by the second reconfigurable processor).
  • one or more of the sender buffers in the second set of sender buffers 2176n are configured to receive the output data from the second reconfigurable processor (operation four) and provide the output data to one or more receiver buffers in the first set of receiver buffers 2178a (operation five).
  • the second reconfigurable processor is configured to push the output data to the second SmartNIC 132n (e.g., via the PCIe Endpoint Port (EP) 2146n) (operation four).
• operation four is accomplished by an address generator of the second reconfigurable processor (e.g., an Address Generation and Coalescing Unit (AGCU)) writing the output data to physical memory addresses mapped to the sender buffers in the second set of sender buffers 2176n (e.g., via a hardware write (HWRITE) command).
  • the second SmartNIC 132n is configured to write the output data, after encapsulation, into the sender buffers in the second set of sender buffers 2176n.
  • the second SmartNIC 132n is configured to update tail pointers of the sender buffers in the second set of sender buffers 2176n in response to the writing of the output data. In one implementation, the second SmartNIC 132n is configured to process the output data as payload 2156n, apply encapsulation, store it in caches 2186n, and stream it to the first SmartNIC 132a over the network fabric 136 (e.g., via the MAC port 2196n).
• operations three and four comprise streaming network packets between the second reconfigurable processor and the second SmartNIC 132n over the local buses PCIe 2132n and 2136n using a protocol like Transaction Layer Packet (TLP) (e.g., 2120n, 2128n).
  • operation five comprises streaming network packets from the second SmartNIC 132n to the first SmartNIC 132a over the network fabric 136 (e.g., Ethernet, InfiniBand (IB)) using protocols like RDMA over Converged Ethernet (RoCE), TCP, User Datagram Protocol (UDP), and Quick UDP Internet Connections (QUIC) (e.g., 2198a, 2198n).
  • the receiver buffers in the first set of receiver buffers 2178a are configured to provide the output data to the first reconfigurable processor (operation six).
• operation six is accomplished by an address generator of the first reconfigurable processor (e.g., an Address Generation and Coalescing Unit (AGCU)) reading the output data from physical memory addresses mapped to the receiver buffers in the first set of receiver buffers 2178a (e.g., via a hardware read (HWREAD) command).
  • the second SmartNIC 132n is configured to send the output data to the first SmartNIC 132a in response to the updated tail pointers.
  • the first SmartNIC 132a is configured to write the output data, after decapsulation, into the receiver buffers in the first set of receiver buffers 2178a. In one implementation, the first SmartNIC 132a is configured to update tail pointers of the receiver buffers in the first set of receiver buffers 2178a in response to the writing of the output data.
  • the first reconfigurable processor is configured to pull the output data from the first SmartNIC 132a (e.g., via the PCIe Endpoint Port (EP) 2146a) by reading the output data from the receiver buffers in the first set of receiver buffers 2178a in response to the updated tail pointers.
  • the first reconfigurable processor notifies the second reconfigurable processor of remote invocations using one or more remote procedure calls.
  • the first reconfigurable processor uses the sender buffers in the first set of sender buffers 2176a and the receiver buffers in the second set of receiver buffers 2178n to send, over the network fabric 136, one or more argument values to the second reconfigurable processor for execution of the remote procedure calls (similar to operation 2 in Figure 21).
  • the second reconfigurable processor notifies the first reconfigurable processor of remote invocations using one or more remote procedure calls.
  • the second reconfigurable processor uses the sender buffers in the second set of sender buffers 2176n and the receiver buffers in the first set of receiver buffers 2178a to send, over the network fabric 136, one or more argument values to the first reconfigurable processor for execution of the remote procedure calls (similar to operation 5 in Figure 21).
  • Figure 22 is a message sequence chart 2200 illustrating another implementation of buffer-based inter-node streaming of configuration data (e.g., bit stream) over the network fabric 136.
  • Figure 22 shows another implementation of how the runtime processor 1832 executes the configuration files 1822 on the first processing node (processing node 1) and the second processing node (processing node n).
• the execution includes streaming data (e.g., configuration data (e.g., bit stream) and application data (weights, coefficients, vectors, tensors, control data (e.g., control tokens), etc.)) for the configuration files 1822 that define the applications 1802 between the first reconfigurable processor and the second host processor 102n using one or more buffers in the first plurality of buffers and one or more buffers in the second plurality of buffers, such that the streaming bypasses the first host processor 102a (as indicated by the dotted lines in Figure 22).
  • This saves latency and improves throughput, and also does not require any processing time needed on the first host processor 102a (e.g., for processing by its operating system).
  • the execution includes streaming input data for the applications 1802 from the first reconfigurable processor to the second host processor 102n.
  • one or more of the sender buffers in the first set of sender buffers 2176a are configured to receive the input data from the first reconfigurable processor (operation one) and provide the input data to one or more receiver buffers in the second set of receiver buffers 2178n (operation two).
  • the first reconfigurable processor is configured to push the input data to the first SmartNIC 132a (e.g., via the PCIe endpoint port (EP) 2146a) (operation one).
• operation one is accomplished by an address generator of the first reconfigurable processor (e.g., an Address Generation and Coalescing Unit (AGCU)) writing the input data to physical memory addresses mapped to the sender buffers in the first set of sender buffers 2176a (e.g., via a hardware write (HWRITE) command).
  • the first SmartNIC 132a is configured to write the input data, after encapsulation, into the sender buffers in the first set of sender buffers 2176a.
  • the first SmartNIC 132a is configured to update tail pointers of the sender buffers in the first set of sender buffers 2176a in response to the writing of the input data. In one implementation, the first SmartNIC 132a is configured to process the input data as payload 2156a, apply encapsulation, store it in caches 2186a, and stream it to the second SmartNIC 132n over the network fabric 136 (e.g., via the MAC port 2196a).
  • operations one and six comprise streaming network packets between the first reconfigurable processor and the first SmartNIC 132a over the local buses PCIe 2132a and 2136a using a protocol like Transaction Layer Packet (TLP) (e.g., 2120a, 2128a).
  • operation two comprises streaming network packets from the first SmartNIC 132a to the second SmartNIC 132n over the network fabric 136 (e.g., Ethernet, InfiniBand (IB)) using protocols like RDMA over Converged Ethernet (RoCE), TCP, User Datagram Protocol (UDP), and Quick UDP Internet Connections (QUIC) (e.g., 2198a, 2198n).
  • the receiver buffers in the second set of receiver buffers 2178n are configured to provide the input data to the second host processor 102n (operation three).
• operation three is accomplished by an address generator of the second host processor 102n (e.g., the second host processor reads DMAed data once the DMA operation is complete) reading the input data from physical memory addresses mapped to the receiver buffers in the second set of receiver buffers 2178n (e.g., via a hardware read (HWREAD) command).
  • the first SmartNIC 132a is configured to send the input data to the second SmartNIC 132n in response to the updated tail pointers.
  • the second SmartNIC 132n is configured to write the input data, after decapsulation, into the receiver buffers in the second set of receiver buffers 2178n. In one implementation, the second SmartNIC 132n is configured to update tail pointers of the receiver buffers in the second set of receiver buffers 2178n in response to the writing of the input data.
  • the second host processor 102n is configured to pull the input data from the second SmartNIC 132n (e.g., via the PCIe Endpoint Port (EP) 2146n) by reading the input data from the receiver buffers in the second set of receiver buffers 2178n in response to the updated tail pointers.
• in some implementations, the second SmartNIC 132n DMAs the payload into the host 102n memory 134n and then notifies the host via a DMA completion mechanism.
  • the execution includes streaming output data for the applications 1802 from the second host processor 102n to the first reconfigurable processor.
  • the output data is generated as a result of processing the input data (e.g., processing of the input data by the second host processor 102n).
  • one or more of the sender buffers in the second set of sender buffers 2176n are configured to receive the output data from the second host processor 102n (operation four) and provide the output data to one or more receiver buffers in the first set of receiver buffers 2178a (operation five).
• the second host processor 102n is configured to push the output data to the second SmartNIC 132n (e.g., via the PCIe Endpoint Port (EP) 2146n) (operation four). In some implementations, operation four is accomplished by a DMA operation.
  • the second SmartNIC 132n is configured to write the output data, after encapsulation, into the sender buffers in the second set of sender buffers 2176n.
  • the second SmartNIC 132n is configured to update tail pointers of the sender buffers in the second set of sender buffers 2176n in response to the writing of the output data.
  • the second SmartNIC 132n is configured to process the output data as payload 2156n, apply encapsulation, store it in caches 2186n, and stream it to the first SmartNIC 132a over the network fabric 136 (e.g., via the MAC port 2196n).
• operations three and four comprise streaming network packets between the second host processor 102n and the second SmartNIC 132n over the local buses PCIe 2132n and 2136n using a protocol like Transaction Layer Packet (TLP) (e.g., 2120n, 2128n).
  • operation five comprises streaming network packets from the second SmartNIC 132n to the first SmartNIC 132a over the network fabric 136 (e.g., Ethernet, InfiniBand (IB)) using protocols like RDMA over Converged Ethernet (RoCE), TCP, User Datagram Protocol (UDP), and Quick UDP Internet Connections (QUIC) (e.g., 2198a, 2198n).
  • the receiver buffers in the first set of receiver buffers 2178a are configured to provide the output data to the first reconfigurable processor (operation six).
• operation six is accomplished by an address generator of the first reconfigurable processor (e.g., an Address Generation and Coalescing Unit (AGCU)) reading the output data from physical memory addresses mapped to the receiver buffers in the first set of receiver buffers 2178a (e.g., via a hardware read (HWREAD) command).
  • the second SmartNIC 132n is configured to send the output data to the first SmartNIC 132a in response to the updated tail pointers.
  • the first SmartNIC 132a is configured to write the output data, after decapsulation, into the receiver buffers in the first set of receiver buffers 2178a. In one implementation, the first SmartNIC 132a is configured to update tail pointers of the receiver buffers in the first set of receiver buffers 2178a in response to the writing of the output data.
  • the first reconfigurable processor is configured to pull the output data from the first SmartNIC 132a (e.g., via the PCIe Endpoint Port (EP) 2146a) by reading the output data from the receiver buffers in the first set of receiver buffers 2178a in response to the updated tail pointers.
  • the first reconfigurable processor notifies the second host processor 102n of remote invocations using one or more remote procedure calls.
  • the first reconfigurable processor uses the sender buffers in the first set of sender buffers 2176a and the receiver buffers in the second set of receiver buffers 2178n to send, over the network fabric 136, one or more argument values to the second host processor 102n for execution of the remote procedure calls (similar to operation 2 in Figure 22).
  • the second host processor 102n notifies the first reconfigurable processor of remote invocations using one or more remote procedure calls.
  • the second host processor 102n uses the sender buffers in the second set of sender buffers 2176n and the receiver buffers in the first set of receiver buffers 2178a to send, over the network fabric 136, one or more argument values to the first reconfigurable processor for execution of the remote procedure calls (similar to operation 5 in Figure 22).
  • the technology disclosed allows a remote entity which executed the remote procedure call to produce one or more result values and send them back to the remote caller using a distinct set of buffers.
  • the two communicating entities may designate two buffer queues, one in each direction. The caller will send the data by copying it into a first buffer queue. The receiver will pull the data out of the first buffer queue, compute an operation, and then place the result in a second buffer queue. The original caller will simply wait until the second buffer queue has data available and will be able to use the result computed remotely as soon as it arrives over the second buffer queue.
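• A minimal sketch of this two-queue remote procedure call pattern, with in-process queues standing in for the buffer queues and an assumed placeholder computation, is shown below.

```python
# Sketch of the two-queue RPC pattern described above: the caller copies
# arguments into a first queue, the remote entity computes and places the
# result in a second queue, and the caller waits on the second queue.
import queue
import threading

request_queue = queue.Queue()    # first buffer queue (caller -> callee)
response_queue = queue.Queue()   # second buffer queue (callee -> caller)

def remote_entity():
    args = request_queue.get()        # pull the argument values
    result = sum(args)                # placeholder for the remote computation
    response_queue.put(result)        # place the result in the second queue

threading.Thread(target=remote_entity).start()

request_queue.put([1, 2, 3])                    # caller sends argument values
print("remote result:", response_queue.get())   # caller blocks until the result arrives
```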
  • SmartNICs can be replaced by NICs, which can be controlled by NIC DMAs or via the host processors to implement the flows illustrated in Figures 21 and 22 (e.g., updating the head and tail pointers of the buffers).
  • operations two and five of Figures 21 and 22 are executed by the first and second host processors 102a and 102n by initiating Remote DMA (RDMA) of the networking packets between the first NIC 132a and the second NIC 132n, and updating the corresponding tail pointers of the buffers upon arrival of the network packets.
  • the SmartNICs and the NICs are embedded on-chip on the reconfigurable processors.
  • Figure 23 illustrates one implementation of executing 2300 a model/application in parallel using the disclosed buffer-based inter-node streaming of configuration data (e.g., bit stream) over the network fabric 136. This is referred to herein as “model parallelism.”
  • Application 2302 is a dataflow graph with a set of processing modules (e.g., processing modules 1 to 5). Examples of the processing modules include neurons or layers of deep neural networks.
  • the runtime processor 1832 is configured to partition the set of processing modules into a first subset of processing modules 2304a and a second subset of processing modules 2304b.
  • the runtime processor 1832 is configured to execute configuration files 2322a for the first subset of processing modules 2304a on the first reconfigurable processor (e.g., RP N from the RPs 142a on the processing node 1).
  • the runtime processor 1832 is configured to execute configuration files 2322b for the second subset of processing modules 2304b on the second reconfigurable processor (e.g., RP N from the RPs 142n on the processing node n).
  • Deep neural network training implemented, for example, by Stochastic Gradient Descent (SGD) comprises a forward pass and a backward pass.
  • the backward pass comprises a delta pass and a chain pass.
  • the forward pass propagates activations in a forward direction.
  • the delta pass propagates deltas in a backward direction.
  • the chain pass calculates gradients based on the deltas as the deltas are generated in the delta pass.
  • the runtime processor 1832 is configured to use the first plurality of buffers 2176a, 2178a and the second plurality of buffers 2176n, 2178n to stream data between the first subset of processing modules 2304a and the second subset of processing modules 2304b.
  • the data includes feature maps and/or activations generated during a forward pass, and parameter gradients generated during a backward pass.
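• For illustration only, the model-parallel split described above can be sketched as follows; the partition point, the placeholder module functions, and the string-based "tensors" are assumptions.

```python
# Illustrative sketch of model parallelism: processing modules split across
# two reconfigurable processors, with activations flowing forward and
# deltas/gradients flowing backward through the buffers.
processing_modules = ["module 1", "module 2", "module 3", "module 4", "module 5"]

# Partition the set of processing modules into two subsets.
first_subset = processing_modules[:3]    # runs on the first RP (node 1)
second_subset = processing_modules[3:]   # runs on the second RP (node n)

def run_subset(subset, tensor):
    for module in subset:
        tensor = f"{module}({tensor})"   # placeholder for each module's compute
    return tensor

# Forward pass: activations stream node 1 -> node n through the buffers.
activations = run_subset(first_subset, "input batch")
output = run_subset(second_subset, activations)

# Backward pass: deltas/gradients stream node n -> node 1 through the buffers.
deltas = f"delta({output})"
gradients_for_first_subset = run_subset(list(reversed(second_subset)), deltas)
print(output)
print(gradients_for_first_subset)
```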
  • Figure 24 illustrates one implementation of executing 2400 multiple instances of a model/application in parallel using the disclosed buffer-based inter-node streaming of configuration data (e.g., bit stream) over the network fabric 136. This is referred to herein as “data parallelism.”
  • the runtime processor 1832 is configured to initialize a first instance of the dataflow graph 2404a and a second instance of the dataflow graph 2404b.
  • the runtime processor 1832 is configured to execute configuration files 2422a for the first instance 2404a of the dataflow graph on the first reconfigurable processor (e.g., RP N from the RPs 142a on the processing node 1).
  • the runtime processor 1832 is configured to execute configuration files 2422b for the second instance 2404b of the dataflow graph on the second reconfigurable processor (e.g., RP N from the RPs 142n on the processing node n).
  • the runtime processor 1832 is configured to use the first plurality of buffers 2176a, 2178a and the second plurality of buffers 2176n, 2178n to stream data between the first instance of the dataflow graph and the second instance of the dataflow graph.
  • the data includes gradients generated during the backward pass.
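• A rough, hedged sketch of this data-parallel gradient exchange follows; the gradient values and the averaging rule are assumptions used only to show where the buffer-based exchange fits.

```python
# Sketch of data parallelism: two instances of the same dataflow graph train
# on different data shards and exchange gradients through the buffer pairs
# during the backward pass.
def backward_pass(instance_name, shard):
    # Placeholder gradient: in practice this comes from the delta/chain pass.
    return [x * 0.1 for x in shard]

grads_instance_1 = backward_pass("dataflow graph 2404a", [1.0, 2.0, 3.0])  # RP on node 1
grads_instance_2 = backward_pass("dataflow graph 2404b", [4.0, 5.0, 6.0])  # RP on node n

# Gradients are streamed between the instances (e.g., via buffers 2176a/2178n
# and 2176n/2178a) and combined so both instances apply the same update.
combined = [(g1 + g2) / 2 for g1, g2 in zip(grads_instance_1, grads_instance_2)]
print("synchronized gradients:", combined)
```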
  • Figure 25 illustrates one implementation of executing 2500 configuration files on heterogeneous reconfigurable processors (e.g., RP 1 and RP 2 in Figure 25).
• heterogeneous reconfigurable processors include Central Processing Units (CPUs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Coarse-Grained Reconfigurable Architectures (CGRAs), Application-Specific Integrated Circuits (ASICs), Application Specific Instruction-set Processors (ASIPs), and Digital Signal Processors (DSPs).
  • the heterogeneous reconfigurable processors have different levels of configurable granularity.
  • the runtime processor 1832 is configured to receive a set of configuration files (e.g., 1822) for an application (e.g., 1802).
  • the runtime processor 1832 is configured to load and execute a first subset of configuration files 2502a in the set of configuration files on a first reconfigurable processor (RP 1) in the heterogeneous reconfigurable processors.
  • the first reconfigurable processor has a first configuration and/or a first level of configurable granularity.
  • the runtime processor 1832 is configured to load and execute a second subset of configuration files 2502b in the set of configuration files on a second reconfigurable processor (RP 2) in the heterogeneous reconfigurable processors.
  • the second reconfigurable processor has a second configuration and/or a second level of configurable granularity that is different from the first configuration and/or the first level of configurable granularity.
• the first configuration is a bit-level configurable granularity, and the first reconfigurable processor is a Field-Programmable Gate Array (FPGA).
• the second configuration is a word-level configurable granularity, and the second reconfigurable processor is a Coarse-Grained Reconfigurable Architecture (CGRA).
• the first configuration is a gate-level reconfigurability, and the first reconfigurable processor is the FPGA.
  • the second configuration is a register transfer-level reconfigurability, and the second reconfigurable processor has the CGRA.
• the first configuration uses bit-wise Look-Up Tables (LUTs) and switches, and the first reconfigurable processor is the FPGA.
• the second configuration uses word-wide Issue Slots (ISs)/Arithmetic Logic Units (ALUs)/Functional Units (FUs)/Processing Elements (PEs), Register Files (RFs), and interconnections, and the second reconfigurable processor has the CGRA.
  • a number of the ISs used by the second reconfigurable processor is fewer than a number of the LUTs used by the first reconfigurable processor.
  • a number of bits required to configure the second reconfigurable processor is orders of magnitude smaller than a number of bits required to configure the first reconfigurable processor.
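• One hedged way to picture this heterogeneity is a dispatch step that matches each configuration-file subset to a processor of the corresponding configurable granularity; the granularity tags and the loader function below are assumptions, not the runtime processor's actual interface.

```python
# Hypothetical dispatch of configuration-file subsets to heterogeneous
# reconfigurable processors by configurable granularity.
configuration_files = [
    {"name": "configuration files 2502a", "granularity": "bit-level"},   # e.g., FPGA bitstream
    {"name": "configuration files 2502b", "granularity": "word-level"},  # e.g., CGRA configuration
]

reconfigurable_processors = {
    "RP 1": "bit-level",    # FPGA-style: LUTs and switches
    "RP 2": "word-level",   # CGRA-style: issue slots, ALUs/FUs/PEs, register files
}

def load_and_execute(config, rp):
    print(f"loading {config['name']} onto {rp}")

for config in configuration_files:
    for rp, granularity in reconfigurable_processors.items():
        if granularity == config["granularity"]:
            load_and_execute(config, rp)
```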
  • Figure 26 illustrates one implementation of executing 2600 configuration files using NIC or SmartNIC devices that are embedded on the reconfigurable processors.
  • a first reconfigurable processor (e.g., RP N from the RPs 142a on the processing node 1) has a first Network Interface Controller (NIC), and the first NIC has a first plurality of buffers 2176a, 2178a.
  • a second reconfigurable processor (e.g., RP N from the RPs 142n on the processing node n) has a second NIC, and the second NIC has a second plurality of buffers 2176n, 2178n.
• the runtime processor 1832 is configured to execute the configuration files 1822 for the applications 1802 using the first reconfigurable processor and the second reconfigurable processor.
• the execution includes streaming data (e.g., configuration data (e.g., bit stream) and application data (weights, coefficients, vectors, tensors, control data (e.g., control tokens), etc.)) for the configuration files 1822 that define the applications 1802 between the first reconfigurable processor and the second reconfigurable processor using the first plurality of buffers of the first NIC and the second plurality of buffers of the second NIC.
  • Figure 27 is a diagram illustrating a system 2700 including a host 2720, a memory 2740, and an example reconfigurable data processor 2710 in which a computation unit as described herein is deployed by hardware or by configuration of reconfigurable components and configured with the virtualization logic 2797.
  • the reconfigurable data processor 2710 includes an array 2790 of configurable units and a configuration load/unload controller 2795.
  • the virtualization logic 2797 can include resources that support or enable simultaneous execution of multiple, unrelated application graphs (or related ones) in an array of configurable units on one die or one multichip module.
  • a first application graph is implemented in virtual machine VM1 in a particular set 2798 of configurable units
  • a second application graph is implemented in virtual machine VM2 in another set 2799 of configurable units.
  • Configurable units in an array 2790 of configurable units are further described in reference to Figures 30 and 31 and configured with the virtualization logic 2797.
  • Configurable units can include, or can have units configured to implement, a computation unit or computation units, as described herein.
  • the reconfigurable data processor 2710 includes an external I/O interface 2730 connected to the host 2720 by line 2725, and an external I/O interface 2750 connected to the memory 2740 by line 2745.
  • the I/O interfaces 2730, 2750 connect via a bus system 2715 to the array 2790 of configurable units and to the configuration load/unload controller 2795.
• the bus system 2715 may have a bus width capable of carrying one chunk of data, which for this example can be one hundred and twenty-eight bits (references to one hundred and twenty-eight bits throughout can be considered more generally as an example chunk size).
  • the host 2720 can send the configuration file to the memory 2740 via the I/O interface 2730, the bus system 2715, and the I/O interface 2750 in the reconfigurable data processor 2710.
  • the configuration file can be loaded in many ways, as suits a particular architecture, including in data paths outside the configurable processor 2710.
  • the configuration file can be retrieved from the memory 2740 via the memory I/O interface 2750. Chunks of the configuration file can then be sent in a distribution sequence to configurable units in the array 2790 of configurable units in the reconfigurable data processor 2710.
  • An external clock generator 2770 or other clock line sources can provide a clock line 2775 or clock lines to elements in the reconfigurable data processor 2710, including the array 2790 of configurable units, and the bus system 2715, and the external data I/O interfaces.
  • the bus system 2715 can communicate data at a processor clock rate via a clock line 2775 or clock lines.
• Figure 28 is a simplified block diagram 2800 of components of a CGRA (Coarse-Grained Reconfigurable Architecture) processor.
• the CGRA processor has two tiles (Tile1, Tile2).
  • the tile comprises an array of configurable units connected to a bus system, including array level networks in this example.
  • An array of configurable units (e.g., 2790, Figure 27) in the tile includes computation units in hardware or by configuration of reconfigurable components, which are configured with the virtualization logic 2797.
• the bus system includes a top-level network connecting the tiles to external I/O interface 2805 (or any number of interfaces). In other embodiments, different bus system configurations may be utilized.
  • the configurable units in each tile are nodes on the array level network in this embodiment.
  • Each of the tiles has four AGCUs (Address Generation and Coalescing Units) (e.g., MAGCU1, AGCU9, AGCU13, AGCU14).
  • the AGCUs are nodes on the top-level network and nodes on the array level networks and include resources for routing data among nodes on the top-level network and nodes on the array level network in each tile.
  • Nodes on the top-level network in this example include one or more external I/Os, including interface 2805.
  • the interfaces to external devices include resources for routing data among nodes on the top-level network and external devices, such as high-capacity memory, host processors, other CGRA processors, FPGA devices and so on, that are connected to the interfaces.
  • One of the AGCUs in a tile is configured in this example to be a master AGCU, which includes an array configuration load/unload controller for the tile.
  • more than one array configuration load/unload controller can be implemented, and one array configuration load/unload controller may be implemented by logic distributed among more than one AGCU.
• the MAGCU1 includes a configuration load/unload controller for Tile1
  • MAGCU2 includes a configuration load/unload controller for Tile2.
  • a configuration load/unload controller can be designed for loading and unloading configuration of more than one tile.
  • more than one configuration controller can be designed for configuration of a single tile.
  • the configuration load/unload controller can be implemented in other portions of the system, including as a stand-alone node on the toplevel network and the array level network or networks.
• the top-level network is constructed using top-level switches (2811, 2812, 2813, 2814, 2815, and 2816) connecting to each other as well as to other nodes on the top-level network, including the AGCUs, and I/O interface 2805.
• the top-level network includes links (e.g., L11, L9, L21, L22) connecting the top-level switches. Data travels in packets between the top-level switches on the links, and from the switches to the nodes on the network connected to the switches.
  • top-level switches 2811 and 2812 are connected by a link L14
• top-level switches 2814 and 2815 are connected by a link L9
• top-level switches 2811 and 2814 are connected by a link L13
  • top-level switches 2812 and 2813 are connected by a link L21.
  • the links can include one or more buses and supporting control lines, including for example a chunk-wide bus (vector bus).
  • the top-level network can include data, request and response channels operable in coordination for transfer of data in a manner analogous to an AXI compatible protocol. See, AMBA® AXI and ACE Protocol Specification, ARM.
  • Top-level switches can be connected to AGCUs.
• top-level switches 2811, 2812, 2814, and 2815 are connected to MAGCU1, AGCU9, AGCU13 and AGCU14 in the tile Tile1, respectively.
  • Top-level switches 2812, 2813, 2815, and 2816 are connected to MAGCU2, AGCU22, AGCU23 and AGCU24 in the tile Tile2, respectively.
  • Top-level switches can be connected to one or more external I/O interfaces (e.g., interface 2805).
  • Figure 29 is a simplified diagram of a tile and an array level network usable in the configuration of Figure 28, where the configurable units in the array are nodes on the array level network and are configurable to implement the virtualization logic 2797.
  • the array of configurable units 2900 includes a plurality of types of configurable units, which are configured with the virtualization logic 2797.
  • the types of configurable units in this example include Pattern Compute Units (PCUs), Pattern Memory Units (PMUs), Switch units (S), and Address Generation and Coalescing Units (each including two address generators AG and a shared CU).
  • Prabhakar et al., “Plasticine: A Reconfigurable Architecture For Parallel Patterns,” ISCA '17, June 24-28, 2017, Toronto, ON, Canada, which is incorporated by reference as if fully set forth herein.
  • the PCUs (e.g., 2942) and PMUs (e.g., 2943) in the array of configurable units 2900 can include resources configurable for embodiment of a computation unit, an example configuration of which is described herein.
  • Each of these configurable units contains a configuration store comprising a set of registers or flip-flops that represent either the setup or the sequence to run a program, and can include the number of nested loops, the limits of each loop iterator, the routes and/or instructions to be executed for each stage, the source of the operands, and the network parameters for the input and output interfaces.
  • the configuration file can include entries of Look-Up Tables as described herein.
  • each of these configurable units contains a configuration store comprising a set of registers or flip-flops that store status usable to track progress in nested loops or otherwise.
  • a configuration file in the configuration store contains a bit-stream representing the initial configuration, or starting state, of each of the components that execute the program. This bit-stream is referred to as a bit file.
  • Program load is the process of setting up the configuration stores in the array of configurable units based on the contents of the bit file to allow the components to execute a program (i.e., a machine), including programs that utilize the virtualization logic 2797. Program Load may also require the load of all PMU memories.
  • the array level network includes links interconnecting configurable units in the array.
  • the links in the array level network include one or more kinds of physical buses, in this case three: a chunk-level vector bus (e.g., one hundred and twenty-eight bits of data), a word-level scalar bus (e.g., thirty-two bits of data), and a multiple bit-level control bus.
  • interconnect 2921 between switch units 2911 and 2912 includes a vector bus interconnect with a vector bus width of one hundred and twenty-eight bits, a scalar bus interconnect with a scalar bus width of thirty-two bits, and a control bus interconnect.
  • the scalar bus can have a thirty-two-bit payload and carry scalar operands or control information.
  • data can be represented using floating point data formats, including standard or nonstandard formats.
  • Example formats include FP32 and BF16, among others. It can be understood that the number of data values carried on the scalar and vector buses is a function of the encoding format of the data values, with FP32 utilizing thirty-two bits per value and BF16 using sixteen bits per value.
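
As a rough arithmetic illustration of that relationship, the following minimal sketch computes how many values fit in one bus transfer, assuming the one-hundred-and-twenty-eight-bit vector bus and thirty-two-bit scalar bus described above (names and structure are illustrative only):

    # Values per bus transfer as a function of encoding format.
    VECTOR_BUS_BITS = 128
    SCALAR_BUS_BITS = 32

    FORMAT_BITS = {"FP32": 32, "BF16": 16}

    def values_per_transfer(bus_bits: int, fmt: str) -> int:
        return bus_bits // FORMAT_BITS[fmt]

    assert values_per_transfer(VECTOR_BUS_BITS, "FP32") == 4   # 4 FP32 values per vector chunk
    assert values_per_transfer(VECTOR_BUS_BITS, "BF16") == 8   # 8 BF16 values per vector chunk
    assert values_per_transfer(SCALAR_BUS_BITS, "FP32") == 1   # one scalar operand per transfer
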
  • the control bus can carry control handshakes such as tokens and other lines.
  • the vector and scalar buses can be packet switched, including headers that indicate a destination of each packet and other information such as sequence numbers that can be used to reassemble a file when the packets are received out of order.
  • Each packet header can contain a destination identifier that identifies the geographical coordinates of the destination switch unit (e.g., the row and column in the array), and an interface identifier that identifies the interface on the destination switch (e.g., North, South, East, West, etc.) used to reach the destination unit.
  • the control network can be circuit switched based on timing circuits in the device, for example.
  • the configuration load/unload controller can generate a header for each chunk of configuration data (e.g., bit stream) of one hundred and twenty-eight bits. The header is transmitted on a header bus to each configurable unit in the array of configurable units.
  • a chunk of data of one hundred and twenty-eight bits is transmitted on the vector bus that provides the chunk as vector inputs to a configurable unit.
  • the vector bus can include one hundred and twenty-eight payload lines, and a set of header lines.
  • the header can include a sequence ID for each chunk, which can include:
  • bits that form a chunk number, and bits that indicate a column identifier.
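
A minimal sketch of packing and unpacking such a header is shown below; the field widths, ordering, and names are assumptions for illustration only, not the layout used by the hardware:

    from dataclasses import dataclass

    # Hypothetical field widths; the actual header layout is device-specific.
    ROW_BITS, COL_BITS, IFACE_BITS, CHUNK_BITS = 6, 6, 3, 8

    @dataclass
    class ChunkHeader:
        row: int        # destination switch row
        col: int        # destination switch column
        interface: int  # interface on the destination switch (e.g., N/S/E/W)
        chunk: int      # chunk number within the unit file

        def pack(self) -> int:
            word = self.row
            word = (word << COL_BITS) | self.col
            word = (word << IFACE_BITS) | self.interface
            word = (word << CHUNK_BITS) | self.chunk
            return word

        @staticmethod
        def unpack(word: int) -> "ChunkHeader":
            chunk = word & ((1 << CHUNK_BITS) - 1); word >>= CHUNK_BITS
            iface = word & ((1 << IFACE_BITS) - 1); word >>= IFACE_BITS
            col = word & ((1 << COL_BITS) - 1); word >>= COL_BITS
            return ChunkHeader(row=word, col=col, interface=iface, chunk=chunk)

    hdr = ChunkHeader(row=3, col=5, interface=2, chunk=17)
    assert ChunkHeader.unpack(hdr.pack()) == hdr
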
  • the configuration unload controller can write the unload data out of order to the memory.
  • the shifting in the configuration serial chains in a configuration data (e.g., bit stream) store in a configurable unit is from LSB (Least-Significant-Bit) to MSB (Most-Significant-Bit), or MSB out first.
  • Figure 29B illustrates an example switch unit connecting elements in an array level network.
  • a switch unit can have eight interfaces.
  • the North, South, East and West interfaces of a switch unit are used for connections between switch units.
  • the Northeast, Southeast, Northwest and Southwest interfaces of a switch unit are each used to make connections to PCU or PMU instances.
  • a set of two switch units in each tile quadrant have connections to an Address Generation and Coalescing Unit (AGCU) that include multiple Address Generation (AG) units and a Coalescing Unit (CU) connected to the multiple address generation units.
  • the Coalescing Unit (CU) arbitrates between the AGs and processes memory requests.
  • Each of the eight interfaces of a switch unit can include a vector interface, a scalar interface, and a control interface to communicate with the vector network, the scalar network, and the control network.
  • data can be sent via one or more unit switches and one or more links between the unit switches to the configurable units using the vector bus and vector interface(s) of the one or more switch units on the array level network.
  • a configuration file or bit file, before configuration of the tile, can be sent from the configuration load controller using the same vector bus, via one or more unit switches and one or more links between the unit switches, to the configurable unit using the vector bus and vector interface(s) of the one or more switch units on the array level network.
  • a chunk of configuration data (e.g., bit stream) in a unit file particular to a configurable unit PMU 2941 can be sent from the configuration load/unload controller 2901 to the PMU 2941, via a link 2920 between the configuration load/unload controller 2901 and the West (W) vector interface of the switch unit 2911, the switch unit 2911, and a link 2931 between the Southeast (SE) vector interface of the switch unit 2911 and the PMU 2941.
  • one of the AGCUs is configured to be a master AGCU, which includes a configuration load/unload controller (e.g., 2901).
  • the master AGCU implements a register through which the host (2720, Figure 27) can send commands via the bus system to the master AGCU.
  • the master AGCU controls operations on an array of configurable units in a tile and implements a program control state machine to track the state of the tile based on the commands it receives from the host through writes to the register.
  • the master AGCU issues commands to all components on the tile over a daisy-chained command bus (Figure 30).
  • the commands include a program reset command to reset configurable units in an array of configurable units in a tile, and a program load command to load a configuration file to the configurable units.
  • the configuration load controller in the master AGCU is responsible for reading the configuration file from the memory and sending the configuration data (e.g., bit stream) to every configurable unit of the tile.
  • the master AGCU can read the configuration file from the memory, preferably at the maximum throughput of the top-level network.
  • the data read from memory are transmitted by the master AGCU over the vector interface on the array level network to the corresponding configurable unit according to a distribution sequence described herein.
  • configuration and status registers holding unit files to be loaded in a configuration load process, or unloaded in a configuration unload process in a component are connected in a serial chain and can be loaded through a process of shifting bits through the serial chain.
  • a configurable unit can require multiple chunks of data to load all its configuration bits.
  • the configurable units interface with the memory through multiple memory interfaces (2750, Figure 27). Each of the memory interfaces can be accessed using several AGCUs. Each AGCU contains a reconfigurable scalar data path to generate requests for the off-chip memory. Each AGCU contains FIFOs (First-In, First-Out buffers for organizing data) to buffer outgoing commands, data, and incoming responses from the off-chip memory.
  • Figure 30 is a block diagram illustrating an example configurable unit 3000, such as a Pattern Compute Unit (PCU), which is configured with the virtualization logic 2797.
  • a configurable unit can interface with the scalar, vector, and control buses, in this example using three corresponding sets of inputs and outputs (IO): scalar inputs/outputs, vector inputs/outputs, and control inputs/outputs.
  • Scalar IOs can be used to communicate single words of data (e.g., thirty-two bits).
  • Vector IOs can be used to communicate chunks of data (e.g., one hundred and twenty-eight bits), in cases such as receiving configuration data (e.g., bit stream) in a unit configuration load process and transmitting and receiving data during operation after configuration across a long pipeline between multiple PCUs.
  • Control IOs can be used to communicate signals on control lines such as the start or end of execution of a configurable unit. Control inputs are received by control block 3090, and control outputs are provided by the control block 3090.
  • Each vector input is buffered in this example using a vector FIFO in a vector FIFO block 3060 which can include one or more vector FIFOs.
  • each scalar input is buffered using a scalar FIFO 3070.
  • a configurable unit includes multiple reconfigurable data paths in block 3080.
  • a data path in a configurable unit can be organized as a multi-stage (Stage 1 . . . Stage N), reconfigurable SIMD (Single Instruction, Multiple Data) pipeline.
  • the chunks of data pushed into the configuration serial chain in a configurable unit include configuration data (e.g., bit stream) for each stage of each data path in the configurable unit.
  • the configuration serial chain in the configuration data (e.g., bit stream) store 3020 is connected to the multiple data paths in block 3080 via lines 3021.
  • a configurable data path organized as a multi-stage pipeline can include multiple functional units (e.g., 3081, 3082, 3083, 3084, 3085, 3086) at respective stages.
  • a computation unit or parts of a computation unit can be implemented in multiple functional units at respective stages in a multi-stage pipeline or in multiple multi-stage pipelines.
  • a circuit including the virtualization logic 2797 can be implemented in multiple functional units and multiple memory units.
  • Input registers in functional units can register inputs from scalar FIFOs 3070 or Vector FIFOs 3060 or from previous stages in a multi-stage pipeline.
  • a functional unit at a stage in a multi-stage pipeline can execute a function, e.g., logical shift, an arithmetic function, comparison, a logical operation, etc., and generate an output.
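
As a software analogy only (not the hardware configuration format), the following sketch pushes a vector of lane values through a chain of configured stages, with every lane applying the same configured operation at each stage:

    import operator

    # Each stage is configured with an operation and an immediate operand;
    # this is purely an illustrative model of a multi-stage SIMD pipeline.
    PIPELINE_CONFIG = [
        (operator.mul, 2),   # Stage 1: multiply by 2
        (operator.add, 3),   # Stage 2: add 3
        (max, 0),            # Stage 3: ReLU-style clamp at 0
    ]

    def run_pipeline(lane_inputs):
        """Push a vector of lane inputs through every configured stage."""
        values = list(lane_inputs)
        for op, imm in PIPELINE_CONFIG:
            values = [op(v, imm) for v in values]   # same op applied to every lane (SIMD)
        return values

    print(run_pipeline([-4, 0, 5]))   # prints [0, 3, 13]
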
  • Configurable units in the array of configurable units include configuration data (e.g., bit stream) stores 3020 (e.g., serial chains) to store unit files comprising a plurality of chunks (or sub-files of other sizes) of configuration data (e.g., bit stream) particular to the corresponding configurable units.
  • Configurable units in the array of configurable units each include unit configuration load logic 3040 connected to the configuration data (e.g., bit stream) store 3020 via line 3022, to execute a unit configuration load process.
  • the unit configuration load process includes receiving, via the bus system (e.g., the vector inputs), chunks of a unit file particular to the configurable unit and loading the received chunks into the configuration data (e.g., bit stream) store 3020 of the configurable unit.
  • the unit file loaded into the configuration data (e.g., bit stream) store 3020 can include configuration data (e.g., bit stream), including opcodes and routing configuration, for circuits (e.g., module) implementing the virtualization logic 2797 in multiple functional units and multiple memory units, as described herein.
  • the configuration data (e.g., bit stream) stores in configurable units in the plurality of configurable units in this example comprise serial chains of latches, where the latches store bits that control configuration of the resources in the configurable unit.
  • a serial chain in a configuration data (e.g., bit stream) store can include a shift register chain for configuration data (e.g., bit stream) and a second shift register chain for state information and counter values connected in series.
  • Input configuration data (e.g., bit stream) 3010 can be provided to a vector FIFO as vector inputs, and then be transferred to the configuration data (e.g., bit stream) store 3020.
  • Output configuration data (e.g., bit stream) 3030 can be unloaded from the configuration data (e.g., bit stream) store 3020 using the vector outputs.
  • the CGRA uses a daisy-chained completion bus to indicate when a load/unload command has been completed.
  • the master AGCU transmits the program load and unload commands to configurable units in the array of configurable units over a daisy-chained command bus.
  • a control block 3090, a daisy-chained completion bus 3091 and a daisy-chained command bus 3092 are connected to daisy-chain logic 3093, which communicates with the unit configuration load logic 3040.
  • the daisy-chain logic 3093 can include load complete status logic, as described below.
  • the daisy-chained completion bus is further described below. Other topologies for the command and completion buses are clearly possible but not described here.
  • Figure 31 is a block diagram illustrating an example configurable unit 3100, such as a Pattern Memory Unit (PMU), which is configured with the virtualization logic 2797 (i.e., ready-to-read credit counters, write credit counters, and flow control logic for operating them).
  • a PMU can contain scratchpad memory 3130 coupled with a reconfigurable scalar data path 3120 intended for address calculation (RA, WA) and control (WE, RE) of the scratchpad memory 3130, along with the bus interfaces used in the PCU.
  • the bus interfaces can include scalar inputs, vector inputs, scalar outputs and vector outputs, usable to provide write data WD.
  • the data path can be organized as a multi-stage reconfigurable pipeline, including stages of functional units FUs and associated pipeline registers PRs that register inputs and outputs of the functional units.
  • PMUs can be used to provide distributed on-chip memory throughout the array of reconfigurable units.
  • a scratchpad is built with multiple SRAM banks (e.g., 3131, 3132, 3133, 3134).
  • a computation unit as described herein can include a Look-Up Table stored in the scratchpad memory 3130, from a configuration file or from other sources.
  • the scalar data path 3120 can translate a section of a raw input value I for addressing Look-Up Tables implementing a function f(I), into the addressing format utilized by the SRAM scratchpad memory 3130, adding appropriate offsets and so on, to read the entries of the Look-Up Table stored in the scratchpad memory 3130 using the sections of the input value I.
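
The following sketch illustrates that kind of address translation; the bank count, table base offset, and which bits of the raw input value are used are all illustrative assumptions:

    # Hypothetical scratchpad parameters for illustration only.
    NUM_BANKS = 4
    TABLE_BASE_OFFSET = 0x40     # where the Look-Up Table starts in the scratchpad
    INDEX_BITS = 8               # how many bits of the raw input select a table entry

    def lut_address(raw_input: int) -> tuple[int, int]:
        """Translate a section of the raw input value I into (bank, local address)."""
        index = (raw_input >> 4) & ((1 << INDEX_BITS) - 1)  # take a section of I
        flat = TABLE_BASE_OFFSET + index                    # add the table's base offset
        bank = flat % NUM_BANKS                             # striped across SRAM banks
        local_addr = flat // NUM_BANKS
        return bank, local_addr

    print(lut_address(0x1234))   # e.g., (3, 24) with these illustrative parameters
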
  • Each PMU can include write address calculation logic and read address calculation logic that provide write address WA, write enable WE, read address RA and read enable RE to the banking buffering logic 3135.
  • the control block 3115 can be configured to trigger the write address computation, read address computation, or both, by enabling the appropriate counters 3116.
  • a programmable counter chain 3116 (Control Inputs, Control Outputs) and control block 3115 can trigger PMU execution.
  • the configurable processor can be configured in other ways to implement a computation unit.
  • Other types of configurable processors can implement the computation unit in other ways.
  • the computation unit can be implemented using dedicated logic in some examples, or a combination of dedicated logic and instruction-controlled processors.
  • when two or more reconfigurable processors collaboratively execute an application, the two or more reconfigurable processors are independently and separately configured (e.g., by the runtime processor) with a same set of configuration files.
  • when a first reconfigurable processor, configured with a given set of configuration files, begins executing configuration files in the given set of configuration files and/or functions therefor and/or data therefor, and requires a second reconfigurable processor, also configured with the given set of configuration files, to execute certain configuration files in the given set of configuration files and/or functions therefor and/or data therefor,
  • the second reconfigurable processor waits for a signal from the first reconfigurable processor.
  • Examples of the signal include a control signal that indicates a breakpoint/checkpoint after a quiesce condition, such as the one described in US Non-provisional Patent Application No. 16/504,627, filed July 8, 2019, entitled, “QUIESCE RECONFIGURABLE DATA PROCESSOR,” (Attorney Docket No. SBNV 1008-1).
  • the second reconfigurable processor begins execution of the certain configuration files and/or functions therefor and/or data therefor using its own copy of the given set of configuration files with which it is independently and separately configured.
  • a checkpoint is generated at the first reconfigurable processor, the checkpoint is transferred to the second reconfigurable processor, and the second reconfigurable processor loads the checkpoint and begins execution of the certain configuration files and/or functions therefor and/or data therefor.
  • a first example of accelerated deep learning is using a deep learning accelerator to train a neural network.
  • a second example of accelerated deep learning is using a deep learning accelerator to operate a trained neural network to perform inferences.
  • a third example of accelerated deep learning is using a deep learning accelerator to train a neural network and subsequently perform inference with any one or more of the trained neural network, information from same, and a variant of same.
  • neural networks include Fully Connected Neural Networks (FCNNs), Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, autoencoders, deep belief networks, and Generative Adversarial Networks (GANs).
  • An example of training a neural network is determining one or more weights associated with the neural network, such as by hardware acceleration via a deep learning accelerator.
  • An example of making an inference is using a trained neural network to compute results by processing input data based on weights associated with the trained neural network.
  • weight is an example of a ‘parameter’ as used in various forms of neural network processing. For example, some neural network learning is directed to determining parameters that are then usable for performing neural network inferences using the parameters.
  • a neural network processes data according to a dataflow graph comprising layers of neurons. Stimuli (e.g., input data) are received by an input layer of neurons and the computed results of the dataflow graph (e.g., output data) are provided by an output layer of neurons.
  • Example layers of neurons include input layers, output layers, rectified linear unit layers, fully connected layers, recurrent layers, long short-term memory layers, convolutional layers, kernel layers, dropout layers, and pooling layers.
  • a neural network is conditionally and/or selectively trained, subject to hardware acceleration. After being trained, a neural network is conditionally and/or selectively used for inference, subject to hardware acceleration.
  • An example of a deep learning accelerator is one or more relatively specialized hardware elements operating in conjunction with one or more software elements to train a neural network and/or perform inference with a neural network relatively more efficiently than using relatively less specialized hardware elements.
  • Some implementations of the relatively specialized hardware elements include one or more hardware logic circuitry elements such as transistors, resistors, inductors, capacitors, wire interconnects, combinatorial logic (e.g., NAND, NOR) gates, latches, register files, memory arrays, tags for memory arrays, content-addressable memories, flash, ROM, DRAM, SRAM, Serializer/Deserializer (SerDes), I/O drivers, and the like, such as implemented via custom logic, synthesized logic, ASICs, and/or FPGAs.
  • Some of the relatively less specialized hardware elements include conventional CPUs and conventional GPUs.
  • An example of storage is one or more elements enabled to retain state information, e.g., any one or more of: a flip-flop, a latch or an array of latches, a register or an array of registers, a register file, a memory, a memory array, a magnetic storage device, an optical storage device, SRAM, DRAM, flash, and ROM.
  • storage is volatile (e.g., SRAM or DRAM) and/or non-volatile (e.g., flash or ROM).
  • An example of an Integrated Circuit (IC) is a collection of circuitry implemented on one or more portions of semiconductor material, such as a single die or a plurality of dice.
  • An example of 3D-stacking of dice is providing mechanical connectivity and/or electrical connectivity between the dice, e.g., in a dimension orthogonal to a major surface of the dice, to form a unit.
  • the mechanical connectivity and/or the electrical connectivity are variously implemented, e.g., via one or more of solder balls, microbumps, and through-silicon vias.
  • An example of 2.5D stacking of dice is providing mechanical connectivity and/or electrical connectivity between the dice via a common element (e.g., a silicon interposer) to form a unit, wherein the mechanical connectivity and/or electrical connectivity between each die and the common substrate is in a dimension orthogonal to a major surface of the die.
  • the mechanical connectivity and/or the electrical connectivity are variously implemented, e.g., via one or more of solder balls, microbumps, and through-silicon vias.
  • An example of an Application-Specific Integrated Circuit (ASIC) is an IC designed for a particular use.
  • An example of a package is an element enabled to mechanically retain and/or contain one or more electronic circuits and/or to electrically interconnect one or more electronic circuits.
  • Example electronic circuits are any one or more of one or more portions of semiconductor material, one or more dice, one or more interposers, and one or more substrates.
  • Particular examples of packages include a BGA package and variants thereof.
  • Some ICs comprise a package.
  • An example of a substrate is an element to mechanically retain and/or electrically interconnect one or more dice and/or one or more packages.
  • a particular example of a substrate is a PCB to, e.g., retain and interconnect packages.
  • a substrate is a silicon interposer to, e.g., couple one or more 3D-stacked or 2.5D-stacked dice.
  • a substrate is a package, e.g., retaining a plurality of dice.
  • a SmartNIC is a network interface component, or network adapter, that operates directly on data packets independent of host kernel resources, running its own operating system networking stack, resulting in less contention for the host processing resources, less network latency, and increased network data packet throughput.
  • the SmartNIC accomplishes this by offloading certain IP network protocol stack processing tasks from the system host CPU, acting as a coprocessor of sorts.
  • a SmartNIC is housed on a printed circuit card plugged into a backplane, whereas in other embodiments it is integrated onto a motherboard which also supports a host CPU or one or more RPs.
  • the SmartNIC can be integrated onto a single chip with an RP or other component.
  • a SmartNIC is equipped with a fully programmable hardware implementation, supporting an operating system configured for network processing tasks.
  • the hardware implementation may comprise System-on-Chip (SoC), FPGAs, ASICs, CGRAs, or other programmable processor circuits such as the ARM family.
  • a SmartNIC may support sets of specialized hardware functionalities that accelerate a specific class of functions (e.g., Open vSwitch data-plane) or perform generic packet and flow-filtering, packet inspection, flow table processing, encryption, RDMA, VXLAN overlays and NVMe-oF functionality.
  • a SmartNIC may include host kernel-bypass logic for sending and receiving packets to/from nodes and additional hosts.
  • the SmartNIC may accomplish this by providing a set of physical addresses comprising a shared memory for inputs and outputs.
  • the reprogrammable processor may directly access sets of SmartNIC FIFO buffers using a combination of head and tail pointers as described supra to push and pull data, thus bypassing the host kernel and reducing at least one hop.
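
A minimal sketch of that head/tail-pointer discipline over a shared-memory FIFO is shown below; the capacity and the producer/consumer roles are illustrative assumptions:

    class SharedFifo:
        """Ring buffer indexed by head (read) and tail (write) pointers,
        modeling a SmartNIC FIFO in shared memory (illustrative only)."""

        def __init__(self, capacity: int):
            self.buf = [None] * capacity
            self.capacity = capacity
            self.head = 0   # next slot to read
            self.tail = 0   # next slot to write

        def is_empty(self): return self.head == self.tail
        def is_full(self):  return (self.tail + 1) % self.capacity == self.head

        def push(self, item):            # producer (e.g., an RP) advances the tail
            if self.is_full():
                raise BufferError("FIFO full: producer must wait")
            self.buf[self.tail] = item
            self.tail = (self.tail + 1) % self.capacity

        def pull(self):                  # consumer advances the head
            if self.is_empty():
                raise BufferError("FIFO empty: consumer must wait")
            item = self.buf[self.head]
            self.head = (self.head + 1) % self.capacity
            return item

    fifo = SharedFifo(capacity=8)
    fifo.push(b"packet-0")
    assert fifo.pull() == b"packet-0"
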
  • a host may also interface directly to the SmartNIC by writing to a physical address without requiring drivers to control the network flow, further increasing theoretical throughput.
  • the SmartNIC may provide a configuration interface to specify the physical addresses of a plurality of I/O shared memory buffers comprising FIFO queues and mapping tables for memory regions containing packet buffers.
  • the SmartNIC may couple nodes, reprogrammable processors (RPs) and hosts to retrieve packet buffers from shared memory buffers and to transmit packet buffers from host, node, or RP DRAM to the SmartNIC shared memory buffers over a network.
  • the network fabric is an interface to a plurality of nodes and hosts.
  • the SmartNIC provides connectivity between either a host and the network or between a node and the network.
  • a node comprises a plurality of reprogrammable processors (RPs) and bypasses the host when interfacing to the SmartNIC.
  • a SmartNIC may connect to a first physical/link connection over the network, coupling the SmartNIC with a host, node, or RP.
  • the SmartNIC connects to a second physical/link connection, coupling the SmartNIC to the network.
  • the physical/link connections to the network fabric interface may each be of any type, for instance, Ethernet, Fibre Channel, InfiniBand, PCIe, etc.
  • a physical/link connection may also be a wireless medium.
  • a SmartNIC includes Media Access Controllers (MACs) to interface with the physical/link connections to route data packets to the RPs and hosts.
  • An example SmartNIC may use an FPGA to implement the communications protocols, e.g., Transport Control Protocol (“TCP”), used to perform internet routing and may comprise PCIe high-speed network interfaces, shared physical memory and an FPGA.
  • the FPGA may implement the SmartNIC controller as the bridge between a host, node, RP, and the network at the “physical layer” to integrate directly into the data path.
  • the SmartNIC may further implement the Open System Interconnection (“OSI”) model, which is a conceptual model that characterizes and standardizes the internal functions of a communication system by partitioning it into abstraction layers.
  • a physical abstraction layer defines electrical and physical specifications between a device and a transmission medium, such as a copper or fiber optical cable.
  • the major functions and services performed by the physical layer include: (1) establishment and termination of a connection to a communications medium; (2) contention resolution; (3) flow control; and (4) modulation to convert digital data in user equipment to the corresponding signals transmitted over a communications channel. These are the signals operating over the physical cabling (such as copper and optical fiber) or over a radio link.
  • the network flows can be Transmission Control Protocol/Internet Protocol (TCP/IP) flows, for example.
  • the SmartNICs may exchange network packets with the nodes or hosts via a network/fabric comprising media/physical links and can exchange network packets with their respective nodes or hosts via host-facing media/physical links to the host NICs.
  • Network flows used by applications to exchange data may pass through the SmartNIC as follows.
  • a host-based application may have application-layer data to convey, for instance, a remote call invocation.
  • the host remote call invocation may comprise a command or data for passing through an operating system Application Programming Interface (API) (e.g., a stream or socket) as a write to a physical address on the SmartNIC where it enters the network stack.
  • the API writes the command or data into the physical address of the shared memory FIFO, where it is placed in one or more transport packets (e.g., TCP/IP packets) with the host's Internet Protocol (IP) address as the sender.
  • the frames then pass through to the first physical/link connection of the network fabric.
  • the above process is reversed where the network packets require decapsulation and data eventually arrives at a physical address for the host, node, or RP.
  • the applications execute on the reconfigurable processors in a distributed fashion by programming the individual compute and memory components and may asynchronously receive, process, and send data and control information.
  • computation may execute as deep, nested dataflow pipelines that exploit nested parallelism and data locality efficiently.
  • These dataflow pipelines contain several stages of computation, where each stage reads data from one or more input buffers with an irregular memory access pattern, performs computations on the data while using one or more internal buffers to store and retrieve intermediate results, and produces outputs that are written to one or more output buffers.
  • the structure of these pipelines depends on the control and dataflow graph representing the application. Pipelines may arbitrarily nest and loop within each other.
  • the applications/graphs/application graphs/user applications/dataflow graphs/control flow graphs/data and control flow graphs/models/deep learning applications/deep neural networks/programs/program images/jobs/tasks comprise high-level programs.
  • a high-level program is source code written in programming languages like C, C++, Java, JavaScript, Python, and Spatial, for example, using deep learning frameworks like PyTorch, TensorFlow, ONNX, Caffe, and Keras.
  • the high-level program can implement computing structures and algorithms of machine learning models like AlexNet, VGGNet, GoogLeNet, ResNet, ResNeXt, RCNN, YOLO, SqueezeNet, SegNet, GAN, BERT, ELMo, USE, Transformer, and Transformer-XL.
  • the high-level program can implement a convolutional neural network with several processing layers, such that each processing layer can include one or more nested loops.
  • the high-level program can execute irregular memory operations that involve accessing inputs and weights and performing matrix multiplications between the inputs and the weights.
  • the high-level program can include nested loops with high iteration count and loop bodies that load and multiply input values from a preceding processing layer with weights of a succeeding processing layer to produce an output for the succeeding processing layer.
  • the high-level program can have loop-level parallelism of the outermost loop body, which can be exploited using coarse-grained pipelining.
  • the high-level program can have instruction-level parallelism of the innermost loop body, which can be exploited using loop unrolling, SIMD vectorization, and pipelining.
  • loops directly nested in a loop body are termed the child loops of the outer parent loop.
  • a loop is called an innermost loop if it does not have any children, i.e., there are no nested loops within its body.
  • a loop is an outermost loop if it does not have a parent, i.e., it is not nested within another loop's body.
  • An imperfectly nested loop has a body with a mix of non-looping statements (e.g., primitive arithmetic, logical, and relational operations) and one or more child loops.
  • Parallelism in the imperfectly nested loops can be exploited at any or all loop levels, and in the operations that comprise loop bodies. Parallelism can occur in multiple forms such as fine-grained and coarse-grained pipeline parallelism, data parallelism, and task parallelism.
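
For instance, the loop nest below is imperfectly nested: the outer body mixes non-looping statements with a child loop, the inner loop is an innermost loop, and the outer loop is an outermost loop (illustrative code only, not taken from the applications described herein):

    def imperfect_nest(rows):
        total = 0
        for i in range(len(rows)):           # outermost loop (no parent)
            scale = i * 2                    # non-looping statement in the outer body
            for j in range(len(rows[i])):    # innermost loop (no children)
                total += scale * rows[i][j]
            total -= 1                       # another non-looping statement after the child loop
        return total
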
  • a Software Development Kit (SDK) (or dataflow graph generator) generates dataflow graphs of the high-level programs of the applications.
  • SDK transforms the input behavioral description of the high-level programs into an intermediate representation such as the dataflow graphs. This may include code optimization steps like false data dependency elimination, dead-code elimination, and constant folding.
  • the dataflow graphs encode the data and control dependencies of the high-level programs.
  • the dataflow graphs comprise nodes and edges. (In order to avoid confusion with the use of the term 'node' to refer herein to processing nodes, a graph node is sometimes referred to herein as a 'vertex'.) Graph nodes or vertices can represent compute operations and memory allocations. The edges can represent dataflow and control flow. In some implementations, each loop in the high-level programs can be represented as a controller in the dataflow graphs. The dataflow graphs support branches, loops, function calls, and other variations of control dependencies. In some implementations, after the dataflow graphs are generated, additional analyses or optimizations focused on loop transformations can be performed, such as loop unrolling, loop pipelining, loop fission/fusion, and loop tiling.
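
A toy representation of such a graph, in which vertices carry compute operations, memory allocations, or loop controllers and edges carry dataflow or control dependencies, might look like the following; this is an illustrative data structure, not the SDK's actual intermediate representation:

    from dataclasses import dataclass, field

    @dataclass
    class Vertex:
        name: str
        kind: str                 # "compute", "memory", or "controller" (one per loop)

    @dataclass
    class Edge:
        src: str
        dst: str
        kind: str                 # "data" or "control"

    @dataclass
    class DataflowGraph:
        vertices: dict = field(default_factory=dict)
        edges: list = field(default_factory=list)

        def add_vertex(self, v: Vertex): self.vertices[v.name] = v
        def add_edge(self, e: Edge): self.edges.append(e)

    g = DataflowGraph()
    g.add_vertex(Vertex("weights", "memory"))
    g.add_vertex(Vertex("matmul", "compute"))
    g.add_vertex(Vertex("outer_loop", "controller"))
    g.add_edge(Edge("weights", "matmul", "data"))
    g.add_edge(Edge("outer_loop", "matmul", "control"))
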
  • the SDK also supports programming the reconfigurable processors in the pool of reconfigurable dataflow resources at multiple levels, for example, from the high-level deep learning frameworks to C++ and assembly language.
  • the SDK allows programmers to develop code that runs directly on the reconfigurable processors.
  • the SDK provides libraries that contain pre-defined functions like linear algebra operations, element-wise tensor operations, non-linearities, and reductions required for creating, executing, and profiling the dataflow graphs on the reconfigurable processors.
  • the SDK communicates with the deep learning frameworks via Application Programming Interfaces (APIs).
  • APIs Application Programming Interfaces
  • the vertices in a dataflow graph represent operation units and may configure to be producers to produce tensors for execution of an application, and to be consumers to consume the tensors for execution of the application.
  • the producers and consumers asynchronously transmit data along data connections.
  • a tensor includes one or more vectors.
  • a “compiler” transforms the dataflow graphs into a hardware-specific configuration, which is specified in an execution file generated by the compiler 114.
  • the compiler partitions the dataflow graphs into memory allocations and execution fragments, where these partitions are specified in the execution file.
  • Execution fragments represent operations on data.
  • An execution fragment can comprise portions of a program representing an amount of work.
  • An execution fragment can comprise computations encompassed by a set of loops, a set of graph nodes, or some other unit of work that requires synchronization.
  • An execution fragment can comprise a fixed or variable amount of work, as needed by the program. Different ones of the execution fragments can contain different amounts of computation.
  • Execution fragments can represent parallel patterns or portions of parallel patterns and are executable asynchronously.
  • the partitioning of the dataflow graphs into the execution fragments includes treating calculations within at least one innermost loop of a nested loop of the dataflow graphs as a separate execution fragment. In other implementations, the partitioning of the dataflow graphs into the execution fragments includes treating calculations of an outer loop around the innermost loop of the dataflow graphs as a separate execution fragment. In the case of imperfectly nested loops, operations within a loop body up to the beginning of a nested loop within that loop body are grouped together as a separate execution fragment.
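
As a sketch of that partitioning rule (names and the loop nest are illustrative only), the innermost loop of a nest becomes one execution fragment while the surrounding outer-loop work becomes another:

    # Original nested loop (conceptually):
    #   for i in range(N):          # outer loop
    #       acc = bias[i]           # outer-loop work  -> execution fragment EF_outer
    #       for j in range(M):      # innermost loop   -> execution fragment EF_inner
    #           acc += w[i][j] * x[j]
    #       y[i] = acc              # outer-loop work  -> EF_outer

    execution_fragments = {
        "EF_inner": {"loops": ["j"], "ops": ["acc += w[i][j] * x[j]"]},
        "EF_outer": {"loops": ["i"], "ops": ["acc = bias[i]", "y[i] = acc"]},
    }
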
  • Memory allocations represent the creation of logical memory spaces in on-chip and/or off-chip memories for data required to implement the dataflow graphs, and these memory allocations are specified in the execution file. Memory allocations define the type and the number of hardware resources (functional units, storage, or connectivity components). Main memory (e.g., DRAM) is off-chip memory for providing memory allocations. Scratchpad memory (e.g., SRAM) is on-chip memory for providing memory allocations. Other memory types for which the memory allocations can be made for various access patterns and layouts include read-only Look-Up Tables (LUTs), fixed size queues (e.g., FIFOs), and register files.
  • the compiler binds memory allocations to virtual memory units and binds execution fragments to virtual compute units, and these bindings are specified in the execution file.
  • the compiler partitions execution fragments into memory fragments and compute fragments, and these partitions are specified in the execution file.
  • a memory fragment comprises address calculations leading up to a memory access.
  • a compute fragment comprises all other operations in the parent execution fragment.
  • each execution fragment is broken up into a plurality of memory fragments and exactly one compute fragment.
  • the compiler performs the partitioning using reverse dataflow analysis such that inputs to an address used in a memory access are recursively flagged until the compiler reaches either constant values or (bound) loop/pattern iterators.
  • a single execution fragment can produce one or more memory fragments, depending on how many memory accesses exist in the original loop body. In cases where the same memory addressing logic is shared across multiple memory accesses, address calculation may be duplicated to create multiple memory fragments from the same execution fragment.
  • the memory fragments of the execution fragments are configured to index into data structures. At least one of the memory fragments indexes into a data structure in the logical memory spaces of one of the memory allocations.
  • Each compute and memory fragment preserves information about all loops whose loop bodies directly contain the operations in the corresponding execution fragment. In one implementation, this corresponds to replicating the calculation of the loop iterators of each loop into each compute and memory fragment. This replication allows each fragment to preserve the same iterative behavior as the original program, while also allowing distributed calculation of loop iterators.
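
Continuing the illustrative sketch above, the inner execution fragment might split into memory fragments (the address calculations leading up to each memory access) and one compute fragment, with the loop iterator calculation replicated into every fragment:

    # EF_inner from the sketch above, split into fragments (illustrative only).
    # Each fragment replicates the calculation of the loop iterator j so it can
    # reproduce the same iterative behavior as the original loop.
    fragments = {
        "MF_load_w": {"iterators": ["i", "j"], "ops": ["addr_w = i * M + j", "load w[addr_w]"]},
        "MF_load_x": {"iterators": ["j"],      "ops": ["addr_x = j", "load x[addr_x]"]},
        "CF_mac":    {"iterators": ["j"],      "ops": ["acc += w_val * x_val"]},
    }
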
  • the compiler translates the applications developed with commonly used open-source packages such as Keras and PyTorch into reconfigurable processor specifications.
  • the compiler generates the configuration files with configuration data (e.g., bit stream) for the placed positions and the routed data and control networks. In one implementation, this includes assigning coordinates and communication resources of the physical memory and compute units by placing and routing units onto the array of the processor while maximizing bandwidth and minimizing latency.
  • Figure 32 illustrates an example processing node 3201 which includes a host 3210 and eight RPs 3212 numbered RP0 through RP7, all interconnected by way of a PCIe bus 3220.
  • Other implementations can include other quantities of RPs 3212, such as four or 16.
  • the configuration bit file may designate one of the RPs as a "master" RP, and allocate to it certain high level responsibilities.
  • the bit file may configure all of the RPs to be identical instances of a dataflow graph or graph fragment.
  • the configuration bit file may configure some or all of the RPs with dissimilar dataflow graphs or graph fragments.
  • the processing node 3201 also includes a SmartNIC 3222, which has one port 3224 connected to the PCIe bus 3220, and a second port 3226 connected to a LAN 3228.
  • the LAN 3228 in the present embodiment is Ethernet, but in other embodiments it could be other types of LANs such as WiFi or InfiniBand.
  • the processing node includes more than one SmartNIC connected to the PCIe bus.
  • another embodiment may have one SmartNIC for each RP in the node, in which case each SmartNIC may be situated physically adjacent to (or on the same motherboard or the same chip as) its corresponding RP. See Figure 48 for example, discussed below.
  • each SmartNIC can have multiple PCIe ports and/or multiple LAN (or other) ports.
  • Figure 33 is a block diagram of SmartNIC 3222 in Figure 32.
  • SmartNICs 4822 (Figure 48) may be the same, in appropriate embodiments. SmartNIC 3222 includes a PCIe interface 3310 for communicating with the PCIe bus 3220 and a LAN interface 3312 (shown in Figure 33 as, more specifically, an Ethernet interface) for communicating with LAN 3228.
  • the PCIe bus 3220 generally carries data and other signals according to a PCIe protocol
  • the SmartNIC 3222 includes an IP protocol processing facility 3231 to encapsulate such data according to IP standards and forward them through the LAN interface 3312 onto the LAN 3228.
  • the IP protocol processing facility 3231 also extracts such data from IP packets incoming through the LAN interface 3312, and forwards the extracted data onto the PCIe bus 3220 through the PCIe interface 3310.
  • the PCIe bus 3220 also carries messages according to a peer-to-peer (P2P) messaging protocol.
  • P2P messaging protocol is a protocol layer which facilitates message passing among peers, without requiring the use of central server computer or server software. Peers communicating with the P2P protocol are roughly equal participants in an application. A peer is both a supplier and a consumer of resources, in contrast with traditional client-server models in which the consumption and supply of resources is divided.
  • SmartNIC 3222 transparently extends the network of P2P peers over the LAN 3228 to another processing node.
  • SmartNIC 3222 therefore also includes an encapsulate/decapsulate facility 3230.
  • Encapsulate/decapsulate facility 3230 receives P2P messages from PCIe bus 3220, decapsulates them to extract the message, and re-encapsulates them in datagrams for transmission over the LAN 3228. It similarly receives P2P messages from the LAN 3228, decapsulates them to extract the P2P message, and re-encapsulates them in PCIe bus packets for the PCIe bus 3220.
  • Each data packet includes a destination address (which may be split into two or more components) which uniquely identifies any RP in the system, so encapsulate/decapsulate facility 3230 is able to determine from a packet incoming on the PCIe bus 3220, what Ethernet MAC address to apply to the repackaged datagram(s) it sends out on the LAN 3228 (and vice-versa). Encapsulate/decapsulate facility 3230 performs this address translation by referring to an address translation table that was previously programmed into the SmartNIC 3222 by the configuration bit file. In order to simplify and expedite LAN communication among processing nodes, the system may use Level 2 protocols of the OSI protocol stack, rather than Level 3/4 protocols such as IP or TCP.
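
A minimal sketch of that address translation step is shown below; the table contents, address values, and routing strings are illustrative assumptions, and in the described system the real table is programmed by the configuration bit file:

    # Hypothetical translation table: P2P destination address -> Ethernet MAC of a remote node.
    P2P_TO_MAC = {
        0x0200: "02:00:00:00:00:0a",   # RP on a remote processing node
        0x0201: "02:00:00:00:00:0b",
    }
    LOCAL_P2P_ADDRESSES = {0x0100, 0x0101}  # RPs reachable over the local PCIe bus

    def route_outgoing(p2p_dest: int) -> str:
        """Decide how to repackage a P2P message arriving from the PCIe bus."""
        if p2p_dest in LOCAL_P2P_ADDRESSES:
            return "forward on local PCIe bus"
        mac = P2P_TO_MAC.get(p2p_dest)
        if mac is None:
            raise KeyError(f"no route programmed for P2P address {p2p_dest:#06x}")
        return f"encapsulate in Level 2 Ethernet frame to {mac}"

    print(route_outgoing(0x0200))
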
  • SmartNIC 3222 also includes a dataflow offload controller 3232, a local memory pool 3234, and one or more FPGA cores 3236.
  • Dataflow offload controller 3232 contains buffers as set forth above with respect to the RPs, and includes logic to interact with other RPs via the P2P protocol in the same way as set forth above with respect to an RP. As such, it participates in the same P2P messaging conversations just as do all the RPs.
  • the SmartNIC 3222 (and any endpoints configured into the FPGA compute cores 3236, if any) also can be a destination for P2P messages from a local RP (via PCIe bus 3220) and from an external RP (via LAN 3228). It can also generate its own P2P messages destined for a local RP (via PCIe bus 3220) and from an external RP (via LAN 3228).
  • the SmartNIC 3222 (as well as any endpoints configured into the FPGA compute cores 3236, if any) are assigned addresses in the P2P protocol address space, and the uniqueness of RP addresses within that address space carries through to any endpoints in the SmartNIC as well.
  • the locations in the local memory pool 3234 are also assigned addresses which are unique within the P2P protocol address space. In embodiments herein there is a 1:1 relationship between addresses on the PCIe bus and addresses in the P2P protocol address space (though the specific address values may differ).
  • one or more of the encapsulate/decapsulate facility 3230, the IP protocol processing facility 3231, and the dataflow offload controller 3232 can be implemented in hardware logic, including as part of the FPGA 3236. In an alternative embodiment, one or more of them can be implemented by way of a software or firmware instruction-controlled processor. Also, as used herein, the PCIe protocol, the Ethernet protocol, the P2P messaging protocol, P2P-over-PCIe, and P2P-over-Ethernet, are all considered to be different protocols. Protocols are considered herein to be "different" if they differ in at least one layer of their respective protocol stacks.
  • the FPGA compute cores 3236 may be integrated onto the same chip as one or more of the other components shown in Fig. 33, or it may be a separate integrated circuit device such as one or more of Intel Agilex F-Series and I-series FPGAs, and Xilinx Virtex Ultrascale+ FPGAs, as examples.
  • the local memory pool 3234 may for example include off- the-shelf DRAM arrays.
  • the SmartNIC 3222 itself has a unique address in the P2P protocol, so that P2P messages may be directed to the SmartNIC itself rather than another RP.
  • Dataflow Offload Controller 3232 determines from the destination address in incoming P2P packets (incoming from PCIe bus 3220 or from LAN 3228) whether the destination is the SmartNIC 3222 or the opposite interface.
  • Dataflow Offload Controller 3232 forwards the extracted message to the FPGA compute cores 3236.
  • the SmartNIC 3222, and more particularly the FPGA cores 3236, are realized as vertices or graph fragments in the same Dataflow graph that is used to configure the overall system for dataflow processing of an algorithm.
  • the coarse grained reconfigurable units in one embodiment are of two specific types: Pattern Compute Units (PCUs) and Pattern Memory Units (PMUs), arranged in a two dimensional array.
  • Each PCU has a reconfigurable pipeline with multiple stages of SIMD functional units, with support for cross-SIMD lane shifting and reduction.
  • Each PMU has a banked scratchpad memory and dedicated addressing logic and address decoders. These units communicate with each other through a pipelined static hybrid interconnect with separate bus-level and word-level data, and bit-level control networks.
  • a compiler is able to map inner loop computation to one PCU such that most operands are transferred directly between functional units without scratchpad accesses or inter-PCU communication.
  • the on-chip, banked scratchpads are configurable to support streaming and double buffered accesses.
  • the off-chip memory controllers support both streaming (burst) patterns and scatter/gather accesses.
  • the on-chip control logic is configurable to support nested patterns.
  • the hierarchy in the architecture is designed to simplify compiler mapping and improve execution efficiency of the specific common parallel patterns of Map, FlatMap, Fold, and HashReduce, though the device can also be configured to implement other patterns as well.
  • Figure 34 depicts conceptual examples of each of these patterns, where computation is shown operating on four indices simultaneously. Each of these patterns takes as input one or more functions and an index domain describing the range of values that the pattern operates over. Each of these patterns builds an output and reads from an arbitrary number of input collections.
  • Map creates a single output element per index using the function f, where each execution of f is guaranteed to be independent.
  • the number of output elements from Map is the same as the size of the input iteration domain.
  • Map can be configured by software to capture the behavior of a gather, a standard element-wise map, a zip, a windowed filter, or any combination thereof.
  • FlatMap produces an arbitrary number of elements per index using function g, where again function execution is independent.
  • the produced elements are concatenated into a flat output.
  • the functional components that implement FlatMap also can be configured to implement conditional data selection (e.g. WHERE in SQL, filter in Haskell or Scala) as well, since these functions are a special case of FlatMap where g produces zero or one elements.
  • Fold first acts as a Map, producing a single element per index using the function f, then reduces these elements using an associative combine function r.
  • HashReduce generates a hash key and a value for every index using functions k and v, respectively. Values with the same corresponding key are reduced on the fly into a single accumulator using an associative combine function r.
  • HashReduce may either be dense, where the space of keys is known ahead of time and all accumulators can be statically allocated, or sparse, where the pattern may generate an arbitrary number of keys at runtime.
  • Histogram creation is a common, simple example of HashReduce where the key function gives the histogram bin, the value function is defined to always be "1", and the combine function is integer addition.
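
Software analogues of the four patterns, written over plain Python sequences purely to show their shapes (not how the CGRA executes them), are sketched below, ending with the histogram example:

    from collections import defaultdict
    from functools import reduce

    def pattern_map(f, domain):
        return [f(i) for i in domain]                       # one output element per index

    def pattern_flatmap(g, domain):
        return [y for i in domain for y in g(i)]            # zero or more elements per index, concatenated

    def pattern_fold(f, r, init, domain):
        return reduce(r, (f(i) for i in domain), init)      # map, then associative reduction

    def pattern_hashreduce(k, v, r, domain):
        acc = defaultdict(lambda: None)
        for i in domain:
            key, val = k(i), v(i)
            acc[key] = val if acc[key] is None else r(acc[key], val)
        return dict(acc)

    # Histogram as a HashReduce: key = bin, value = 1, combine = integer addition.
    data = [3, 1, 3, 7, 1, 3]
    histogram = pattern_hashreduce(lambda i: data[i], lambda i: 1, lambda a, b: a + b, range(len(data)))
    assert histogram == {3: 3, 1: 2, 7: 1}
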
  • each PCU is designed to execute a single, innermost parallel pattern in an application.
  • the PCU data path is organized as a multi-stage, reconfigurable SIMD pipeline.
  • Each stage of each SIMD lane includes a functional unit (FU) and associated pipeline registers (PR).
  • FUs perform 32-bit word level arithmetic and binary operations, including support for floating point and integer operations.
  • results from each FU are written to its associated register.
  • Pipeline registers in each lane are chained together across pipeline stages to allow live values to propagate between stages within the same lane.
  • Cross-lane communication between FUs is captured using two types of intra-PCU networks: a reduction tree network that allows reducing values from multiple lanes into a single scalar, and a shift network which allows using PRs as sliding windows across stages to exploit reuse in stencil applications. Both networks use dedicated registers within PRs to minimize hardware overhead.
  • PCUs interface with the global interconnect using three kinds of inputs and outputs (IO): scalar, vector, and control.
  • Scalar IO is used to communicate single words of data, such as the results of Folds.
  • Each vector IO allows communicating one word per lane in the PCU, and is used in cases such as reading and writing to scratchpads in PMUs and transmitting intermediate data across a long pipeline between multiple PCUs.
  • Each vector and scalar input is buffered using a small FIFO.
  • Control IO is used to communicate control signals such as the start or end of execution of a PCU, or to indicate backpressure.
  • a reconfigurable chain of counters generates pattern iteration indices and control signals to coordinate execution.
  • PCU execution begins when the control block enables one of the counters.
  • the program can configure the control block to combine multiple control signals from both local FIFOs and global control inputs to trigger PCU execution.
  • the control block is implemented using reconfigurable combinational logic and programmable up-down counters for state machines.
  • the PMUs are similarly configurable with coarse grained components. Each PMU contains a programmer-managed scratchpad memory coupled with a reconfigurable scalar data path that can be used for address calculation. PMUs are used to distribute on-chip memory throughout the device.
  • the scratchpads are built with multiple SRAM banks matching the number of PCU lanes.
  • Address decoding logic around the scratchpad can be configured by the program to operate in a striped banking mode, to support linear access patterns often found on dense data structures, or in a FIFO mode, to support access patterns resembling a sliding window.
  • the address decoding logic can also be configured for a duplication mode, where the contents are duplicated across all memory banks. Duplication mode provides multiple read address channels to support parallelized on-chip gather operations.
  • each PMU scratchpad can be configured to operate as an N-buffer with any of the above banking modes.
  • N-buffers are implemented by partitioning the address space in each SRAM bank into N disjoint regions. Using write and read state information, an appropriate offset is added to each bank’s local address to access the correct data.
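
The following sketch shows the N-buffer idea for a double buffer (N = 2); the bank size, region layout, and names are illustrative assumptions:

    # Illustrative parameters: a bank with 256 words arranged as a double buffer (N = 2).
    N = 2
    BANK_WORDS = 256
    REGION_WORDS = BANK_WORDS // N

    write_region = 0   # region currently being filled by the producer
    read_region = 1    # region currently being drained by the consumer

    def write_address(local_addr: int) -> int:
        return write_region * REGION_WORDS + local_addr

    def read_address(local_addr: int) -> int:
        return read_region * REGION_WORDS + local_addr

    def swap_buffers():
        """After the producer and consumer both finish, the regions exchange roles."""
        global write_region, read_region
        write_region, read_region = read_region, write_region

    assert write_address(5) == 5      # offset into region 0
    assert read_address(5) == 133     # offset into region 1 (128 + 5)
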
  • a programmable counter chain and control block triggers PMU execution similar to the PCU.
  • Each PMU typically contains write address calculation logic from the producer pattern, and read address calculation logic from the consumer pattern.
  • the control block can be configured to trigger the write address computation, read address computation, or both, by enabling the appropriate counters.
  • the interconnects among the FUs are also configurable by the program. More information on the structure and configurability of functional units in the CGRA embodiment can be found elsewhere herein, and in the above-incorporated publication by Prabhakar et al., entitled “Plasticine: A Reconfigurable Architecture for Parallel Patterns,” ISCA '17, June 24-28, 2017, Toronto, ON, Canada.
  • FPGAs can also be configured to implement the above patterns, but typically with much poorer energy efficiency.
  • FPGAs have much smaller and more generalized building blocks, and are therefore much more flexible in implementing functions that do not fit well into the programming paradigm for parallel computing as described above. FPGAs therefore are better at implementing such functions as simple arithmetic operations like addition, subtraction, multiplication and division, arbitrary state machines, moving of data to and from arbitrary locations in local memory, and/or other functions like bespoke encryption, compression, hash, transformation, or reduction functions that are not easily or efficiently implemented in a SIMD fashion in a CGRA.
  • the inclusion of FPGA compute cores and local memory on the SmartNIC 3222 enables a configuration program to assign execution of limited compute kernels to the SmartNIC, thereby offloading dataflow graph fragments from the RPs, or supplementing their execution on an RP.
  • the SmartNIC 3222 is capable of generating and responding to P2P traffic, offloading kernels, providing fabric-optimized engines, supporting dataflow control/data requests and dataflow offload, and orchestrating P2P control/data requests between FPGA compute cores and RPs.
  • the local memory pool 3234 is available as scratchpad for the compute cores 3236. It is also accessible by any device in the system, including any RP and any other SmartNIC. In this manner, the CGRA and the FPGA can collaborate symbiotically to further optimize the execution of dataflow graphs.
  • Figure 35 illustrates a section from an example processing graph.
  • the processing graph is used to implement a neural network, such as a CNN, an FCNN, an RNN, an LSTM network, an autoencoder, a deep belief network, a GAN, and/or the like.
  • Fig. 35 illustrates one example section 3500 of a processing graph comprising processing nodes 3508, 3512 implementing convolution operations, and processing node 3516 implementing max-pooling operation.
  • the section 3500 of the processing graph comprises a sequence of processing nodes or layers. Individual processing nodes or layers perform a corresponding operation.
  • the layers in the sequence of layers include one or more of convolution layers, max pooling layers, min pooling layers, average pooling layers, non-linearity layers, normalization layers, dropout layers, concatenation layers, transpose convolution layers, fully connected layers, SoftMax layers, and/or loss layers.
  • the example section 3500 of Fig. 35 includes two example types of layers, such as convolution layers and a max-pool layer.
  • the terms “layer” implementing an operation and “processing node” implementing an operation are used interchangeably.
  • the sequence of processing nodes includes an input processing node 3508 configured to receive an input tensor 3502.
  • the input processing node 3508 of the section 3500 convolves the input tensor 3502 with a kernel (not illustrated), to generate an intermediate tensor 3510.
  • An intermediate processing node 3512 of the section 3500 convolves the intermediate tensor 3510 with another kernel (not illustrated), to generate another intermediate tensor 3514.
  • An output processing node 3516 of the section 3500 performs a pooling operation (such as a max-pooling operation) on the intermediate tensor 3514, to generate an output tensor 3520 and an index tensor 3522.
  • Although the section 3500 is illustrated to include three processing nodes, in another example the section 3500 can include a greater (or smaller) number of processing nodes. For example, the section 3500 can include a higher number of convolution layers.
  • Although convolution and max-pooling layers are illustrated in the section 3500 of the processing graph, other types of layers may also be included, such as layers implementing ReLU, average pooling, fully connected layers, and/or the like.
  • the processing graph may include simpler functions, such as simple arithmetic operations, reading and writing to memory, and so on.
  • Fig. 35 illustrates a single section of a processing graph of an application, a processing graph of an application can include multiple such sections, pipelined serially or in parallel, or both in combination.
  • a software compiler is used to generate dataflow graphs of the high-level programs of the applications.
  • the compiler transforms the input behavioral description of the high-level programs into an intermediate representation such as the dataflow graphs. This may include code optimization steps like false data dependency elimination, dead-code elimination, and constant folding.
  • the dataflow graphs encode the data and control dependencies of the high-level programs.
  • Figure 36 illustrates an overall structure that may be used for the compiler.
  • the compiler 3610 includes user entry points for common open-source, machine-learning frameworks, such as PyTorch and TensorFlow. Serialized graphs from other frameworks and tools also can be imported. These inputs are provided to a Dataflow Graph Analyzer 3612, which accepts models from the frameworks and then analyzes the model to extract the dataflow graph. For each operator, the computation and communication requirements are determined, so the appropriate RP resources can be allocated later. The analyzer determines the most efficient mappings of the operators and communication patterns to the RP utilizing the spatial programming model. With knowledge of both the model architecture and the RP architecture, the analyzer can also perform high-level, domain-specific optimizations like node fusion.
  • the analyzer also determines which fragments of the application should be allocated to an FPGA, if one will exist in the system.
  • the output of the Dataflow Graph Analyzer 3612 is an annotated Dataflow Graph 3614 that serves as the first intermediate representation (IR) passed to a Dataflow Compiler 3616.
  • a Template Compiler 3618 analyzes the operator and generates an optimized dataflow implementation for the RP, called a Spatial Template 3620.
  • the generated template includes bindings that enable the new operator to be used directly from application code in the same way as built-in framework operators. If the analyzer 3612 determined that the operator is to be executed on the FPGA rather than on an RP, the compilation for the operator is performed by an FPGA compiler 3622.
  • the Dataflow Optimizer, Compiler and Assembler 3616 receives annotated Dataflow Graphs and performs high-level transformations like meta-pipelining, multi-section support and parallelization. It also understands the RP hardware attributes and performs low-level transforms, primarily placing and routing by mapping the graph onto the physical RP and FPGA hardware and then outputting an executable runtime configuration bit file 3624.
  • the bit file may contain code for the RP as well as a pointer to the bit file for the FPGA hardware, since the FPGA hardware compilation process (P&R, fine-grain programming) may differ from the RP compilation process.
  • Once a configuration bit file is ready, it can be loaded into the system to configure all of the RPs and FPGAs to execute the application during runtime.
  • the configuration file is stored in memory, and transferred to the reconfigurable processors via a combination of parallel and serial techniques.
  • the configurable units include configuration data stores, implemented using for example serial chains of latches, to store unit files of configuration data.
  • the unit file particular to a configurable unit can comprise a plurality of sub-files of configuration data. In examples described herein, the sub-files consist of a “chunk” of data having a size suited to efficient distribution using the bus system.
  • Each of the configurable units can include logic to execute a unit configuration load process, including receiving via the bus system, sub-files of a unit file particular to the configurable unit, and loading the received sub-files into the configuration store of the configurable unit.
  • configurable units in the plurality of configurable units use routes in the bus system during execution after configuration that are also used in the configuration load process.
  • the unit files can be organized to comprise a plurality of ordered sub-files.
  • the unit files particular to different configurable units may have different numbers of ordered sub-files in some embodiments.
  • the configuration file for an array of configurable units is arranged so that sub-files of the unit files are interleaved with other sub-files of the same order for other unit files, and arranged so that location of a sub-file in the configuration file implies the configurable unit in the array of the sub-file and its order in the unit file particular to the configurable unit.
  • the configuration data stores in configurable units can comprise serial chains, and the unit configuration load process can execute by receiving, in one bus cycle, all or part of a first sub-file of the unit file particular to the configurable unit from the bus system in one round of the distribution sequence, beginning to push the received first sub-file into the serial chain during subsequent bus cycles before receiving a second sub-file in a next round of the distribution sequence, receiving the second sub-file in the next round of the distribution sequence from the bus system in a later bus cycle, and beginning to push the received second sub-file into the serial chain during bus cycles after pushing earlier received sub-files into the serial chain.
  • the first sub-file is consumed by the unit configuration load process in the configurable unit before the second sub-file in the plurality of ordered sub-files is received by the configurable unit.
  • the system can include more than one type of configurable unit, and the unit files for different types of configurable units can include different numbers of sub-files of configuration data.
  • the unit files for a first type of configurable unit include Z1 chunks
  • the unit files for a second type of configurable unit include Z2 chunks, where Z1 is less than Z2.
  • the configuration load process can include retrieving segments of the configuration file including sub-file (i) of the unit files for all of the configurable units of a first type and the second type to be distributed in round R(i), for (i) going from 0 to Z1-1, and then retrieving segments of the configuration file including sub-file (i) of the unit files for all of the configurable units of the second type to be distributed in round R(i), for (i) going from Z1 to Z2-1.
  • This protocol can be extended to any number of types of configurable units having different numbers of sub-files in their unit files.
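  • As a hedged illustration of the distribution sequence just described (the unit-type names and chunk counts below are hypothetical, not taken from any particular embodiment):

```python
# Sketch of the round-ordered retrieval for two unit types whose unit files have
# Z1 and Z2 chunks respectively, with Z1 < Z2.  In round (i), chunk (i) is
# distributed to every unit type that still has chunks remaining.

def distribution_sequence(z1: int, z2: int):
    for i in range(z1):
        yield i, ("type1", "type2")   # both types still have chunks to receive
    for i in range(z1, z2):
        yield i, ("type2",)           # only the larger unit files remain

for rnd, types in distribution_sequence(z1=2, z2=5):
    print(f"round {rnd}: distribute chunk ({rnd}) to units of {', '.join(types)}")
```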
  • a configuration load command identifying a location in memory of the configuration file can be received from a host process, and in response to the command, the process generates one or more memory access requests. As the requested portions of the configuration file are returned, the distribution sequence can be executed.
  • the sub-files of the plurality of unit files can be arranged in the configuration file in an interleaved fashion that matches the distribution sequence. This arrangement of the configuration files enables the configuration load process to imply the configurable unit, and the position in the plurality of ordered sub-files of each sub-file by the location of the sub-file in the configuration file.
  • the array configuration load process can include routing the subfiles to configurable units based on locations of the sub-files in the configuration file.
  • Configurable units in the array of configurable units in an example described herein include respective load complete status logic connected in a daisy chain starting and ending at the array configuration load logic.
  • the array configuration load logic forwards a configuration load complete signal on the daisy chain after the configuration file is distributed, and in each configurable unit in the array, the configuration load complete status logic forwards the configuration load complete signal on the daisy chain when the configuration load complete signal from a previous member of the chain is received and loading of its own unit file is completed.
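  • The daisy-chained completion protocol can be pictured, in a purely illustrative software sketch, as a token that each unit holds until its own load finishes:

```python
# Hedged sketch of the load-complete daisy chain: the array configuration load
# logic launches a completion token; each configurable unit forwards it only after
# both (a) the token has arrived from the previous member of the chain and (b) its
# own unit file has finished loading.  Function and variable names are illustrative.

def completion_time(unit_load_done_times: list[int], launch_time: int = 0) -> int:
    """Return the time at which the completion token returns to the load logic."""
    token_time = launch_time
    for done_time in unit_load_done_times:
        token_time = max(token_time, done_time)   # unit holds the token until loaded
    return token_time

# Example: the token cannot return before the slowest unit finishes (time 17 here).
assert completion_time([5, 17, 9]) == 17
```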
  • the host is notified when the file has been completely loaded, and in response to such notification the host can send a start command to one or more of the configurable units to initiate the function to be executed by the machine.
  • the start command can be sent using the P2P protocol, and can be sent via the PCIe bus.
  • the bus system for carrying the configuration file segments can be the same as or substantially overlap the communication networks described elsewhere herein for carrying P2P messages.
  • Figure 37 illustrates one example organization of a configuration file. Other organizations can be used as well arranged as suits a particular protocol for loading and unloading configuration files.
  • configurable units in an array of configurable units include the Switch, PCU, PMU, AGCU and FPGA. Each of these configurable units contains a set of registers that represent either the setup or the sequence to run a program. These registers include data to define the operation of the configurable unit containing it, such as the number of nested loops, the limits of each loop iterator, the instructions to be executed for each stage, the source of the operands, and the network parameters for the input and output interfaces. Additionally, each of the configuration files can include data to set context in a set of counters that track its progress in each nested loop.
  • the unit files for FPGA units may not consist of sub-files, and their configuration load process may differ from the preceding.
  • a program executable contains a bit-stream representing the initial configuration, or starting state, of each of the configurable units that execute the program.
  • This bit-stream is referred to herein as a bit file, or as a configuration file.
  • Program load is the process of setting up the configuration stores in the configurable units based on the contents of the configuration file to allow all the configurable units to execute a program.
  • Program unload is the process of unloading the configuration stores from the configurable units, and assembling a bit-stream, called herein an unload configuration file.
  • the unload configuration file has, in examples described herein, the same arrangement of chunks or sub-files as the configuration file used for program load.
  • the configuration file includes a plurality of chunks of configuration data for each configurable unit in an array of configurable units, the chunks being arranged in the configuration file in a fashion that matches the sequence in which they are to be distributed. This organization of the configuration file enables the array configuration load process to route the chunks to configurable units based on locations of the chunks in the configuration file.
  • M is seven, and the chunks are ordered from first to seventh (i.e. the first through the seventh chunks correspond with chunks (0) to (6) in this indexing).
  • the chunks are arranged so that all sub-files of order (i) for (i) going from 0 to M-1, for all the unit files in the load or unload configuration file are stored in a corresponding block (i) of address space in the memory, for (i) going from 0 to M-1.
  • the chunks of order (0) are stored in block (0) including addresses A0 to Al-1.
  • the chunks of order (0) for switch units in this example are in a group of contiguous addresses within block (0).
  • the chunks of order (0) for PCUs are in a group of contiguous addresses within block (0).
  • the chunks of order (0) for PMUs are in a group of contiguous addresses within block (0).
  • the chunks of order (0) for AGCUs are in a group of contiguous addresses.
  • the chunks of order (0) for FPGAs are in a group of contiguous addresses.
  • the chunks of order (1) are stored in block (1) including addresses Al to A2-1.
  • the chunks of order (1) for switch units in this example are stored in a group of contiguous addresses within block (1).
  • the chunks of order (1) for PCUs are in a group of contiguous addresses within block (1).
  • the chunks of order (1) for PMUs are in a group of contiguous addresses within block (1).
  • the chunks of order (1) for AGCUs are in a group of contiguous addresses within block (1).
  • the chunks of order (1) for FPGAs are in a group of contiguous addresses within block (1).
  • the chunks of orders 3 to 6 are arranged as seen in Figure 37, following the pattern in blocks (2) to (6).
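  • To make the "location implies destination" property concrete, the following hypothetical sketch maps a chunk's position in the configuration file to its order and its destination unit, assuming per-type unit counts fixed at compile time and considering only the early blocks in which every unit type still has a chunk of that order:

```python
# Sketch: derive (order, unit type, unit index) from a chunk's position in the
# configuration file.  The counts and the type ordering below are illustrative.

UNIT_COUNTS = {"switch": 12, "pcu": 8, "pmu": 8, "agcu": 4, "fpga": 1}
CHUNKS_PER_BLOCK = sum(UNIT_COUNTS.values())   # one chunk per unit per block here

def locate(chunk_position: int):
    order = chunk_position // CHUNKS_PER_BLOCK   # which block (i) the chunk is in
    offset = chunk_position % CHUNKS_PER_BLOCK   # position within that block
    for unit_type, count in UNIT_COUNTS.items(): # one contiguous group per type
        if offset < count:
            return order, unit_type, offset
        offset -= count
    raise ValueError("position outside the block layout")

print(locate(0))    # (0, 'switch', 0): chunk of order (0) for the first switch unit
print(locate(35))   # (1, 'switch', 2): chunk of order (1) for the third switch unit
```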
  • the array can include more than one type of configurable unit, and the unit files for different types of configurable units can include different numbers of chunks of configuration data.
  • types of configurable units in the array can include Switch Units, PCU (Pattern Compute Units), PMU (Pattern Memory Units), AGCU (Address Generation and Coalescing Units) and FPGA (Field Programmable Gate Array) units.
  • the unit files can be organized to comprise a plurality of ordered chunks (or other sized sub-files).
  • the unit files particular to different configurable units may have different numbers of ordered chunks in some embodiments.
  • the configuration file for an array of configurable units is arranged so that chunks of the unit files are grouped with chunks of the same order for other unit files. Also, the configuration file is arranged so that location of a chunk in the configuration file implies the configurable unit in the array of the chunk and its order in the unit file particular to the configurable unit.
  • the chunks (0) of the unit files for all of the configurable units of the five types are retrieved in a first round, and the chunks (1) of the unit files for all of the configurable units of the five types are retrieved in a second round. After the first and second rounds, all (2) chunks of the unit files for all of the configurable units of the first type (Switch type) have been retrieved.
  • the unit files for all of the configurable units of the first, second, third, fourth and fifth types have 0, 1, 3, 4 and 5 chunks remaining to be retrieved, respectively.
  • the array configuration load process can then retrieve segments of the configuration file including chunk (i) of the unit files for all of the configurable units of the second, third, fourth and fifth types in a third round. After the third round, all (3) chunks of the unit files for all of the configurable units of the second type (PCU type) have been retrieved. The unit files for all of the configurable units of the first, second, third, fourth and fifth types have 0, 0, 2, 3 and 4 chunks remaining to be retrieved, respectively.
  • the array configuration load process can then retrieve segments of the configuration file including chunk (i) of the unit files for all of the configurable units of the third, fourth and fifth types in a fourth round. After the fourth round, all (4) chunks of the unit files for all of the configurable units of the third type (PMU type) have been retrieved.
  • the unit files for all of the configurable units of the first, second, third, fourth and fifth types have 0, 0, 1, 2 and 3 chunks remaining to be retrieved, respectively.
  • the array configuration load process can then retrieve segments of the configuration file including chunk (i) of the unit files for all of the configurable units of the third, fourth and fifth types, in fifth, sixth and seventh rounds. After the sixth round, all (6) chunks of the unit files for all of the configurable units of the fourth type (AGCU type) have been retrieved.
  • the unit files for all of the configurable units of the first, second, third, fourth and fifth types have 0, 0, 0, 0 and 0 chunks remaining to be retrieved, respectively.
  • the array configuration load process can continue until the unit files for all of the configurable units of the first, second, third, fourth and fifth types have no chunks remaining to be retrieved.
  • the array configuration load process routes chunks of the configuration data to configurable units via the array level network using addresses implied by location of the chunks in the configuration file.
  • the chunks of the configuration file may be returned out of order to the configuration load controller from memory.
  • the location of the chunks in the configuration file can be used to route the chunk to the correct configurable unit. Because of the organization of the rounds in the distribution sequence, the configurable units are guaranteed to receive the chunks of their unit files in order.
  • the configuration load process may utilize the FPGA manufacturer's software and procedures rather than the mechanisms described herein for configuring the RPs.
  • a number of different kinds of applications can make use of the availability of an FPGA as set forth herein, to offload portions of a dataflow graph for increased efficiency or throughput. For example, if a deep learning application includes large numbers of parameters to be learned, and/or large numbers of data samples to use in training, it might be desirable to store these data in a storage cluster or an SQL database system. Accessing such data might require inordinate wait times if it were to be handled by an RP. With an FPGA, especially one built onto a SmartNIC, such I/O intensive tasks can be offloaded to the FPGA, leaving the powerful compute resources of the RP available for more appropriate tasks.
  • encryption and decryption tasks can be offloaded to the FPGA. If such tasks are peripheral to the primary purpose of the application, then offloading the encryption/decryption tasks to the FPGA would leave the RP's resources available for tasks more central to the application. Again, it is especially useful for the FPGA to be co-located with a SmartNIC which has interfaces to both the intra-node PCIe bus as well as an external LAN.
  • an application involves processing incoming stream data, such as audio, video or real-time data acquisition data
  • co-location of the FPGA with the SmartNIC is advantageous because of its interfaces to both the PCIe bus and an external LAN.
  • the functions to be performed by the FPGA can be realized as part of the dataflow graph, assigned to the FPGA at compile time, and configured into the FPGA at the time of configuration load.
  • Figure 38 illustrates a simple deep learning application that is implemented with data parallelism across multiple RPs in a single compute node as previously set forth with respect to Figure 32.
  • the drawing illustrates two of the RP's designated RP0 and RPi, where the lower case subscript 'i' indicates that the component labeled RPi represents any number of RPs. All the RPs are configured with the same processing graph fragment, to learn the weights in a multi-layer neural network based on training data 3812.
  • the training data 3812 has been partitioned into multiple training data sets, each to be processed by a respective one of the RP's in parallel.
  • Partition 0 is to be processed by RP0 and each of the Partitions 'i' are to be processed by the respective RPi.
  • Each of the training data partitions is further subdivided into mini-batches of data samples as will become apparent below.
  • a component is given a subscript i, it will be understood to be representative of all of such components participating in the application execution.
  • each of the RP's is further in communication with the SmartNIC 3222, via a P2P messaging protocol carried over a PCIe bus 3220.
  • the FPGA in the SmartNIC 3222 has been configured by the configuration bit file in the manner described herein, to perform the functions described below. All of the RPs are configured with identical dataflow graph fragments, except that one of them, the one designated RP0 in Figure 38, may be considered a master.
  • the master is configured by the configuration bit file to perform certain high level dataflow functions for which performance by all the RPs would not be useful.
  • the application configured into the system of Fig. 38 is a neural network stochastic gradient descent (SGD) training application which has been further divided, longitudinally along a forward path of the network, into three graph "Sections" (or neural network layer sections) numbered 0, 1 and 2.
  • a “Section” of a dataflow graph refers to a portion of the graph that is allocated a time slice on the RP.
  • a “Section” can include multiple sequential layers of a neural network, and/or different branches of the neural network. Sectioning is performed by the compiler, with knowledge of the resources available to the RPs.
  • each of the sections may include internal branching, but they are divided such that each section requires the results of its upstream section for its own forward calculations (prediction calculation).
  • each section requires gradients calculated by its own downstream section (upstream when viewed according to the backward propagation direction) for each of the data samples in the mini-batch that the particular RP is processing, in order to perform its own backward calculations.
  • Back-propagation does not need to know the partial results of other instances of the dataflow, until a synchronization step at which the parameters of the model are updated based on the gradients calculated by all the graph instances operating in parallel on different mini-batches. At that point an average gradient is calculated with respect to each of the parameters, averaged across all of the mini-batches processed by all of the graph instances.
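  • Expressed as a formula (a standard data-parallel statement included only for clarity; the symbols G, L_k and η do not appear in the figures), the synchronization step produces, for each learnable parameter θ, the gradient averaged over all G graph instances, which is then used to update θ:

```latex
\bar{g}_{\theta} \;=\; \frac{1}{G}\sum_{k=1}^{G}\frac{\partial L_k}{\partial \theta},
\qquad
\theta \;\leftarrow\; \theta \;-\; \eta\,\bar{g}_{\theta}
```

  • Here L_k is the loss computed by graph instance k on its own mini-batch and η is the learning rate factor referred to elsewhere herein.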
  • the SGD process is said above to calculate updated "parameters".
  • the parameters to be learned are the neural network "weights”, by which each neuron in an upstream layer influences each neuron in the next layer downstream.
  • many networks also include other parameters which can be made learnable, such as biases.
  • the term “parameters” is intended herein to be a generalization of the term “weights”. Nevertheless, the two terms are used interchangeably herein for clarity of the discussion.
  • FIG 39 illustrates the temporal progress resulting from a conventional implementation of data parallelism, in which the Data Parallel operations of Figure 38 are performed by GPUs orchestrated by a host. It can be seen that as time progresses, GPU0 performs forward processing on Section S0 (abbreviated FWD0), then FWD1, then FWD2. At this point GPU0 calculates the loss (step not shown) for each of the data samples in the mini-batch, and begins the first back-propagation step BWD2. In parallel with these steps, each of the other GPU's participating in the data parallel operation also performs the same sequence of steps on their own mini-batches of training data.
  • each of the GPU's may execute at a different speed, and thus may complete step BWD2 at different times.
  • each of the GPUs notifies the host that it has completed its respective step BWD2, and the host causes each of them to transmit its calculated gradients with respect to each of the parameters in Section S2, to all the other GPU's (SYNC2).
  • the host may write this command into a GPU control register or a storage location that the GPU is monitoring.
  • each of the GPUs performs its respective backward pass for Section S1 (BWD1) and notifies the host when done.
  • the host causes each of the GPUs to transmit their calculated gradients with respect to each of the parameters in Section S1, to all the other GPU's (SYNC1).
  • each of the GPUs performs its respective backward pass for Section S0 (BWD0) and notifies the host when done.
  • the host causes each of the GPUs to transmit its calculated gradients with respect to each of the parameters in Section 0, to all the other GPU's (SYNC0).
  • the GPUs cannot continue with the All-Reduce (AR) steps because typically they must wait for all the cross-transmissions to complete first. Thus one time step is lost in each GPU while it idles awaiting notification from the host that all the SYNC steps have been completed. Then each GPU can proceed with its All-Reduce operation to calculate the average gradient with respect to each of the parameters in Section S2 (step AR2 in time step 8). Similarly each GPU then performs its All-Reduce operation to calculate the average gradient with respect to each of the parameters in Section 1 (step AR1 in time step 9), and then each GPU performs its All-Reduce operation to calculate the average gradient with respect to each of the parameters in Section 0 (step AR0 in time step 10).
  • Figure 40 illustrates the temporal progress resulting from an improved implementation using an FPGA 3236 on the SmartNIC 3224 to both orchestrate the SYNC steps and to perform the All-Reduce steps, thereby offloading time consuming processing steps from the RPs.
  • the steps FWD0, FWD1, FWD2 and BWD2 occur as in Figure 39.
  • FWD0 processes an entire mini-batch before FWD1 begins, and so on for FWD2.
  • no host is involved at runtime to set up and trigger each of these steps.
  • the dataflow graph has been pre-configured into each of the RP's to notify the FPGA via P2P messages directly to the SmartNIC on the PCIe bus 3220. Then, while the RPs are calculating BWD1, the FPGA is reading the gradients calculated by each of the RP's directly from the RP's memory, to the local memory pool 3234. This too has been pre-configured into the FPGA by the dataflow graph and is triggered by the arrival of the data, not by any signal from a host CPU.
  • both the SYNC2 and AR2 steps are performed by the FPGA in parallel with the BWD1 step in the RP's, and both the SYNC1 and AR1 steps are performed by the FPGA in parallel with the BWD0 step in the RP's.
  • the FPGA can even be configured to add up, with respect to each parameter in Section S2, the gradients received from the RP's on the fly, as they are received, thereby performing All-Reduce on the fly. No additional control and networking overhead is incurred by such an operation because it has already been pre-configured into the FPGA.
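  • A minimal software sketch of that on-the-fly accumulation is given below; it is illustrative only (the real logic is instantiated in FPGA fabric and triggered by data arrival rather than by function calls), and the names are hypothetical:

```python
import numpy as np

# Sketch of on-the-fly reduction: each arriving RP's gradients are added into a
# running sum, so the All-Reduce total for the Section is complete as soon as the
# last local contribution lands.

class RunningReduce:
    def __init__(self, num_params: int, num_rps: int):
        self.total = np.zeros(num_params)
        self.remaining = num_rps

    def on_gradients(self, grads: np.ndarray) -> bool:
        self.total += grads             # accumulate as the data arrives
        self.remaining -= 1
        return self.remaining == 0      # True once every local RP has contributed

reducer = RunningReduce(num_params=3, num_rps=2)
reducer.on_gradients(np.array([1.0, 2.0, 3.0]))
done = reducer.on_gradients(np.array([0.5, 0.5, 0.5]))
assert done and np.allclose(reducer.total, [1.5, 2.5, 3.5])
```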
  • Figure 41 illustrates the temporal progress resulting from a further improved implementation, which can be advantageous if the FPGA is sufficiently faster than the RPs.
  • Figure 41 is similar to Figure 40 in that the dataflow pipelines configured into each of the RP's and to each of the FPGAs are able to perform SYNC2 and AR2 in the FPGA in parallel with the RPs performing BWD1.
  • Figure 41 is different, though, in that the FPGAs also perform an optimization step OPT2 while the RPs are performing BWD1. Optimizations are post-processing steps to adjust the learning rate, or to otherwise help the algorithm overcome saddle points, for example.
  • Note also that while a host 3210 is shown in Figure 32, no host is shown in Figure 38. This is because while host 3210 may be connected to the PCIe bus 3220, it does not control any of the operations of the RPs or the FPGA once they have begun. Nor do any of the messages or data passing among the RPs and/or SmartNICs pass through the host. All of the steps shown in Figures 40 and 41 were pre-configured into the RPs and the FPGA, and are triggered by the receipt of sufficient data to begin the step, or an explicit completion token from the sender, but not by a command by any host.
  • Host 3210 may be involved before and after the steps of Figures 40 and 41, such as to record some of the parameter updates for later analysis, and in some embodiments to direct each RP to the next mini-batch to process.
  • Host 3210 also may be the configuration load controller that initiates the configuration load process. But once the process of Figure 40 or 41 begins for a particular mini-batch, it runs to completion without further host intervention.
  • FIG 42 illustrates an example data center 4210 incorporating multiple processing nodes 3201 each as described above with respect to Figure 32.
  • Four processing nodes are shown, numbered 0-3.
  • each processing node 3201 includes a respective host 3210 and eight (for example) RPs 3212 numbered RP0 through RP7, all interconnected by way of a respective PCIe bus 3220.
  • RPs and other units within a single processing node are sometimes referred to herein as "local" to each other, whereas units that are in different processing nodes are sometimes referred to herein as "foreign" to each other.
  • Each processing node 3201 also includes a respective SmartNIC 4222, which has one port 4224 connected to the local PCIe bus 3220 in the respective processing node, and a second port 4226 connected to a LAN 3228.
  • the SmartNICs also are given subscripts in Figure 42 corresponding to the node number to which they belong (e.g. SmartNIC0, SmartNIC1, SmartNIC2 and SmartNIC3).
  • the LAN 3218 in Figure 42 is an Ethernet, but in other embodiments it could be other types of LANs such as WiFi or InfiniBand.
  • the LAN could be constructed with various topologies in different embodiments, including all interconnected by a single layer 2 switch.
  • the LAN is constructed of four separate segments, connected in a ring topology from one SmartNIC to the next.
  • Each of the Ethernet ports 4226 in Figure 42 is considered to have two sub-ports in order to support this topology. (Other implementations can have more or fewer sub-ports, as needed given the parameter size relative to mini-batch execution time and throughput).
  • SmartNIC0 has one Ethernet sub-port connected to SmartNIC3 and another connected to SmartNIC1; SmartNIC1 has one Ethernet sub-port connected to SmartNIC0 and another connected to SmartNIC2; SmartNIC2 has one Ethernet sub-port connected to SmartNIC1 and another connected to SmartNIC3; and SmartNIC3 has one Ethernet sub-port connected to SmartNIC2 and another connected to SmartNIC0.
  • All of the Ethernet segments in Figure 42 are sometimes referred to herein collectively as a single LAN or Ethernet 4228.
  • the reconfigurable components in all of the processing nodes in the system 4210 are configured by a configuration load process as described above with respect to Figure 32.
  • one of the hosts 3210 acts as the configuration load controller for all processing nodes 3201, whereas in another embodiment each of the hosts 3210 acts as the configuration load controller for only those reconfigurable components that reside in its own processing node 3201.
  • a separate member (not shown in Figure 42) acts as the configuration load controller for all of the processing nodes 3201.
  • Other variations will be apparent to the reader.
  • the configuration bit file may designate one of the hosts 3210 as a master host, and/or may designate one of the RPs in each processing node 3201 as a master RP for that node.
  • the configuration bit file may allocate certain high level responsibilities to such a master RP or master host.
  • the bit file may configure all of the RPs in one or more of the processing nodes to be identical instances of a dataflow graph or graph fragment.
  • the configuration bit file may configure some or all of the RPs with dissimilar dataflow graphs or graph fragments. The hosts, too, may be programmed similarly or differently than the other hosts.
  • Figure 43 illustrates a SGD deep learning application similar to that of Figure 38, that is implemented with data parallelism across multiple RPs in the multiple processing nodes of Figure 42.
  • the drawing illustrates two of the processing nodes designated processing node 0 and processing node k, where the lower case subscript 'k' indicates that the component labeled processing node k represents any of the processing nodes.
  • All the RPs in all of the processing nodes 3201 are configured with the same processing graph fragment, to learn the weights in a multi-layer neural network based on training data 3812.
  • the training data 3812 has been partitioned into multiple training data sets, each to be processed by a respective one of the processing nodes in parallel. Each partition is further divided within a processing node for processing by respective RPs in that processing node.
  • the deep learning application of Figure 43 has the same time-saving benefits of SYNC/ AR offload as explained above with respect to Figure 40, the difference being that each of the SYNC/AR steps includes contributions from all the RPs in all the processing nodes 3201.
  • the application of Figure 43 operates roughly by the local SmartNICs each accumulating all gradients from all local RPs to the local SmartNIC's memory, and all the SmartNICs then participating in a Ring All-Reduce process. Updated weights (or other parameters) are then calculated independently by each of the FPGAs from the resulting average gradients, and broadcast to each SmartNIC's local RPs for use in the next training epoch.
  • the process that takes place is illustrated in more detail in Figures 44-47, which can collectively be taken as an illustration of a dataflow graph configured into the various units in Figure 42.
  • Figure 44 illustrates a dataflow graph fragment that is configured into each of the RPs. (Alternatively, it can be considered to illustrate a dataflow graph fragment configured into virtual machines in the RPs.) Though graph fragments may be illustrated herein in the form of steps in a flow chart, and may be referred to as steps in the discussion below, it will be understood that they actually represent stages in a dataflow pipeline. They are therefore sometimes referred to herein interchangeably as steps or stages. Additionally, as used herein, a pipeline can include one or more "sub-pipelines", which are themselves considered to be pipelines in their own right. Also, a sub-stage of a pipeline stage is also considered itself to constitute a stage of the pipeline.
  • the configuration load controller configures each of the RPs with the pipeline shown in the drawing.
  • the loaded configuration can, in some embodiments, also load an initial set of network parameters into each of the RPs, or configure an address from which they can be drawn.
  • the loaded configuration also can, in some embodiments, load into each RP's local memory all the data samples that the RP will train from, or configure an address from which they can be drawn.
  • step 4412 the RP retrieves the first mini-batch of data samples for use in local training.
  • step 4414 the RP processes all the data samples in the mini-batch through the first configured forward Section of the network, Section S0, to yield outputs from Section S0.
  • the outputs are then passed to step 4416, which processes all the data samples in the mini-batch through the second configured forward Section of the network, Section S1, to yield outputs from Section S1.
  • the outputs are then passed to step 4418, which processes all the data samples in the mini-batch through the third configured forward Section of the network, Section S2, to yield outputs from Section S2.
  • the dataflow calls for the RP to calculate the loss (error) for each of the data samples.
  • the dataflow graph can be configured to calculate the loss for a particular data sample as the sum of the squares of the differences between each predicted output and its respective target value as specified in the data sample.
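  • For concreteness, a hedged sketch of that loss (sum of squared differences between predictions and targets) and of the corresponding output node deltas follows; the function name and the use of NumPy are illustrative:

```python
import numpy as np

# Per-sample loss as the sum of squared differences, plus the output node deltas
# (the derivative of that loss with respect to each predicted output).

def loss_and_deltas(predicted: np.ndarray, target: np.ndarray):
    diff = predicted - target
    loss = float(np.sum(diff ** 2))   # sum of squared differences for one sample
    deltas = 2.0 * diff               # d(loss)/d(prediction): output node deltas
    return loss, deltas

loss, deltas = loss_and_deltas(np.array([0.8, 0.1]), np.array([1.0, 0.0]))
# loss == 0.05 and deltas == [-0.4, 0.2]
```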
  • Step 4420 also passes the output node deltas to backward pass Section S2, step 4422.
  • Step 4422 calculates the loss gradient with respect to each of the learnable parameters in Section S2 of the network, for each of the data samples in the mini -batch.
  • Step 4422 also calculates the gradient of the loss with respect to each of the Section S1 outputs, for each of the data samples (the Section S1 output node deltas).
  • the RP then follows the configured data flow and passes the Section S2 parameter gradients to Gradient SYNC2/AR2 step 4424.
  • the sync orchestration and reduction of the Section S2 gradients are actually performed by the local SmartNIC 4224 rather than the current RPi.
  • this step primarily involves sending the parameter gradients calculated in step 4422 to the local SmartNIC 4226 via P2P messages over the local PCIe bus 3220. Alternatively it can simply involve sending a P2P message via the local PCIe bus to the SmartNIC indicating that the parameter gradients are available, so the SmartNIC can fetch them from the RP.
  • Step 4422 proceeds to pass the calculated Section S1 output node deltas on to BWD Pass Section S1 step 4426 of the pipeline, in parallel with the data collection and reduction taking place on the SmartNIC 4224.
  • Step 4426 uses the Section S1 output node deltas to calculate the parameter gradient with respect to each of the learnable parameters in Section S1 of the network, for each of the data samples in the mini-batch.
  • the RP then passes the Section S1 parameter gradients to Gradient SYNC1/AR1 step 4428 as described above with respect to Gradient SYNC2/AR2 step 4424.
  • BWD Pass Section S1 step 4426 also calculates the gradient of the loss with respect to each of the Section S0 outputs, for each of the data samples (the Section S0 output node deltas) and proceeds to pass them on to BWD Pass Section S0 step 4430 of the pipeline, in parallel with the data collection and reduction taking place on the SmartNIC 4224.
  • BWD Pass Section S0 step 4430 uses the Section S0 output node deltas to calculate the parameter gradients with respect to each of the learnable parameters in Section S0 of the network, for each of the data samples in the mini-batch.
  • the RP then passes the Section S0 parameter gradients to Gradient SYNC0/AR0 step 4432 as described above with respect to Gradient SYNC2/AR2 step 4424.
  • Step 4434 then awaits a message from the local SmartNIC 4226 via P2P messages over the local PCIe bus 3220 to indicate that updated weights calculated on the local SmartNIC have been broadcast to local scratchpad memory on the RP or off-chip memory attached to the RP.
  • the RP determines whether more mini-batches have been assigned to it to process, and if so, the pipeline repeats with step 4412.
  • since the RPs operate on a dataflow model, there may not be any separate steps 4434 or 4436. Instead, since actions in the pipeline are triggered by receipt of data, an implementation may simply end the pipeline after step 4432, and step 4412 at the beginning of the pipeline is triggered to re-start, upon local receipt of updated parameters, if the RP has been assigned more mini-batches to process. Similarly, if the RP does not have any more mini-batches to process, then receipt of the updated parameters triggers step 4438 to report the final parameters to the local host, or to report completion of all assigned mini-batches and epochs to the host.
  • one RP in the system, for example one of the RPs (e.g. RP0 in Figure 42) in one of the processing nodes (e.g. processing node 0 in Figure 42), needs to report the final parameter values to a host.
  • the pipeline of Figure 44 simply ends.
  • Figure 45 illustrates a dataflow graph fragment that is configured into the SmartNICs 4224 of each of the processing nodes 3201 in the system of Figure 42 by the configuration load controller.
  • the SmartNIC accumulates the gradients from all its local RPs, and then the SmartNIC cooperates with the SmartNICs of the other processing nodes in the system, to perform a Ring All-Reduce of the various local sums which results in each SmartNIC having a local copy of a final sum.
  • the final sum represents a total or average gradient for each of the learnable parameters in the network being trained.
  • Each of the SmartNICs then locally updates all the parameters from the average gradients, and broadcasts the updated weights to all its local RPs via its own local PCIe bus.
  • the graph fragment is expressed in the configuration bit file in the form of configuration bits to be written into the FPGA on the SmartNIC, which together cause the building blocks of the FPGA to form a state machine and associated structures to perform the steps shown in the drawing. Though the graph fragment is illustrated in the drawing in the flow chart form, it will be understood that it is implemented in FPGA hardware rather than in instruction code to be executed by a software-based instruction processor. As such, for appropriate operations, it typically can be made to execute much faster.
  • the graph fragment is configured into each of the SmartNICs by the configuration load controller, which also configures into the FPGAs a parameters buffer for all the learnable parameters, and initializes it to the same initial values for such parameters as are configured into the RPs in step 4410 ( Figure 44).
  • the bit file also configures into each of the FPGAs a gradient buffer for all the learnable parameters and initializes the gradient values to zero.
  • Step 4512 is triggered when the Section S2 parameter gradients become available to the SmartNIC in step 4424 ( Figure 44).
  • the Section S2 gradients are accumulated into the SmartNIC's local memory pool 3234 from each of the local RPs, having been sent during the Gradient SYNC2/AR2 step 4424 of the RP's pipeline.
  • step 4514 the FPGA cooperates with the FPGA in each of the other SmartNICs to perform a collective which results in a local copy of a final sum of all the Section S2 gradients in the local memory pool 3234 of each of the SmartNICs. This step is described in more detail below.
  • step 4516 after all-reduce is complete with respect to the Section S2 gradients, the Section S1 gradients are accumulated into the SmartNIC's local memory pool 3234 from each of the local RPs, having been sent during the Gradient SYNC1/AR1 step 4428 of the RP's pipeline.
  • step 4518 the FPGA cooperates with the FPGA in each of the other SmartNICs to perform a collective which results in a local copy of a final sum of all the Section S1 gradients in the local memory pool 3234 of each of the SmartNICs. Again, this step is described in more detail below.
  • step 4520 after all-reduce is complete with respect to the Section S1 gradients, the Section S0 gradients are accumulated into the SmartNIC's local memory pool 3234 from each of the local RPs, having been sent during the Gradient SYNC0/AR0 step 4432 of the RP's pipeline.
  • step 4522 the FPGA cooperates with the FPGA in each of the other SmartNICs to perform a collective which results in a local copy of a final sum of all the Section S0 gradients in the local memory pool 3234 of each of the SmartNICs. Again, this step is described in more detail below.
  • step 4524 the FPGA of the local SmartNIC updates all the parameter values in its local parameter buffer from the gradient buffer. This typically involves a simple multiplication of each of the average gradients by a predetermined learning rate factor. At this point each of the SmartNICs has a complete set of updated parameter values for use in the next epoch of training.
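  • As a minimal sketch of that update (illustrative only; the learning-rate factor and buffer names are assumptions, not taken from the figures):

```python
import numpy as np

# Step 4524 in sketch form: each parameter is moved against its reduced gradient
# by a predetermined learning-rate factor held in the configuration.

def update_parameters(params: np.ndarray, gradient_buffer: np.ndarray,
                      learning_rate: float) -> np.ndarray:
    return params - learning_rate * gradient_buffer

params = np.array([0.5, -0.2])
grads = np.array([0.1, -0.4])                  # final sums from the ring All-Reduce
params = update_parameters(params, grads, learning_rate=0.01)
# params is now [0.499, -0.196]
```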
  • step 4526 the SmartNIC re-initializes its local gradient buffer to zero, and in step 4528 it sends the updated parameter values to all of its local RPs.
  • the FPGA then idles until Section S2 gradients for the next training epoch become available to the SmartNIC in step 4424, and the dataflow graph fragment repeats from step 4512.
  • Steps 4514, 4518 and 4522 in Figure 45 each call for an all-reduce of the gradients from the respective network segment.
  • Any type of collective can be configured into the SmartNICs in various embodiments to accomplish this, but a Ring All-Reduce with a Uni-Directional Ring collective is used in the embodiment described herein. Many other collective methodologies are known and could be used instead. Non-limiting examples include Broadcast, All-Gather, Reduce, All-Reduce, Reduce-Scatter, Scatter, Gather, All-To-All, tree-based, and Neighborhood.
  • NCCL-Woolley.pdf retrieved from https://images.nvidia.com/events/scl5/pdfs/NCCL-Woolley.pdf, visited 4/21/2021, incorporated by reference herein.
  • whichever collective is chosen is implemented using a message passing protocol, orchestrated by a control algorithm configured into an FPGA in one or more of the participating SmartNICs.
  • the dataflow is described in Figure 46 with reference to an index k (e.g. SmartNIC k, gradient buffer segment k), but the index k is used in the drawing only as a convenience to facilitate compactness of the description.
  • the SmartNICs do not need to know their ID number k or perform any calculations based on k in the present embodiment since the compiler performs such calculations in advance when preparing the configuration bit files.
  • bit files configured into the FPGA in each SmartNIC have hardcoded pointers to each particular gradient buffer segment to be read or written or accumulated by that particular SmartNIC in each stage of the pipeline. Additionally, since the compiler was also aware at compile time of which of the other SmartNICs is reachable via each of the two LAN sub-ports of a SmartNIC, both the source and destination SmartNICs for each gradient transmission have also already been preconfigured into the FPGA of each SmartNIC by hardcoded identification of which LAN sub-port to use. In an embodiment, these variations arising from the differing SmartNIC ID numbers k are the only differences between the ring all-reduce dataflows as configured into the SmartNICs.
  • the configuration bit file could configure into the FPGAs the number of processing nodes 3201, the ID number k for each of the participating SmartNICs, and/or an indication of which LAN sub-port is connected to which other one of the SmartNICs.
  • Each gradient buffer segment includes only 1/N of the gradients in the current Section Sj of the network.
  • the FPGAs could be configured to perform a single ring all-reduce process to include all gradients for all the parameters in the network, after all the local sums from all Sections j have been accumulated in all of the local SmartNICs.
  • NIC ID numbers and gradient buffer segments are modulo N (i.e. modulo 4 in this example).
  • the four SmartNICs are referred to herein as NICs 0, 1, 2 and 3, and the gradient buffer segments are referred to as segments 0, 1, 2 and 3.
  • Figure 46 proceeds with a first phase (sometimes referred to herein as an accumulation phase), followed by a second phase (sometimes referred to herein as a distribution phase). Generally the stages of the accumulation phase are shown in Figure 46A, whereas the stages of the distribution phase are shown in Figure 46B.
  • the NIC k sends the initial values in its k+3'th gradient buffer segment to NIC k-1.
  • NICO sends the values from its gradient buffer segment 3 to NIC3
  • NIC1 sends the values from its gradient buffer segment 0 to NICO
  • NIC2 sends the values from its gradient buffer segment 1 to NIC1
  • NIC3 sends the values from its gradient buffer segment 2 to NIC2.
  • This means that NIC k also receives the initial values from the k'th gradient buffer segment in NIC k+1 (stage 4612).
  • NIC k adds the received values to its local gradient buffer for the k'th gradient segment.
  • NIC k's gradient buffer for segment k now includes the sum of the corresponding gradients from NIC k and from NIC k+1.
  • NIC k's gradient buffers for all other segments still contain only the gradients from its own local RPs.
  • the NIC k sends the partial sums now in its k'th gradient buffer segment to NIC k-1.
  • NIC k also receives the partial sums from the k+1'th gradient buffer segment in NIC k+1 (stage 4616).
  • NIC k adds these values to its local gradient buffer for the k+1'th gradient segment.
  • NIC k's gradient buffer for segment k+1 now includes the sum of the corresponding gradients from NIC k, NIC k+1, and NIC k+2.
  • the NIC k sends the partial sums now in its k+1'th gradient buffer segment to NIC k-1.
  • NIC k also receives the partial sums from the k+2'th gradient buffer segment in NIC k+1 (stage 4620).
  • NIC k adds these values to its local gradient buffer for the k+2'th gradient segment.
  • NIC k's gradient buffer for segment k+2 now includes the sum of the corresponding gradients from all four SmartNICs.
  • each NIC k sends the values in its local gradient segment buffer k+2, which is the complete sum of the gradients of that segment, to NIC k-1.
  • NIC k writes this into its local gradient buffer for that segment, over-writing the values previously there.
  • the resulting state is shown in Figure 47F.
  • each NIC k sends the values in its local gradient segment buffer k+3, which is the complete sum of the gradients of that segment, to NIC k-1.
  • NIC k writes this into its local gradient buffer for that segment, over-writing the values previously there.
  • the resulting state is shown in Figure 47G.
  • each NIC k sends the values in its local gradient segment buffer k, which is the complete sum of the gradients of that segment, to NIC k-1.
  • NIC k writes this into its local gradient buffer for that segment, over-writing the values previously there.
  • the resulting state is shown in Figure 47H.
  • Each NIC now has complete sums of all the Section Sj gradients in its own gradient buffers.
  • Each NIC can now update all the parameters in its local parameters buffer, and send the updated parameters to all of its local RPs (stages 4524 and 4528 in Figure 45).
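  • The uni-directional Ring All-Reduce traced through the stages above can be summarized in software form as follows; this is a hypothetical sketch (in the actual system each NIC's segment indices and LAN sub-ports are hardcoded into its FPGA bit file rather than computed from k at runtime):

```python
import numpy as np

# Ring all-reduce over N participants, each holding a gradient buffer split into
# N segments: N-1 accumulation stages followed by N-1 distribution stages, with
# every NIC sending to its ring predecessor and receiving from its ring successor.

def ring_all_reduce(grad_buffers):
    n = len(grad_buffers)                      # number of NICs (= number of segments)
    for p in range(n - 1):                     # accumulation phase
        for k in range(n):
            seg = (k + p) % n                  # segment NIC k accumulates this stage
            grad_buffers[k][seg] += grad_buffers[(k + 1) % n][seg]
    for q in range(n - 1):                     # distribution phase
        for k in range(n):
            seg = (k + n - 1 + q) % n          # complete segment NIC k receives
            grad_buffers[k][seg] = grad_buffers[(k + 1) % n][seg].copy()
    return grad_buffers

# Four NICs, four one-element segments each; afterwards every NIC holds, in every
# segment, the sum of that segment's initial values across all four NICs.
bufs = [[np.array([float(10 * k + s)]) for s in range(4)] for k in range(4)]
bufs = ring_all_reduce(bufs)
for k in range(4):
    for s in range(4):
        assert np.allclose(bufs[k][s], sum(10 * j + s for j in range(4)))
```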
  • the FPGA for NIC k could be configured to perform both steps 4610 and 4612 at the same time, since the gradient buffer segments involved are different, and the LAN sub-ports over which data is sent/received are also different. The same is true for all the other corresponding pairs of stages in Figures 46A and 46B.
  • each k'th SmartNIC sends its initial or aggregated values from its local k+p-1'th gradient buffer segment, to the k-1'th SmartNIC via the Ethernet.
  • the k'th SmartNIC also receives via the Ethernet from the k+1'th SmartNIC, and aggregates in its local gradient buffer segment, values from the k+p'th gradient buffer segment.
  • each k'th SmartNIC sends its aggregated values from its local k+q-2'th gradient buffer segment to the k-1'th SmartNIC via Ethernet.
  • the k'th SmartNIC also receives via Ethernet, values from the k+q-1'th gradient buffer segment from the k+1'th SmartNIC, and writes it to its local gradient buffer segment.
  • the numbering system used in the above description is selected for purposes of this description, and serves only to identify in a compact way, what operations occur and in what sequence; the numbering system itself does not necessarily exist in the configured FPGA.
  • the FPGA is merely configured by the configuration bit file to include functional elements that will, when executed, perform the operations described herein in the sequence described herein.
  • Another description of the algorithm may well use a different numbering system, but still produce an FPGA configured to perform the operations described herein in the sequence described herein.
  • different hardware functional elements can be instantiated into the FPGA to perform the operations described herein in the sequence described herein. In one embodiment, for example, all of the individual operations can be implemented as a linear sequence of stages, with no iterations occurring.
  • applications having different values of N will be implemented in the FPGA with shorter or longer pipelines of stages, depending on N.
  • the configuration bit file may configure just a single stage for performing the operations described above for any provided value of p in the accumulation phase, together with a 'p-counter' which iterates p and provides it to that stage so that the operations are performed in the correct sequence.
  • the configuration bit file may also configure another single stage for performing the operations described above for any provided value of q in the distribution phase, together with a 'q-counter' which iterates q and provides it to that stage so that the operations of the distribution phase are performed in the correct sequence.
  • the pipeline lengths may not change with different values of N; the only difference among implementations for different values of N may be a single register in the FPGA pre-programmed with the value of N, used to initialize the two p- and q-counters.
  • an All-Reduce stage in a data-parallel back-propagation neural network learning algorithm should generate the average gradient calculated by all participating RPs for each learnable parameter.
  • the algorithm described herein generates the sum of such gradients rather than the average. Assuming the mini-batch sizes processed by all RPs are equal, this is not a problem, because the learning rate factor that the SmartNICs apply to each gradient value for updating the parameters in stage 4514 can simply be specified in the configuration bit file to be correspondingly smaller (e.g. divided by the total number of participating RPs).
  • where the mini-batch sizes differ, the averaging performed by the SmartNICs in the present method should be a weighted average in which each RP's calculated gradient value is weighted in proportion to the number of samples that the RP's gradient values represent. This can be accomplished for example by configuring each RP to pre-divide its own calculated gradient values by the number of mini-batches, or the number of data samples, that the values represent, before forwarding them to its respective SmartNIC. These adaptations can be included in the configuration bit file, and do not require any coordination by any host at runtime.
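  • A small numeric sketch of this scaling idea, assuming each RP i holds a mean gradient over n_i local samples; pre-scaling each gradient by n_i divided by the total sample count makes the all-reduce sum equal the sample-weighted average, while with equal mini-batches the same effect is obtained by dividing the learning rate by the number of participating RPs (all names below are illustrative):
      sample_counts = [32, 32, 16, 48]           # n_i for each participating RP
      mean_grads = [0.10, 0.30, 0.20, 0.50]      # g_i, one scalar per RP for brevity
      n_total = sum(sample_counts)

      # option 1: RPs pre-scale, SmartNICs just sum (the all-reduce described above)
      prescaled = [g * n / n_total for g, n in zip(mean_grads, sample_counts)]
      weighted_avg = sum(prescaled)

      # option 2 (equal mini-batches only): sum unscaled gradients, shrink the learning rate
      lr = 0.01
      effective_lr = lr / len(mean_grads)

      reference = sum(g * n for g, n in zip(mean_grads, sample_counts)) / n_total
      assert abs(weighted_avg - reference) < 1e-12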
  • step 4512 involves accumulating the Section S2 gradients from the local RPs into gradient memory of the local SmartNIC.
  • Steps 4516 and 4520 are similar, for the Section S1 and Section S0 gradients, respectively.
  • any type of collective can be configured into the RPs and their local SmartNIC in various embodiments to accomplish this.
  • the choice of collective may depend on the size of the parameters, and can be chosen to optimize parallelization of the reduce operation so it does not become the throughput bottleneck.
  • the collective is triggered by the RP when data is available, and then orchestrated by the SmartNIC.
  • each RP streams its gradients for the parameters in the current Section to the local SmartNIC's local memory pool 3234 via streaming P2P transactions.
  • the RP may issue a remote write via the P2P protocol to a memory-mapped SmartNIC buffer, should a store-and-forward be required.
  • alternatively, the RP can issue a P2P control to the SmartNIC which triggers the SmartNIC to issue a P2P remote read transaction to a buffer in RP device memory. Regardless of the mechanism by which the Section S2 gradients reach the local SmartNIC, they may be received asynchronously, but each one (or chunk) of them is received in association with an identification of the originating RP and the particular parameters included in the chunk.
  • the FPGA adds them up in its local gradient buffer.
  • the FPGA keeps track of which gradients have been received and from which RPs, and moves on to the next step upon receipt of all of the Section S2 gradients from all of the local RPs.
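  • A toy Python model of this receive-and-accumulate bookkeeping, assuming each arriving chunk is tagged with the originating RP, a parameter offset and its values; the class and field names are invented for illustration, and the real bookkeeping would be realized by the configured FPGA logic:
      class SectionAccumulator:
          """Accumulates one Section's gradients from all local RPs, chunk by chunk."""
          def __init__(self, num_params, chunk_len, local_rp_ids):
              self.buffer = [0.0] * num_params                  # local gradient buffer
              self.pending = {rp: set(range(0, num_params, chunk_len))
                              for rp in local_rp_ids}           # offsets still expected

          def on_chunk(self, rp_id, offset, values):
              # add the arriving chunk into the local buffer (order of arrival is irrelevant)
              for i, v in enumerate(values):
                  self.buffer[offset + i] += v
              self.pending[rp_id].discard(offset)

          def complete(self):
              # true once every local RP has delivered every chunk of this Section
              return all(not missing for missing in self.pending.values())

      acc = SectionAccumulator(num_params=8, chunk_len=4, local_rp_ids=[0, 1])
      acc.on_chunk(1, 4, [1.0] * 4)
      acc.on_chunk(0, 0, [0.5] * 4)
      acc.on_chunk(0, 4, [0.5] * 4)
      acc.on_chunk(1, 0, [1.0] * 4)
      assert acc.complete() and acc.buffer == [1.5] * 8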
  • a direct streaming collective to a single local SmartNIC could cause a bottleneck as the number of RPs in the node increases, thereby stifling local scaling.
  • Each of the dedicated SmartNICs has its own FPGA. Only one of such SmartNICs is treated by the configuration bit file as the local master, and that is the one that is configured by the configuration bit file to communicate with the corresponding master SmartNICs in the other nodes in the second round of Ring All-Reduce.
  • the local Ring All-Reduce can take place over the local PCIe bus 3220, in which case none of the dedicated SmartNICs (other than the local master SmartNIC) need include the Ethernet interface 3312. Alternatively, if they do include the Ethernet interface 3312, then the configuration bit file can configure the FPGAs in the dedicated SmartNICs to perform the local Ring All-Reduce using the P2P protocol over a local LAN arranged in a ring topology. See Figure 48, which illustrates an RP processing node 4801 like that of Figure 42, but with local LAN segments 4830 added for the local ring.
  • Figure 48 illustrates an example processing node 4801 which includes a host 4810 and the eight RPs 4812 like RPs 3212 shown in Figure 32, all interconnected by way of a PCIe bus 4820.
  • the SmartNICs in Figure 48 are numbered as "NICk.i", where k is the node number ranging from 0 to N-1, N being the number of participating processing nodes, and where i is the SmartNIC number within the node k.
  • each RP 4812 is paired with its own SmartNIC 4822.
  • Each RP 4812 communicates with its respective SmartNIC via the PCIe bus 4820, though in another embodiment, each RP 4812 has a separate, dedicated PCIe bus (or other peripheral bus), separate from PCIe bus 4820, for communicating with its respective SmartNIC 4822.
  • Each SmartNIC 4822 has one port connected to the PCIe bus (or other bus) via which it communicates with its corresponding RP 4812, and a second port connected to a local LAN 4828.
  • the LAN 4828 in the present embodiment is Ethernet, but in other embodiments it could be other types of LANs such as WiFi or InfiniBand.
  • the SmartNIC labeled NIC0.0 in Figure 48 may be the one configured by the configuration bit file as the local master SmartNIC. It includes the two additional Ethernet sub-ports 4228 for communicating with the local master SmartNICs in the other processing nodes as set forth above with respect to Figure 42.
  • the LAN 4828 (or one segment of the LAN 4828) may include an Ethernet switch (not shown) which includes one or more additional ports for extending the LAN 4828 to other RP processing nodes like 4801.
  • the arrangement of Figure 48 can be configured to communicate among the RPs 4812 via the two disparate communication link types (PCIe and Ethernet) as needed in order to optimize processing.
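  • As a data-only sketch of the Figure 48 topology under the assumptions above (eight RP/SmartNIC pairs per node, a local Ethernet ring, and NICk.0 acting as local master), the node and ring structure might be described as follows; all identifiers are illustrative:
      def build_node(k, rps_per_node=8):
          nics = [f"NIC{k}.{i}" for i in range(rps_per_node)]
          return {
              "host": f"host{k}",
              "pcie_pairs": [(f"RP{k}.{i}", nics[i]) for i in range(rps_per_node)],
              # local ring over LAN segments 4830: NICk.i <-> NICk.(i+1 mod 8)
              "local_ring": [(nics[i], nics[(i + 1) % rps_per_node])
                             for i in range(rps_per_node)],
              "local_master": nics[0],   # NICk.0 carries the two extra sub-ports 4228
          }

      nodes = [build_node(k) for k in range(4)]          # assume four processing nodes
      # inter-node ring between the local master SmartNICs only
      inter_node_ring = [(nodes[k]["local_master"], nodes[(k + 1) % 4]["local_master"])
                         for k in range(4)]
      assert ("NIC0.0", "NIC1.0") in inter_node_ring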
  • data communication over the PCIe buses 3220 involves buffers on both the sending and receiving ends of a data transmission.
  • the PCIe buses 3220 in the systems of Figs. 32 and 42 carry control and data packets according to a peer-to-peer (P2P) messaging protocol.
  • the SmartNICs 3222, too, include logic to participate fully in such conversations.
  • the P2P protocol includes at least two types of transactions: one that transfers the data, and another that notifies the sender that the receiver has processed the data (e.g. by returning a credit).
  • the recipient of data can learn that sufficient data has been received in order to trigger the next step in the recipient's dataflow.
  • the sender of data can learn that all of it has been received by the consumer so the sender can trigger a subsequent pipeline stage which requires the memory that previously contained the data.
  • Such self-determination is a feature that enables the system to minimize runtime control of the dataflow by a host, and thus minimize the extensive control and networking overhead that can be incurred in conventional systems in which an outside entity such as a runtime host has to orchestrate all of the synchronization steps.
  • Such outside control overhead can stifle scaling of an architecture beyond just a few compute nodes.
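  • A toy model of these two P2P transaction types (data transfer plus a returned credit), showing how each side can gate its next dataflow stage on the event it cares about with no host in the loop; class and method names are illustrative only:
      class P2PLink:
          def __init__(self):
              self.consumer_buffer = []        # receive buffer at the consumer
              self.data_complete = False       # consumer side: all data has arrived
              self.credit_returned = False     # producer side: consumer processed it

          def push(self, chunks):              # transaction type 1: transfer the data
              self.consumer_buffer.extend(chunks)
              self.data_complete = True        # last chunk carries a "Last" indication

          def return_credit(self):             # transaction type 2: notify the sender
              self.credit_returned = True

      link = P2PLink()
      link.push([b"grad-chunk-0", b"grad-chunk-1"])       # producer-side transaction
      if link.data_complete:                              # consumer's trigger condition
          link.return_credit()
      producer_may_reuse_buffer = link.credit_returned    # producer's trigger condition
      assert producer_may_reuse_buffer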
  • the reporting of gradients by each RP to the local SmartNIC, in pipeline stages 4424, 4428 and 4432 can use a "push" operation defined in the P2P protocol, originated by the producer RP.
  • the protocol sequence informs the local SmartNIC when data transmission for the current network Section Sj is complete.
  • in response to having received such notifications from all local RPs, the SmartNIC automatically proceeds with a subsequent pipeline stage to perform a collective across the LAN with the other SmartNICs in the data center (stages 4514, 4518 or 4522 in Figure 45). No control by any third party entity is required.
  • the P2P protocol sequence also notifies the producer RP when the consumer SmartNIC has received all the data that the RP transmitted to the SmartNIC for the current network Section Sj.
  • the producer RP's dataflow uses such notification as part of its logic to trigger a subsequent pipeline stage that will re-use its local gradient memory.
  • the RP may be configured to re-use local gradient memory in the BWD pass for the next network Section Sj.
  • the P2P protocol sequence notifies each SmartNIC k when the values from its local gradient buffer segment have been received by SmartNIC k-1 (stages 4610, 4614, 4618, 4630, 4636 and 4640 in Figure 46), and when all the values from SmartNIC k+1 have arrived (stages 4612, 4616, 4620, 4632, 4636 and 4640 in Figure 46); the dataflow configured into the FPGA may need both of these events to occur in order to trigger a subsequent stage of the dataflow pipeline in the FPGA. Still further, each of the SmartNICs can use the same P2P protocol sequence, or a similar broadcast-type P2P protocol sequence, to transmit the updated parameters to all the local RPs over the local PCIe bus in stage 4528 (Figure 45).
  • the RPs may be configured to trigger the next training epoch only upon notification, in accordance with the P2P protocol, that the RP has received all the updated parameters (stage 4434 in Figure 44).
  • each SmartNIC may be configured to include, in its logic for triggering the next iteration of its pipeline stage 4512 (Figure 45), the notification according to the P2P protocol that its transmission of the last set of updated parameters to all of the local RPs has completed.
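  • A compact sketch of this trigger logic: a pipeline stage fires only once every P2P notification it depends on has been observed; the event names below are invented for illustration and would be dataflow tokens in the configured hardware, not strings:
      required = {
          "sent_segment_to_smartnic_k_minus_1",       # our values reached SmartNIC k-1
          "received_segment_from_smartnic_k_plus_1",  # SmartNIC k+1's values have arrived
      }
      observed = set()
      stage_fired = []

      def on_p2p_notification(event):
          observed.add(event)
          if required <= observed:                    # every dependency satisfied
              stage_fired.append("next pipeline stage triggered, no host involvement")

      on_p2p_notification("received_segment_from_smartnic_k_plus_1")
      on_p2p_notification("sent_segment_to_smartnic_k_minus_1")
      assert stage_fired                              # fires only after both notifications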
  • control packets are exchanged between a producer (sender or source) of data packets and a consumer (receiver or destination) of the data to inform the consumer that all data for a particular transmission has been sent.
  • a producer can transmit data to a consumer either by a "push” operation controlled by the producer or by a "pull” operation controlled by the consumer.
  • in a push operation, the sequence of packets according to the P2P protocol might be as follows:
  • in response to a request from the producer's internal dataflow operations, the producer's AGCU sends the consumer a Write Request token, specifying a destination FIFO address at the consumer.
  • the destination address can be within a DRAM at the consumer, which for a SmartNIC, can be in the Local Memory Pool 3234.
  • when the consumer is ready to receive the data, the consumer's AGCU returns a Clear-to-Send (CTS) token to the producer's AGCU.
  • the producer's AGCU then sends the data packets as a series of one or more fixed size chunks, the last of which contains a token indicating "Last".
  • the consumer's AGCU then notifies the consumer's destination dataflow pipeline that the data have been received.
  • the producer's AGCU then notifies the internal dataflow operations that made the request, that the sending process is complete.
  • the producer's AGCU, once triggered, operates independently of the producer's dataflow pipelines.
  • the producer's dataflow pipeline can be configured such that one of the requirements for triggering that stage to proceed is receipt of the completion notification from the producer's AGCU.
  • the consumer dataflow can continue operating in parallel with the receiving of the data, until a pipeline stage is reached which requires the data being received.
  • the consumer's dataflow pipeline can be configured such that one of the requirements for triggering that stage to proceed is receipt of the notification from the consumer's AGCU that the data has arrived.
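  • A sequence-level sketch of the push flow just described, assuming an in-order, lossless link between the producer's and consumer's AGCUs; the message names mirror the tokens above, but the Python structure itself is purely illustrative:
      CHUNK = 64  # fixed chunk size in bytes

      class Consumer:
          def __init__(self):
              self.fifo = bytearray()
          def on_write_request(self, length):        # 1. Write Request token arrives
              self.expected = length
              return "CTS"                           # 2. Clear-to-Send returned
          def on_chunk(self, payload, last):         # 3. fixed-size data chunks
              self.fifo.extend(payload)
              if last:
                  return "DATA_RECEIVED"             # 4. notify destination dataflow

      class Producer:
          def push(self, consumer, data):
              padded = data + bytes(-len(data) % CHUNK)     # pad to a whole chunk
              assert consumer.on_write_request(len(padded)) == "CTS"
              done = None
              for off in range(0, len(padded), CHUNK):
                  last = off + CHUNK >= len(padded)         # last chunk carries "Last"
                  done = consumer.on_chunk(padded[off:off + CHUNK], last)
              assert done == "DATA_RECEIVED"
              return "SEND_COMPLETE"                 # 5. notify producer's own dataflow

      assert Producer().push(Consumer(), b"x" * 150) == "SEND_COMPLETE"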
  • in a pull operation, the sequence of packets according to the P2P protocol might be as follows:
  • in response to a request from the producer's internal dataflow operations, the producer's AGCU sends a P2P token to the consumer notifying it of the availability of data for reading, the address where the data is stored, and the number of chunks to be read.
  • when any required prior processing by the consumer is finished, the consumer's AGCU returns a P2P acknowledgement token and begins reading the data chunks, which arrive in P2P data packets. Alternatively, the consumer's AGCU can issue a remote read command from a memory mapped buffer either in SmartNIC local memory or in remote RP device memory.
  • when the consumer has received all the data, the consumer's AGCU notifies the consumer's destination dataflow pipeline that the data have been received.
  • the consumer's AGCU also returns a Clear-to-Send (CTS) token back to the producer's AGCU.
  • the producer's AGCU interprets the CTS token as a completion message, and notifies the internal dataflow operations that made the request, that the reading process is complete.
  • the data transfer can take place mostly in parallel with continued dataflow operations at the producer and consumer.
  • the producer's dataflow pipeline can be configured such that one of the requirements for triggering a subsequent dataflow pipeline stage is receipt of the completion notification from the producer's AGCU.
  • the consumer's dataflow pipeline can be configured such that one of the requirements for triggering a subsequent dataflow pipeline is receipt of the notification from the consumer's AGCU that the data has arrived.
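  • A similarly compact sketch of the pull flow: the producer only advertises the data, the consumer decides when to read it, and the returned CTS doubles as the producer's completion notification (all names illustrative):
      class PullProducer:
          def __init__(self, data, chunk=64):
              self.chunk = chunk
              self.mem = data + bytes(-len(data) % chunk)   # padded source buffer
              self.complete = False
          def advertise(self):                               # availability token
              return {"addr": 0, "chunks": len(self.mem) // self.chunk}
          def read_chunk(self, addr):                        # consumer-issued remote read
              return self.mem[addr: addr + self.chunk]
          def on_cts(self):                                  # CTS doubles as completion message
              self.complete = True

      class PullConsumer:
          def __init__(self):
              self.buf = bytearray()
          def pull(self, producer):
              adv = producer.advertise()
              for i in range(adv["chunks"]):                 # read when *we* are ready
                  self.buf.extend(producer.read_chunk(adv["addr"] + i * producer.chunk))
              producer.on_cts()                              # notify producer; also
              return "DATA_RECEIVED"                         # trigger local dataflow

      p, c = PullProducer(b"y" * 200), PullConsumer()
      assert c.pull(p) == "DATA_RECEIVED" and p.complete and len(c.buf) == 256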
  • Transmission of data from a producer to consumer also can be requested by the consumer.
  • a P2P protocol sequence similar to the ones above can be used to transfer the data, either by push or pull, and the dataflow pipelines in either or both the producer and consumer can be configured to include a transmit completion signal in the conditions for triggering a subsequent dataflow pipeline stage.
  • other embodiments can provide other mechanisms in the message passing protocol by which producers and consumers can learn that a data transmission operation has completed and sufficient data is available for proceeding with a subsequent dataflow pipeline stage.
  • the originator can specify in the initial packet the length of the upcoming data transfer. Lengths may for example be specified in multiples of a fixed chunk size (such as 64 bytes), and the data packets themselves may be padded if necessary to fill out a full chunk. The consumer learns that it has received all data by monitoring writes to its local Memory Pool and detecting when all chunks have been written.
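  • A short arithmetic sketch of this length-prefixed variant, assuming a 64-byte chunk size and completion detection by counting chunk writes into the local memory pool (names illustrative):
      CHUNK = 64

      def chunks_for(payload_len):
          # length advertised up front, expressed in whole chunks (last chunk padded)
          return (payload_len + CHUNK - 1) // CHUNK

      class ChunkCountingReceiver:
          def __init__(self, advertised_chunks):
              self.remaining = advertised_chunks
          def on_chunk_written(self):            # called per write into the memory pool
              self.remaining -= 1
              return self.remaining == 0         # True means transfer complete, trigger stage

      rx = ChunkCountingReceiver(chunks_for(150))      # 150 bytes -> 3 chunks
      assert [rx.on_chunk_written() for _ in range(3)] == [False, False, True]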
  • a P2P protocol may be used which does not natively include completion notifications, but the compiler may configure into the dataflows equivalent notifications to be sent to the transmission partner after data transmission or reception has completed.
  • the same P2P messaging protocol that is used within a processing node across the PCIe bus is also used across the Ethernet links in the system of Figure 42.
  • the P2P messages are encapsulated in Ethernet Datagrams rather than PCIe data packets.
  • P2P packets sent over the PCIe bus but addressed to RPs in a different processing node 3201 are re-encapsulated by the SmartNIC 3222, sent over the Ethernet to the SmartNIC of the destination processing node, re-encapsulated there, and forwarded on to the destination RP over that SmartNIC's own local PCIe bus.
  • the same P2P protocol is used for messaging through all processing nodes of the data center 4210, though encapsulated differently depending on the lower level protocol of the underlying transmission medium.
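  • A routing-level sketch of the re-encapsulation step, treating the P2P message as an opaque payload whose outer transport wrapper (PCIe transaction vs. Ethernet datagram) changes at each SmartNIC; node and field names are invented for illustration:
      def wrap(transport, p2p_msg):
          return {"transport": transport, "payload": p2p_msg}

      def smartnic_forward(pcie_pkt, local_node, remote_nic_tx, local_pcie_tx):
          msg = pcie_pkt["payload"]                      # extract the P2P message
          if msg["dst_node"] == local_node:
              local_pcie_tx(wrap("pcie", msg))           # stays on the local PCIe bus
          else:
              remote_nic_tx(wrap("ethernet", msg))       # re-encapsulate for the LAN

      sent = []
      msg = {"dst_node": 2, "dst_rp": 5, "body": b"updated-params"}
      smartnic_forward(wrap("pcie", msg), local_node=0,
                       remote_nic_tx=sent.append, local_pcie_tx=sent.append)
      assert sent[0]["transport"] == "ethernet" and sent[0]["payload"]["dst_rp"] == 5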
  • the technology disclosed can be practiced as a system, method, or article of manufacture.
  • One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are hereby taught to be combinable.
  • One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections - these recitations are hereby incorporated forward by reference into each of the following implementations.
  • One or more implementations and aspects of the technology disclosed or elements thereof can be implemented in the form of a computer product, including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and aspects of the technology disclosed or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps.
  • one or more implementations and aspects of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).
  • aspects of Figure 36 can be implemented in these ways, as can operations that are handled by a host or CPU.
  • a given signal, event or value is "responsive" to a predecessor signal, event or value if the predecessor signal, event or value influenced the given signal, event or value. If there is an intervening processing element, step or time period, the given signal, event or value can still be “responsive” to the predecessor signal, event or value. If the intervening processing element or step combines more than one signal, event or value, the signal output of the processing element or step is considered “responsive" to each of the signal, event or value inputs. If the given signal, event or value is the same as the predecessor signal, event or value, this is merely a degenerate case in which the given signal, event or value is still considered to be “responsive” to the predecessor signal, event or value. "Dependency" of a given signal, event or value upon another signal, event or value is defined similarly.
  • the "identification" of an item of information does not necessarily require the direct specification of that item of information.
  • Information can be “identified” in a field by simply referring to the actual information through one or more layers of indirection, or by identifying one or more items of different information which are together sufficient to determine the actual item of information.
  • the term "indicate" is used herein to mean the same as "identify".
  • Clause 1. A method for executing an application on a plurality of processors, comprising: providing a system including a plurality of functional units, and an interconnect fabric by which data can be transmitted from a producing one of the units to a consuming one of the units using a peer-to-peer (P2P) message passing protocol layer, the plurality of functional units including a first reconfigurable unit being configurable at a first level of configurable granularity to implement a dataflow architecture, and a second reconfigurable unit being configurable at a second level of configurable granularity to implement a dataflow architecture, the first level of configurable granularity being different from the second level of configurable granularity; a host system writing an application into the functional units, including configuring the first reconfigurable unit at the first level of configurable granularity with one or more dataflow pipelines implementing a first dataflow segment of the application, and further configuring the second reconfigurable unit at the second level of configurable granularity with one or more dataflow pipelines implementing a second dataflow segment of the application; executing the first dataflow segment on the first reconfigurable unit to produce a first intermediate result; the first reconfigurable unit forwarding the first intermediate result to the second reconfigurable unit via the P2P protocol layer without passing through the host system; and executing the second dataflow segment on the second reconfigurable unit in dependence upon the first intermediate result.
  • the first reconfigurable unit has a CGRA and the second reconfigurable processor comprises a NIC
  • the NIC includes: a first interface through which the NIC communicates on the interconnect fabric according to an interconnect fabric protocol; a second interface through which the NIC communicates on a network connection using a networking protocol different from that used on the interconnect fabric; an FPGA, the configuring of the second reconfigurable unit with the second dataflow segment comprising configuring the FPGA to execute at least part of the second dataflow segment; first translation logic which receives data packets arriving on the first interface and addressed to destinations reachable through the second interface, re-encapsulates them according to the networking protocol and forwards them through the second interface; second translation logic which receives data packets arriving on the second interface and addressed to destinations reachable through the first interface, re-encapsulates them according to the interconnect fabric protocol and forwards them through the first interface; first consumer logic which receives data packets arriving on the first interface and addressed to the FPGA, extracts first messages, and provides the first messages to the FPGA; and first producer logic which forwards onto the first interface messages originated by the FPGA and addressed to a reconfigurable unit reachable through the first interface.
  • the NIC further includes: second consumer logic which receives data packets arriving on the second interface and addressed to the FPGA, extracts second messages, and provides the second messages to the FPGA; and second producer logic which forwards onto the second interface messages originated by the FPGA and addressed to a reconfigurable unit reachable through the second interface.
  • the second dataflow segment includes a first particular pipeline stage to transmit intermediate data to a third one of the reconfigurable units reachable through the second interface, and wherein in the first particular pipeline stage the second producer logic of the second reconfigurable unit forwards the intermediate data to the third reconfigurable unit through the second interface.
  • Clause 6 The method of clause 5, wherein the second dataflow segment further includes a second particular pipeline stage that depends on data from a fourth reconfigurable unit reachable through the second interface, and wherein the second reconfigurable unit proceeds with the second particular pipeline stage only in response to receipt by the second reconfigurable unit of the data from the fourth reconfigurable unit.
  • Clause 7 The method of clause 4, wherein the second dataflow segment includes pipeline stages to write into a storage cluster through the second interface, data dependent upon the first intermediate result received from the first RP.
  • Clause 8 The method of clause 4, wherein the second dataflow segment includes pipeline stages to read data from a storage cluster through the second interface, the second intermediate result being dependent on the read data.
  • Clause 9 The method of clause 4, wherein the second dataflow segment includes pipeline stages to read data from an SQL database through the second interface, the second intermediate result being dependent on the read data.
  • Clause 10 The method of clause 9, wherein the application comprises training neural network parameters of a neural network using training data samples partitioned across a plurality of the RPs (the participating RPs), including the first RP, each of the data samples including a plurality of input values and a set of at least one target output value, and wherein the read data includes at least a subset of the training data samples.
  • Clause 11 The method of clause 9, wherein the application comprises training parameters of a neural network using training data samples partitioned across a plurality of the RPs (the participating RPs), including the first RP, each of the data samples including a plurality of input values and a set of at least one target output value, and wherein the read data includes values for at least a subset of the neural network parameters.
  • Clause 12 The method of clause 4, wherein the second dataflow segment includes pipeline stages to encrypt or decrypt data before transmission through the second interface, and to encrypt or decrypt data arriving through the second interface.
  • Clause 13 The method of clause 4, wherein the second dataflow segment includes pipeline stages to pre-process audio or video stream data arriving through the second interface, and to transmit through the first interface data dependent upon the pre-processed audio or video stream data.
  • Clause 14 The method of clause 1, wherein the first reconfigurable unit has a Coarse-Grained Reconfigurable Architecture (CGRA).
  • Clause 15. The method of clause 14, wherein the second reconfigurable unit comprises a Field-Programmable Gate Array (FPGA).
  • Clause 16 The method of clause 1, wherein the first reconfigurable unit has word-level configurable granularity.
  • Clause 18 The method of clause 1, wherein the first reconfigurable unit has register transfer-level reconfigurability.
  • Clause 20 The method of clause 1, wherein the first reconfigurable unit uses word-wide registers to configure Issue Slots (ISs), Arithmetic Logic Units (ALUs), Functional Units, Processing Elements (PEs), Register Files (RFs), and interconnections, and wherein the second level of configurable granularity uses bit-wise Look-Up Tables (LUTs) to configure switches.
  • Clause 21 The method of clause 20, wherein a number of ISs used by the first reconfigurable unit is fewer than a number of the LUTs used by the second reconfigurable unit.
  • Clause 22 The method of clause 1, wherein a number of bits required to configure the first reconfigurable unit is at least one order of magnitude smaller than a number of bits required to configure the second reconfigurable unit.
  • the first dataflow segment includes a series of pipeline stages, wherein executing the first dataflow segment on the first reconfigurable unit is performed in response to a start command from the host system; and wherein the first reconfigurable unit forwards the first intermediate result as part of a particular one of the pipeline stages and not in response to any control signal from the host system subsequent to the start command from the host system.
  • Clause 25 The method of clause 24, wherein the host system sends the start command to at least one of the functional units in response to completion of the writing of the application into the functional units.
  • Clause 26 The method of clause 25, wherein the start command is sent via the P2P protocol.
  • Clause 27 The method of clause 24, wherein the application configured into the first reconfigurable unit further includes a third dataflow segment, and wherein the first reconfigurable unit executes the third dataflow segment at least partly in parallel with the second reconfigurable unit executing the second dataflow segment.
  • Clause 28 The method of clause 24, wherein the second dataflow segment includes a series of second pipeline stages, and wherein the second reconfigurable unit forwards the second intermediate result as part of a particular one of the second pipeline stages and not in response to any control signal from the host system subsequent to the start command from the host system.
  • Clause 29 The method of clause 1, wherein executing the second dataflow segment on the second reconfigurable unit in dependence upon the first intermediate result, comprises the second reconfigurable unit detecting arrival of the first intermediate result, and executing the second dataflow segment in response to the arrival.
  • Clause 30 The method of clause 1, wherein forwarding the first intermediate result to the second reconfigurable unit comprises the first reconfigurable unit sending a completion message to the second reconfigurable unit indicating completion of transmission of the first intermediate result to the second reconfigurable unit, and wherein executing the second dataflow segment on the second reconfigurable unit in dependence upon the first intermediate result, comprises the second reconfigurable unit detecting arrival of the completion message, and executing the second dataflow segment in response to the arrival.
  • forwarding the first intermediate result to the second reconfigurable unit includes a dataflow pipeline stage to trigger the forwarding of the first intermediate result
  • the application configured into the first reconfigurable unit further includes a third dataflow segment
  • the first reconfigurable unit executes the third dataflow segment in response to completion of the pipeline stage to trigger the forwarding of the first intermediate result.
  • Clause 32 The method of clause 1, wherein the application configured into the first reconfigurable unit further includes a third dataflow segment, and wherein the first reconfigurable unit executes the third dataflow segment upon detection by the first reconfigurable unit that all data in the first intermediate result has been sent to the second reconfigurable unit.
  • Clause 33 The method of clause 1, wherein the application configured into the first reconfigurable unit further includes a third dataflow segment, and wherein the first reconfigurable unit executes the third dataflow segment upon receipt by the first reconfigurable unit of a completion message from the second reconfigurable unit.
  • Clause 34 The method of clause 1, wherein executing the second dataflow segment on the second reconfigurable unit further produces a third intermediate result, the method further comprising: the second reconfigurable unit, in a pipeline stage of the second dataflow segment, forwarding the third intermediate result to an additional one of the reconfigurable units via the P2P protocol layer without passing through the host system; and executing an additional dataflow segment of the application on the additional reconfigurable unit in dependence upon the third intermediate result.
  • Clause 35 The method of clause 1, further comprising the host system writing a second application into the functional units, including reconfiguring the first and second reconfigurable units with respective dataflow segments of the second application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

Roughly described, the invention concerns a system comprising a plurality of functional units that execute different segments of a dataflow and share intermediate results by way of a peer-to-peer messaging protocol. The functional units are reconfigurable, different units being reconfigurable at different levels of granularity. The peer-to-peer messaging protocol includes control tokens or other mechanisms by which the consumer of the intermediate results learns that data have been transferred, and in response triggers its next dataflow segment. A host or configuration controller configures the units with their respective dataflow segments, but once execution of the configured dataflow has begun, no host needs to be involved in orchestrating data synchronization, the transfer of intermediate results, or the triggering of processing after data are received. Control overhead is thereby minimized.
PCT/US2021/063733 2020-12-18 2021-12-16 Délestage de fonction de flux de données vers des processeurs reconfigurables WO2022133047A1 (fr)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US17/127,929 US11182221B1 (en) 2020-12-18 2020-12-18 Inter-node buffer-based streaming for reconfigurable processor-as-a-service (RPaaS)
US17/127,929 2020-12-18
US17/379,921 2021-07-19
US17/379,924 US11237880B1 (en) 2020-12-18 2021-07-19 Dataflow all-reduce for reconfigurable processor systems
US17/379,924 2021-07-19
US17/379,921 US11392740B2 (en) 2020-12-18 2021-07-19 Dataflow function offload to reconfigurable processors

Publications (1)

Publication Number Publication Date
WO2022133047A1 true WO2022133047A1 (fr) 2022-06-23

Family

ID=80112245

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/063733 WO2022133047A1 (fr) 2020-12-18 2021-12-16 Délestage de fonction de flux de données vers des processeurs reconfigurables

Country Status (1)

Country Link
WO (1) WO2022133047A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170220499A1 (en) * 2016-01-04 2017-08-03 Gray Research LLC Massively parallel computer, accelerated computing clusters, and two-dimensional router and interconnection network for field programmable gate arrays, and applications
US20170315815A1 (en) * 2016-04-28 2017-11-02 Microsoft Technology Licensing, Llc Hybrid block-based processor and custom function blocks
US20200301898A1 (en) * 2018-06-25 2020-09-24 BigStream Solutions, Inc. Systems and methods for accelerating data operations by utilizing dataflow subgraph templates
US11182264B1 (en) 2020-12-18 2021-11-23 SambaNova Systems, Inc. Intra-node buffer-based streaming for reconfigurable processor-as-a-service (RPaaS)
US11182221B1 (en) 2020-12-18 2021-11-23 SambaNova Systems, Inc. Inter-node buffer-based streaming for reconfigurable processor-as-a-service (RPaaS)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KOEPLINGER ET AL.: "Spatial: A Language And Compiler For Application Accelerators", PROCEEDINGS OF THE 39TH ACM SIGPLAN CONFERENCE ON PROGRAMMING LANGUAGE DESIGN AND IMPLEMENTATION (PLDI), PROCEEDINGS OF THE 43RD INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 2018
PRABHAKAR ET AL.: "Plasticine: A Reconfigurable Architecture For Parallel Patterns", ISCA ' 17, 24 June 2017 (2017-06-24)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11886931B2 (en) 2020-12-18 2024-01-30 SambaNova Systems, Inc. Inter-node execution of configuration files on reconfigurable processors using network interface controller (NIC) buffers
US11782760B2 (en) 2021-02-25 2023-10-10 SambaNova Systems, Inc. Time-multiplexed use of reconfigurable hardware
US12008417B2 (en) 2021-03-26 2024-06-11 SambaNova Systems, Inc. Interconnect-based resource allocation for reconfigurable processors
CN115344525A (zh) * 2022-08-16 2022-11-15 江南信安(北京)科技有限公司 一种椭圆曲线点加硬件加速方法及装置
CN115344525B (zh) * 2022-08-16 2023-04-18 江南信安(北京)科技有限公司 一种椭圆曲线点加硬件加速方法及装置
EP4394616A1 (fr) * 2022-12-29 2024-07-03 STMicroelectronics S.r.l. Contrôleur d'accélérateur matériel programmable

Similar Documents

Publication Publication Date Title
US11893424B2 (en) Training a neural network using a non-homogenous set of reconfigurable processors
US11847395B2 (en) Executing a neural network graph using a non-homogenous set of reconfigurable processors
US11625283B2 (en) Inter-processor execution of configuration files on reconfigurable processors using smart network interface controller (SmartNIC) buffers
US11182264B1 (en) Intra-node buffer-based streaming for reconfigurable processor-as-a-service (RPaaS)
WO2022133047A1 (fr) Délestage de fonction de flux de données vers des processeurs reconfigurables
US11709664B2 (en) Anti-congestion flow control for reconfigurable processors
US12008417B2 (en) Interconnect-based resource allocation for reconfigurable processors
WO2022173821A1 (fr) Profilage d'instrumentation pour processeurs reconfigurables
TWI784845B (zh) 對可重配置處理器之資料流功能卸載
CN118043796A (zh) 存储器计算系统中的基于片块的结果缓冲
WO2022060929A1 (fr) Logique de temps de compilation pour détecter des motifs d'accès aux données compatibles avec la diffusion en continu et compatibles avec la diffusion
US11983141B2 (en) System for executing an application on heterogeneous reconfigurable processors
CN115705213A (zh) 封装条件分支操作
TWI792773B (zh) 用於可重配置處理器即服務(RPaaS)的節點內基於緩衝器的串流
WO2022133043A1 (fr) Exécution de phase d'exécution de fichiers de configuration sur des processeurs reconfigurables avec une granularité de configuration variable
US20230259477A1 (en) Dynamically-Sized Data Structures on Data Flow Architectures
US20230388373A1 (en) Load Balancing System for the Execution of Applications on Reconfigurable Processors
US20230297527A1 (en) Direct Access to Reconfigurable Processor Memory
US20240020265A1 (en) Operating a Cost Estimation Tool for Placing and Routing an Operation Unit Graph on a Reconfigurable Processor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21844125

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21844125

Country of ref document: EP

Kind code of ref document: A1