US20180196698A1 - Modular offloading for computationally intensive tasks - Google Patents

Modular offloading for computationally intensive tasks

Info

Publication number
US20180196698A1
US20180196698A1 US15/912,307
Authority
US
United States
Prior art keywords
offload
nodes
node
processor
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US15/912,307
Inventor
Hong Beng Mak
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Altera Corp
Original Assignee
Altera Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Altera Corp filed Critical Altera Corp
Priority to US15/912,307
Publication of US20180196698A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/509Offload
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms

Definitions

  • This disclosure relates to integrated circuit devices, such as field programmable gate array (FPGA) devices, and systems and methods for offloading computationally intensive tasks to offload regions on such devices.
  • PLDs Programmable logic devices
  • FPGAs field programmable gate arrays
  • CPLDs complex programmable logic devices
  • FPSCs field programmable system on a chips
  • PLDs generally include programmable logic blocks which may be configured to implement various operations.
  • Some PLDs also include configurable embedded hardware to support additional operations.
  • conventional approaches to configuring such embedded hardware are often cumbersome and unwieldy.
  • the present disclosure relates to a programmable integrated circuit device that includes an offload region with a flexible topology that can be configured at execution time.
  • a method of configuring a programmable integrated circuit device includes identifying, by a processor in a hard processor region of the programmable integrated circuit device, one or more tasks for assigning to an offload region of the programmable integrated circuit device.
  • the processor in the hard processor region transmits an instruction to the offload region, and a plurality of offload nodes in the offload region are configured to perform the one or more tasks.
  • the processor in the hard processor region is a first processor, and both the first processor and a second processor in the offload region are configured to asynchronously access a memory in the hard processor region.
  • configuring the plurality of offload nodes may include configuring one or more data flow paths through at least a subset of the plurality of offload nodes.
  • the one or more tasks include processing security content, and the processing of security content is assigned to the offload region to reduce a likelihood of a hacker attack on the programmable integrated circuit device.
  • the plurality of offload nodes in the offload region may be modified by adding a new offload node to the plurality of offload nodes, removing an offload node from the plurality of offload nodes, or replacing an offload node in the plurality of offload nodes.
  • a programmable integrated circuit device having a hard processor region and an offload region coupled to each other.
  • the hard processor region has a first processor that identifies one or more tasks that are assigned to the offload region and transmits an instruction to the offload region.
  • the offload region includes a plurality of offload nodes that are configured to perform the one or more tasks.
  • the hard processor region comprises a memory
  • the processor in the hard processor region and another processor in the offload region are configured to asynchronously access the memory.
  • the instruction may include how to configure one or more data flow paths through at least a subset of the plurality of offload nodes.
  • the one or more tasks include processing security content, and the processing of security content is assigned to the offload region to reduce a likelihood of a hacker attack on the integrated circuit device.
  • partial reconfiguration of the offload nodes is used, such that the plurality of offload nodes in the offload region are configured to be modified by adding a new offload node to the plurality of offload nodes, removing an offload node from the plurality of offload nodes, or replacing an offload node in the plurality of offload nodes.
  • At least one of the plurality of offload nodes may be implemented as a hard intellectual property block, where the hard intellectual property block includes a layout of reusable hardware having a specified application function.
  • the specified application function may be selected from the group consisting of a cryptographic function, a frequency transform function, a prime factorization function, a compression or decompression function, a mathematical function, a hash function, and an Ethernet function.
  • At least one field programmable gate array is used to implement the offload region.
  • the offload region may further include a second memory that is accessible to each offload node in the offload region, and the second memory in the offload region may be partitioned in accordance with the instruction.
  • FIG. 1 shows a diagram of a system that assigns computationally intensive tasks to be performed at an offload region, in accordance with some embodiments of the present disclosure
  • FIG. 2 shows a diagram of an offload region having a set of offload nodes to perform assigned tasks, in accordance with some embodiments of the present disclosure
  • FIG. 3 shows a diagram of an offload region having a set of offload nodes to perform processing of security content, in accordance with some embodiments of the present disclosure
  • FIG. 4 shows a diagram of an offload region having a set of offload nodes to perform parallel mathematical processing, in accordance with some embodiments of the present disclosure
  • FIG. 5 shows an illustrative flow diagram of a process for configuring an offload region of a programmable integrated circuit device, in accordance with some embodiments of the present disclosure.
  • FIG. 6 is a simplified block diagram of an illustrative system employing a programmable logic device incorporating the present disclosure.
  • General purpose processors are not well suited for special purpose tasks because they commonly use general purpose instruction sets. For example, performing a complicated hash on a large file may take a general purpose processor more than ten seconds. A similar task may be hardware accelerated on a modular offload engine (having a customized instruction set) in an FPGA or ASIC and may take a fraction of a second.
  • the present disclosure describes a heterogeneous many-core FPGA solution that may be used for servers and cloud data centers to dynamically customize applications and usage models, and to provide hardware acceleration for computing processes.
  • the present disclosure allows for the topology and node configuration of a heterogeneous system to be configurable at execution time. Such a configurable system allows for parallel processing and flexible pipe staging of various tasks.
  • FIG. 1 is a block diagram of a system 100 divided into two components represented by a hard processor system (HPS) domain 102 and an FPGA domain 104 .
  • the HPS domain 102 includes a memory 106 , a core CPU 108 , an offload manager 110 , and an offload driver 112 .
  • the FPGA domain 104 includes a processor 114 , a memory 116 , and a cluster of offload nodes 118 . Some tasks that are assigned to be performed by the system 100 may be computationally intensive for the HPS domain 102 to handle by itself. In this case, the HPS domain 102 may offload certain tasks to the cluster of offload nodes 118 in the FPGA fabric.
  • An offload node may include a hard intellectual property (IP) block that is configurable at execution time.
  • the topology of the offload nodes 118 may be configured at execution time.
  • the topology defines the various connections between pairs of the offload nodes 118 and defines the inputs and outputs of the connections.
  • the same set of offload nodes 118 may be able to be configured in multiple ways depending on the desired functionality. In this manner, the set of offload nodes 118 is flexible and can be used in many different situations.
  • the FPGA fabric may be hardware accelerated by using one or more hard IP blocks as the offload nodes 118 .
  • system 100 is shown and described as having an FPGA domain 104 , it should be understood that the system 100 and other systems discussed herein may have other types of integrated circuits (IC) instead of or in addition to one or more FPGAs. It should also be understood that the systems and methods discussed herein as applying to FPGAs may be equally applicable to ICs of other types, such as application-specific integrated circuits (ASICs), application specific standard products (ASSPs), and other programmable logic devices (PLDs).
  • the system 100 may include ASIC and/or off-the-shelf ASSP dies. In some embodiments, a combination of FPGA and ASIC/ASSP may be used, assuming such FPGA and ASIC/ASSP dies have compatible electrical interfaces.
  • One or more components of the FPGA domain 104 may be implemented with hardware IP blocks, which may include a layout of reusable hardware having a specified application function.
  • One or more types of hard IP blocks may be implemented in the system 100 . Examples of these hard IP blocks that may be included in the set of offload nodes 118 are described in detail in relation to FIGS. 2-4 .
  • the FPGA domain 104 may be implemented using a system-on-chip (SoC) FPGA, whose hard IP may include an embedded multicore processor subsystem.
  • the core CPU 108 of the HPS domain 102 includes a number of nodes, each of which may correspond to an instance of an operating system.
  • the processor 114 of the FPGA domain 104 may include any number of embedded processors, such as a NIOS II processor. As shown in FIG. 1 , both the core CPU 108 and the processor 114 access the memory unit 106 . By allowing both processors in the HPS domain 102 and the FPGA domain 104 to access the same shared memory 106 , there is no need to pass physical copies of data stored in the memory 106 between the two domains. Instead, memory pointers may be passed between the two domains, thereby reducing transmission costs associated with passing of data between the HPS domain 102 and the FPGA domain 104 .
  • a mechanism to prevent data contention is used for the CPU 108 and the processor 114 .
  • a mutex locking mechanism may be used such that the CPU 108 and the processor 114 are prohibited from concurrently accessing the memory 106 .
  • the “zero copy” mechanism of the shared memory 106 avoids the need for computing nodes to pass physical copies back and forth in the pipeline, thereby improving overall system performance.
  • the offload manager 110 receives data from the core CPU 108 and handles construction of pipes and redirectors for forming connections between the various nodes in the cluster of offload nodes 118 .
  • the offload manager 110 also ensures that the constructed pipes and redirectors adhere to one or more offload policies.
  • a Fibonacci computing node in the cluster of offload nodes 118 may be configured to have a loopback connection, but other types of nodes may not be configured to handle loopback connections.
  • the offload manager 110 ensures that the connections between nodes in the cluster of offload nodes 118 comply with such policies.
  • the offload manager 110 also constructs the data flow path that defines the manner in which data flows through the offload nodes 118 or a subset thereof.
  • the offload manager 110 uses multi-core APIs to configure the connections for at least a subset of the nodes within the cluster of offload nodes 118 .
  • the offload manager 110 communicates instructions for configuring these connections to the offload driver, which essentially serves as an interface between the offload manager 110 in the HPS domain 102 and the processor 114 in the FPGA domain 104 , and also between the offload manager 110 and the cluster of offload nodes 118 in the FPGA domain.
  • This interface may include OpenCL APIs that pass along tasks to the offload nodes 118 in the FPGA domain 104.
  • the offload manager 110 constructs the path over which data flows through the cluster of offload nodes 118 , while ensuring that the nodes adhere to one or more policies.
  • the offload manager 110 and the offload driver 112 are shown as two separate entities within the HPS domain 102 , but may be included in the same entity without departing from the scope of the present disclosure.
  • the offload driver 112 instructs the processor 114 to load the data for each of the offload nodes 118 that will be used in the desired configuration.
  • the processor 114 may be instructed by the offload driver 112 to load the appropriate data from the shared memory 106 into the memory 116 .
  • the instruction received from the offload driver 112 may refer to pointers to addresses in the shared memory 106 .
  • the processor 114 allows for the appropriate data to be read by the offload nodes 118 .
  • the memory 116 is partitioned and managed by the processor 114 .
  • the processor 114 may partition the memory 116 according to the set of offload nodes 118 or those offload nodes 118 that will be used in a particular configuration.
  • the memory 116 may be partitioned to reserve a dedicated portion for each offload node to use for purposes such as configuration and data exchange.
  • the various offload nodes 118 may be chained and connected to efficiently perform a list of complicated tasks.
  • the flow of the tasks is configurable such that the offload nodes 118 may be piped together at execution time.
  • OpenCL may be used to allow the application layer to use the set of offload nodes 118 in the FPGA fabric for more computing power using task-based parallelism.
  • Example configurations of the offload nodes 118 are described in more detail in relation to FIGS. 2-4 .
  • FIG. 2 is a block diagram of a system 200 having an example configuration of a cluster of offload nodes 218 that may replace the offload nodes 118 in FIG. 1 .
  • a set of eight offload nodes is included in the cluster, and each node performs a specific function.
  • the connections between the various nodes of FIG. 2 are set at execution time by the offload manager 110 and the offload driver 112 .
  • the offload nodes 218 include a crypto node 230 , a Fibonacci node 232 , a Fast Fourier Transform (FFT) node 234 , an Ethernet MAC (EMAC) node 236 , a prime factor node 238 , a zip/unzip node 240 , a math node 242 , and a hash node 244 .
  • the offload manager 110 configures the various connections between the offload nodes 218 in FIG. 2 .
  • the offload manager 110 may set the different types of connections, such as data flow, memory access loop back flow, and back door connect types of connections.
  • the different types of connections may be set up by the offload manager 110 based on the time at which the offload nodes are to be used. For example, the crypto node 230 and the prime factor node 238 may use the back door connections so that the nodes 230 and 238 may communicate with each other for testing and debugging purposes.
  • each offload node 230 - 244 has access to the memory 116 over memory access connections, but not all the offload nodes 218 are used to process data.
  • Data is passed from the offload driver 112 to the crypto node 230 , to the prime factor node 238 , and finally to the hash node 244 .
  • the Fibonacci node 232 has a loop back flow connection, meaning that the Fibonacci node 232 has an input from its own output port.
  • the connections between crypto node 230 and the prime factor node 238 are backdoor connections.
  • the crypto node 230 may perform encryption and/or decryption of the incoming data using a public key and/or a private key.
  • the prime factor node 238 may be configured to generate public/private key pairs for the crypto node 230 to use. In this manner, the back door connections between the crypto node 230 and the prime factor node 238 may be used for testing and debugging.
  • the offload manager 110 keeps track of these different types of connections and which nodes should be connected or piped together in what manner.
  • the offload nodes 218 include multiple instances of an identical computing node.
  • the FFT node 234 may be replicated multiple times so as to provide parallel computing.
  • FIG. 2 provides an exemplary block diagram of the various connections that may be configured between the offload nodes 218 .
  • the connections in FIG. 2 may be dynamically configured and represent a general example of modular offloading for computationally intensive tasks.
  • FIG. 3 is a block diagram of a system 300 having an example configuration of a cluster of offload nodes 318 for processing security content.
  • the offload nodes 318 are the same as the offload nodes 218 shown in FIG. 2 (i.e., including a crypto node 330 , a Fibonacci node 332 , an FFT node 334 , an EMAC node 336 , a prime factor node 338 , a zip/unzip node 340 , a math node 342 , and a hash node 344 ), but the configuration of the offload nodes 318 of FIG. 3 is different from the configuration of the offload nodes 218 of FIG. 2 .
  • the same set of offload nodes may be used for both sets 218 and 318 , but depending on the desired functionality, the same set of offload nodes may be connected in different manners so as to execute different tasks.
  • the data flow is configured differently in FIG. 3: data flows from the offload driver 112 to the crypto node 330, to the zip/unzip node 340, to the hash node 344, to the EMAC node 336, and on to the network 346.
  • One application of the example configuration in FIG. 3 may involve the core CPU 108 assigning the processing of security content to the FPGA domain 104 .
  • the core CPU 108 may assign such a task to the FPGA domain 104 to free up the HPS domain 102 to handle other tasks, such as operating system tasks.
  • a tamper resistance system is used to prevent hacker attacks.
  • the offload nodes 318 are implemented at the level of hardware gates, which is more difficult to attack than the software that may be implemented in the HPS domain 102.
  • a hacker may simply use a powerful debugger to trace or step through the software functions, while attacking the hardware implementation in the FPGA domain 104 is more complex.
  • the crypto node 330 may include a hardware crypto engine that accelerates applications that need cryptographic functions.
  • the hash node 344 computes a hash function on the data.
  • the hash node 344 performs a cipher process, such as the Data Encryption Standard/Advanced Encryption Standard (DES/AES), Kasumi, SNOW 3G, MD5 (e.g., an md5sum), SHA-1 (e.g., a sha1sum), SHA-2, or any other process to calculate and verify a hash of the data.
  • the EMAC node 336 sends data for publishing to a network 346, which may correspond to the World Wide Web, or any other suitable network.
  • FIG. 4 is a block diagram of a system 400 having an example configuration of a cluster of offload nodes 418 for performing mathematical operations.
  • the offload nodes 418 include a prime factor node 450 , a Fibonacci node 452 , an FFT node 454 , an EMAC node 456 , a Heaviside node 458 , a zip/unzip node 460 , a math node 462 , and a hash node 464 .
  • one or more of these nodes may be implemented as a hard IP block that is configurable at execution time.
  • the core CPU 108 determines that certain tasks that are CPU-intensive should be passed over to the FPGA domain 104 .
  • CPU-intensive tasks include, but are not limited to, prime factoring of a large integer, the mathematical arctan and Heaviside step functions, and FFT computations.
  • any task that is computationally expensive or slow for the HPS domain 102 to handle on its own may be passed over to the FPGA domain 104 .
  • there are three parallel data flow paths from the offload driver 112 to the prime factor node 450, the Heaviside node 458, and the math node 462, which may be a hard IP block configured to perform an arctan computation. Data flows out of each of these three nodes 450, 458, and 462 to the FFT node 454.
  • the prime factor node 450, the Heaviside node 458, and the math node 462 may perform prime factor, Heaviside, and arctan computations in parallel because these computations are independent of one another. By allowing for parallel computations, the set of offload nodes 418 saves significant time.
  • partial reconfiguration is used to add, replace, or modify any of the offload nodes 118 , 218 , 318 , or 418 .
  • one or more hard IP blocks may be added to a set of existing offload nodes, or any of the existing hard IP blocks in the offload nodes may be modified or replaced.
  • partial reconfiguration refers to the ability to reconfigure the logic in a region of a chip on the fly. In this way, partial reconfiguration allows the set of offload nodes to be modified without necessarily requiring downtime from other components of the chip. Partial reconfiguration is especially useful if there is a limited amount of FPGA resource space in the FPGA domain 104.
  • one potential disadvantage of using partial reconfiguration to address FPGA resource limitations is that there may be some penalty in the form of a wait time delay if the IP blocks are being reconfigured during run time.
  • FIG. 5 shows an illustrative flow diagram of a process 500 for configuring a set of offload nodes, according to an illustrative embodiment.
  • the process 500 may be performed on an integrated circuit device, such as an FPGA device, an ASIC device, an ASSP device, or a PLD.
  • the process 500 may be used for servers and cloud data centers to dynamically customize applications and usage models, and to provide hardware acceleration for computing processes.
  • the process 500 allows for the topology and node configuration of a heterogeneous system to be configurable at execution time. Such a configurable system allows for parallel processing and flexible pipe staging of various tasks.
  • a processor in a hard processor region of a programmable integrated circuit device identifies one or more tasks for assigning to an offload region of the programmable integrated circuit device.
  • some tasks that are assigned to be performed by a system with a hard processor region may be computationally intensive for the hard processor region to handle by itself.
  • the hard processor region may offload certain tasks to an offload region (e.g., the FPGA domain 104).
  • One example of a task that may be assigned from the hard processor region to the offload region is processing of security content. It may be desirable to offload processing of secure material so as to reduce a likelihood of a hacker attack on the integrated circuit device.
  • the processor in the hard processor region transmits an instruction to the offload region.
  • the instruction may be transmitted from the offload driver 112 to the processor 114 , and may include one or more pointers to memory locations in the shared memory 106 .
  • the processor 114 may be instructed by the offload driver 112 to load the appropriate data from the memory 106 into the memory 116 .
  • the memory 116 is partitioned in accordance with the instruction such that the offload nodes 118 access the desired data.
  • the offload driver 112 may transmit instruction data to the set of offload nodes 118 to configure the offload nodes 118 so that the desired connections are formed.
  • a plurality of offload nodes in the offload region are configured to perform the one or more tasks.
  • Configuring the offload nodes includes configuring the data flow paths through at least a subset of the offload nodes.
  • an offload node may include a hard intellectual property (IP) block that is configurable at execution time.
  • the topology of the offload nodes 118 , 218 , 318 , and 418 may be configured at execution time. The topology defines the various connections between pairs of the offload nodes and defines the inputs and outputs of the connections.
  • the same set of offload nodes may be able to be configured in multiple ways depending on the desired functionality.
  • the offload nodes are implemented as one or more hard IP blocks. These hard IP blocks may include a layout of reusable hardware having a specified application function. As was described in relation to FIGS. 1-4 , examples of such specified application functions include cryptographic functions, frequency transform (FFT) functions, prime factorization functions, compression or decompression functions, mathematical functions, hash functions, and/or Ethernet functions.
  • the processor in the hard processor region and the processor in the offload region are configured to asynchronously access a memory in the hard processor region.
  • the core CPU 108 and the processor 114 both have access to the same shared memory 106 in the HPS domain 102 . Because of this, there is no need to pass physical copies of data stored in the memory 106 between the two domains. Instead, memory pointers may be passed between the two domains, thereby saving on transmission costs.
  • a mutex locking mechanism may be used such that the two processors cannot concurrently access the memory 106. In this manner, the “zero copy” mechanism of the shared memory 106 avoids the need for computing nodes to pass physical copies back and forth in the pipeline, thereby improving overall system performance.
  • FIG. 6 is a simplified block diagram of an illustrative system employing a programmable logic device (PLD) 140 incorporating the present disclosure.
  • A PLD 140 programmed according to the present disclosure may be used in many kinds of electronic devices.
  • One possible use is in a data processing system 1400 shown in FIG. 6.
  • Data processing system 1400 may include one or more of the following components: a processor 1401; memory 1402; I/O circuitry 1403; and peripheral devices 1404. These components are coupled together by a system bus 1405 and are populated on a circuit board 1406 which is contained in an end-user system 1407.
  • PLD 140 can be used in a wide variety of applications, such as computer networking, data networking, instrumentation, video processing, digital signal processing, or any other application where the advantage of using programmable or reprogrammable logic is desirable.
  • PLD 140 can be used to perform a variety of different logic functions.
  • PLD 140 can be configured as a processor or controller that works in cooperation with processor 1401.
  • PLD 140 may also be used as an arbiter for arbitrating access to a shared resource in the system.
  • PLD 140 can be configured as an interface between processor 1401 and one of the other components in the system. It should be noted that the system shown in FIG. 6 is only exemplary, and that the true scope and spirit of the invention should be indicated by the following claims.
  • the systems and methods of the present disclosure provide several benefits compared to existing systems.
  • First, the present disclosure provides effective use of multi-core and many-core processors, which extends the usage of FPGAs in heterogeneous environments, in both personal and cloud computing applications.
  • Second, dynamic runtime configuration of the modular offload nodes described herein allows the main application CPU (i.e., the core CPU 108) to offload its computationally intensive tasks. This provides the flexibility needed to satisfy a wide variety of computing needs.
  • Third, the hardware acceleration of pipelined tasks using the offload nodes significantly improves computational efficiency.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Advance Control (AREA)

Abstract

Systems and methods are provided for configuring a programmable integrated circuit device. A hard processor region of the programmable integrated circuit device includes a processor that identifies one or more tasks for assigning to an offload region of the programmable integrated circuit device. The processor in the hard processor region transmits an instruction to the offload region. A plurality of offload nodes in the offload region are configured to perform the one or more tasks.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This is a continuation of U.S. patent application Ser. No. 14/624,951, entitled “Modular Offloading for Computationally Intensive Tasks,” filed Feb. 18, 2015, the contents of which are incorporated by reference in their entirety for all purposes.
  • FIELD OF THE DISCLOSURE
  • This disclosure relates to integrated circuit devices, such as field programmable gate array (FPGA) devices, and systems and methods for offloading computationally intensive tasks to offload regions on such devices.
  • BACKGROUND OF THE DISCLOSURE
  • Many-core and multi-core devices provide a way to increase performance of a device without incurring the cost of increasing clock speeds. Many-core devices may include dedicated ASIC blocks for hardware specific functions that are often referred to as hardware accelerators. Programmable logic devices (PLDs) (e.g., field programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), field programmable system on a chips (FPSCs), or other types of programmable devices) generally include programmable logic blocks which may be configured to implement various operations. Some PLDs also include configurable embedded hardware to support additional operations. However, conventional approaches to configuring such embedded hardware are often cumbersome and unwieldy.
  • One limitation of existing many-core and multi-core systems is that the topology and node configuration of the system is fixed. In these systems, tasks are run separately, and physical copies of data are passed between computing nodes and applications, which is inefficient. Accordingly, there is a need for an improved approach to configuring hardware resources of a PLD.
  • SUMMARY OF THE DISCLOSURE
  • In light of the above, the present disclosure relates to a programmable integrated circuit device that includes an offload region with a flexible topology that can be configured at execution time.
  • In accordance with embodiments of the present disclosure, there is provided a method of configuring a programmable integrated circuit device. The method includes identifying, by a processor in a hard processor region of the programmable integrated circuit device, one or more tasks for assigning to an offload region of the programmable integrated circuit device. The processor in the hard processor region transmits an instruction to the offload region, and a plurality of offload nodes in the offload region are configured to perform the one or more tasks.
  • In some embodiments, the processor in the hard processor region is a first processor, and both the first processor and a second processor in the offload region are configured to asynchronously access a memory in the hard processor region. Configuring the plurality of offload nodes may include configuring one or more data flow paths through at least a subset of the plurality of offload nodes. In some embodiments, the one or more tasks include processing security content, and the processing of security content is assigned to the offload region to reduce a likelihood of a hacker attack on the programmable integrated circuit device. The plurality of offload nodes in the offload region may be modified by adding a new offload node to the plurality of offload nodes, removing an offload node from the plurality of offload nodes, or replacing an offload node in the plurality of offload nodes.
  • In accordance with embodiments of the present disclosure, there is provided a programmable integrated circuit device having a hard processor region and an offload region coupled to each other. The hard processor region has a first processor that identifies one or more tasks that are assigned to the offload region and transmits an instruction to the offload region. The offload region includes a plurality of offload nodes that are configured to perform the one or more tasks.
  • In some embodiments, the hard processor region comprises a memory, and the processor in the hard processor region and another processor in the offload region are configured to asynchronously access the memory. The instruction may include how to configure one or more data flow paths through at least a subset of the plurality of offload nodes. In an example, the one or more tasks include processing security content, and the processing of security content is assigned to the offload region to reduce a likelihood of a hacker attack on the integrated circuit device.
  • In some embodiments, partial reconfiguration of the offload nodes is used, such that the plurality of offload nodes in the offload region are configured to be modified by adding a new offload node to the plurality of offload nodes, removing an offload node from the plurality of offload nodes, or replacing an offload node in the plurality of offload nodes. At least one of the plurality of offload nodes may be implemented as a hard intellectual property block, where the hard intellectual property block includes a layout of reusable hardware having a specified application function. The specified application function may be selected from the group consisting of a cryptographic function, a frequency transform function, a prime factorization function, a compression or decompression function, a mathematical function, a hash function, and an Ethernet function.
  • In some embodiments, at least one field programmable gate array is used to implement the offload region. The offload region may further include a second memory that is accessible to each offload node in the offload region, and the second memory in the offload region may be partitioned in accordance with the instruction.
  • BRIEF DESCRIPTION OF THE FIGURES
  • Further features of the disclosure, its nature and various advantages will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like referenced characters refer to like parts throughout, and in which:
  • FIG. 1 shows a diagram of a system that assigns computationally intensive tasks to be performed at an offload region, in accordance with some embodiments of the present disclosure;
  • FIG. 2 shows a diagram of an offload region having a set of offload nodes to perform assigned tasks, in accordance with some embodiments of the present disclosure;
  • FIG. 3 shows a diagram of an offload region having a set of offload nodes to perform processing of security content, in accordance with some embodiments of the present disclosure;
  • FIG. 4 shows a diagram of an offload region having a set of offload nodes to perform parallel mathematical processing, in accordance with some embodiments of the present disclosure;
  • FIG. 5 shows an illustrative flow diagram of a process for configuring an offload region of a programmable integrated circuit device, in accordance with some embodiments of the present disclosure; and
  • FIG. 6 is a simplified block diagram of an illustrative system employing a programmable logic device incorporating the present disclosure.
  • DETAILED DESCRIPTION
  • To provide an overall understanding of the invention, certain illustrative embodiments will now be described. However, it will be understood by one of ordinary skill in the art that the systems and methods described herein may be adapted and modified as is appropriate for the application being addressed and that the systems and methods described herein may be employed in other suitable applications, and that such other additions and modifications will not depart from the scope hereof.
  • The figures described herein show illustrative embodiments; however, the figures do not necessarily show, and are not intended to show, the exact layout of the hardware components contained in the embodiments. The figures are provided merely to illustrate the high level conceptual layouts of the embodiments. The embodiments disclosed herein may be implemented with any suitable number of components and any suitable layout of components in accordance with principles known in the art.
  • General purpose processors are not well suited for special purpose tasks because they commonly use general purpose instruction sets. For example, performing a complicated hash on a large file may take a general purpose processor more than ten seconds. A similar task may be hardware accelerated on a modular offload engine (having a customized instruction set) in an FPGA or ASIC and may take a fraction of a second.
  • The present disclosure describes a heterogeneous many-core FPGA solution that may be used for servers and cloud data centers to dynamically customize applications and usage models, and to provide hardware acceleration for computing processes. The present disclosure allows for the topology and node configuration of a heterogeneous system to be configurable at execution time. Such a configurable system allows for parallel processing and flexible pipe staging of various tasks.
  • FIG. 1 is a block diagram of a system 100 divided into two components represented by a hard processor system (HPS) domain 102 and an FPGA domain 104. The HPS domain 102 includes a memory 106, a core CPU 108, an offload manager 110, and an offload driver 112. The FPGA domain 104 includes a processor 114, a memory 116, and a cluster of offload nodes 118. Some tasks that are assigned to be performed by the system 100 may be computationally intensive for the HPS domain 102 to handle by itself. In this case, the HPS domain 102 may offload certain tasks to the cluster of offload nodes 118 in the FPGA fabric.
  • An offload node may include a hard intellectual property (IP) block that is configurable at execution time. In particular, the topology of the offload nodes 118 may be configured at execution time. The topology defines the various connections between pairs of the offload nodes 118 and defines the inputs and outputs of the connections. By allowing the topology to be configured at execution time, the same set of offload nodes 118 may be able to be configured in multiple ways depending on the desired functionality. In this manner, the set of offload nodes 118 is flexible and can be used in many different situations. Moreover, the FPGA fabric may be hardware accelerated by using one or more hard IP blocks as the offload nodes 118.
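  • To make the execution-time topology idea concrete, the following C sketch models offload nodes and the connections between them as plain data. Every type and name here (node_id_t, connection_t, topology_t, and so on) is invented for illustration; the disclosure does not specify concrete structures or a vendor API.

```c
/* Hypothetical data model for an execution-time offload topology.
 * All names below are invented for illustration only. */
#include <stddef.h>

typedef enum {                    /* node functions named in FIGS. 2-4 */
    NODE_CRYPTO, NODE_FIBONACCI, NODE_FFT, NODE_EMAC,
    NODE_PRIME_FACTOR, NODE_ZIP_UNZIP, NODE_MATH, NODE_HASH
} node_id_t;

typedef enum {                    /* connection types described for FIG. 2 */
    CONN_DATA_FLOW,               /* data pipe between two nodes          */
    CONN_MEMORY_ACCESS,           /* a node's access to the memory 116    */
    CONN_LOOP_BACK,               /* node output fed back to its input    */
    CONN_BACK_DOOR                /* side channel for test and debug      */
} conn_type_t;

typedef struct {
    node_id_t   src;              /* output side of the connection */
    node_id_t   dst;              /* input side of the connection  */
    conn_type_t type;
} connection_t;

typedef struct {
    connection_t conns[32];       /* the topology is just a set of links */
    size_t       n_conns;
} topology_t;
```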
  • Although the system 100 is shown and described as having an FPGA domain 104, it should be understood that the system 100 and other systems discussed herein may have other types of integrated circuits (IC) instead of or in addition to one or more FPGAs. It should also be understood that the systems and methods discussed herein as applying to FPGAs may be equally applicable to ICs of other types, such as application-specific integrated circuits (ASICs), application specific standard products (ASSPs), and other programmable logic devices (PLDs). For example, in some embodiments, the system 100 may include ASIC and/or off-the-shelf ASSP dies. In some embodiments, a combination of FPGA and ASIC/ASSP may be used, assuming such FPGA and ASIC/ASSP dies have compatible electrical interfaces.
  • One or more components of the FPGA domain 104 may be implemented with hardware IP blocks, which may include a layout of reusable hardware having a specified application function. One or more types of hard IP blocks may be implemented in the system 100. Examples of these hard IP blocks that may be included in the set of offload nodes 118 are described in detail in relation to FIGS. 2-4. The FPGA domain 104 may be implemented using a system-on-chip (SoC) FPGA, whose hard IP may include an embedded multicore processor subsystem.
  • In some implementations, the core CPU 108 of the HPS domain 102 includes a number of nodes, each of which may correspond to an instance of an operating system. The processor 114 of the FPGA domain 104 may include any number of embedded processors, such as a NIOS II processor. As shown in FIG. 1, both the core CPU 108 and the processor 114 access the memory unit 106. By allowing both processors in the HPS domain 102 and the FPGA domain 104 to access the same shared memory 106, there is no need to pass physical copies of data stored in the memory 106 between the two domains. Instead, memory pointers may be passed between the two domains, thereby reducing transmission costs associated with passing of data between the HPS domain 102 and the FPGA domain 104. In some embodiments, a mechanism to prevent data contention is used for the CPU 108 and the processor 114. For example, a mutex locking mechanism may be used such that the CPU 108 and the processor 114 are prohibited from concurrently accessing the memory 106. In this manner, the “zero copy” mechanism of the shared memory 106 avoids the need for computing nodes to pass physical copies back and forth in the pipeline, thereby improving overall system performance.
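  • A minimal software sketch of the zero-copy hand-off described above, under the assumption that a lock serializes all access: both sides share one buffer and exchange an offset/length reference instead of the payload. A pthread mutex stands in for whatever hardware mutex mechanism a real HPS/FPGA pair would use.

```c
/* Zero-copy sketch: the HPS side and the FPGA side exchange a reference
 * into shared memory rather than copying the payload. Illustrative only. */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t shared_mem[4096];              /* stands in for memory 106 */
static pthread_mutex_t mem_lock = PTHREAD_MUTEX_INITIALIZER;

/* The "instruction" passed between domains carries an offset (in effect
 * a pointer) rather than a copy of the data itself. */
typedef struct {
    size_t offset;                /* where the task's data lives */
    size_t length;
} task_ref_t;

static task_ref_t hps_submit(const void *data, size_t len)
{
    pthread_mutex_lock(&mem_lock);            /* no concurrent access */
    memcpy(shared_mem, data, len);            /* single producer-side write */
    pthread_mutex_unlock(&mem_lock);
    return (task_ref_t){ .offset = 0, .length = len };
}

static void fpga_consume(task_ref_t ref, uint8_t *out)
{
    pthread_mutex_lock(&mem_lock);
    memcpy(out, shared_mem + ref.offset, ref.length);  /* read in place */
    pthread_mutex_unlock(&mem_lock);
}
```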
  • The offload manager 110 receives data from the core CPU 108 and handles construction of pipes and redirectors for forming connections between the various nodes in the cluster of offload nodes 118. The offload manager 110 also ensures that the constructed pipes and redirectors adhere to one or more offload policies. In an example, a Fibonacci computing node in the cluster of offload nodes 118 may be configured to have a loopback connection, but other types of nodes may not be configured to handle loopback connections. By ensuring that appropriate nodes, such as Fibonacci nodes, have the proper types of connections, the offload manager 110 ensures that the connections between nodes in the cluster of offload nodes 118 comply with such policies. The offload manager 110 also constructs the data flow path that defines the manner in which data flows through the offload nodes 118 or a subset thereof.
  • In some embodiments, the offload manager 110 uses multi-core APIs to configure the connections for at least a subset of the nodes within the cluster of offload nodes 118. The offload manager 110 communicates instructions for configuring these connections to the offload driver 112, which essentially serves as an interface between the offload manager 110 in the HPS domain 102 and the processor 114 in the FPGA domain 104, and also between the offload manager 110 and the cluster of offload nodes 118 in the FPGA domain. This interface may include OpenCL APIs that pass along tasks to the offload nodes 118 in the FPGA domain 104. In this manner, the offload manager 110 constructs the path over which data flows through the cluster of offload nodes 118, while ensuring that the nodes adhere to one or more policies. The offload manager 110 and the offload driver 112 are shown in FIG. 1 as two separate entities within the HPS domain 102, but may be included in the same entity without departing from the scope of the present disclosure.
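  • Building on the invented topology types above, a policy check of the kind attributed to the offload manager 110 might look like the following sketch, which permits a loop back flow connection only on the Fibonacci node, mirroring the example policy in the text.

```c
/* Hypothetical policy enforcement in the spirit of the offload manager;
 * builds on the invented connection_t / topology_t types above. */
#include <stdbool.h>
#include <stddef.h>

static bool connection_allowed(const connection_t *c)
{
    if (c->type == CONN_LOOP_BACK)            /* only Fibonacci may loop back */
        return c->src == NODE_FIBONACCI && c->dst == NODE_FIBONACCI;
    return true;                              /* other types pass here */
}

static bool topology_complies(const topology_t *t)
{
    for (size_t i = 0; i < t->n_conns; i++)
        if (!connection_allowed(&t->conns[i]))
            return false;                     /* reject the configuration */
    return true;
}
```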
  • The offload driver 112 instructs the processor 114 to load the data for each of the offload nodes 118 that will be used in the desired configuration. In particular, the processor 114 may be instructed by the offload driver 112 to load the appropriate data from the shared memory 106 into the memory 116. The instruction received from the offload driver 112 may refer to pointers to addresses in the shared memory 106. By loading data from the shared memory 106 into the memory 116, the processor 114 allows for the appropriate data to be read by the offload nodes 118.
  • In some implementations, the memory 116 is partitioned and managed by the processor 114. For example, the processor 114 may partition the memory 116 according to the set of offload nodes 118 or those offload nodes 118 that will be used in a particular configuration. The memory 116 may be partitioned to reserve a dedicated portion for each offload node to use for purposes such as configuration and data exchange.
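  • The following sketch shows one deliberately simple way the processor 114 could carve the memory 116 into per-node regions. The equal-slice scheme, sizes, and names are all assumptions made for illustration.

```c
/* Sketch of partitioning the FPGA-side memory 116 so that each offload
 * node gets a dedicated region for configuration and data exchange. */
#include <stddef.h>

#define NODE_COUNT  8
#define MEM116_SIZE (64 * 1024)   /* assumed size, for illustration */

typedef struct {
    size_t base;                  /* offset of this node's region */
    size_t size;
} mem_region_t;

static void partition_memory(mem_region_t regions[NODE_COUNT])
{
    size_t slice = MEM116_SIZE / NODE_COUNT;  /* equal slices for simplicity */
    for (size_t i = 0; i < NODE_COUNT; i++) {
        regions[i].base = i * slice;
        regions[i].size = slice;
    }
}
```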
  • The various offload nodes 118 may be chained and connected to efficiently perform a list of complicated tasks. The flow of the tasks is configurable such that the offload nodes 118 may be piped together at execution time. In an example, OpenCL may be used to allow the application layer to use the set of offload nodes 118 in the FPGA fabric for more computing power using task-based parallelism. Example configurations of the offload nodes 118 are described in more detail in relation to FIGS. 2-4.
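  • As a software analogy for piping nodes together at execution time, a pipe can be modeled as an ordered list of stages applied to a buffer in turn. This is only an analogy for the hardware pipe staging, not how the FPGA fabric itself is programmed.

```c
/* Software analogy for execution-time pipe staging: a "pipe" is an
 * ordered list of stages, each transforming the buffer in place. */
#include <stddef.h>

typedef void (*stage_fn)(void *buf, size_t len);

static void run_pipeline(const stage_fn *stages, size_t n_stages,
                         void *buf, size_t len)
{
    for (size_t i = 0; i < n_stages; i++)
        stages[i](buf, len);      /* each stage feeds the next */
}
```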
  • FIG. 2 is a block diagram of a system 200 having an example configuration of a cluster of offload nodes 218 that may replace the offload nodes 118 in FIG. 1. As is shown in FIG. 2, a set of eight offload nodes is included in the cluster, and each node performs a specific function. As was described above, the connections between the various nodes of FIG. 2 are set at execution time by the offload manager 110 and the offload driver 112.
  • As is shown in FIG. 2, the offload nodes 218 include a crypto node 230, a Fibonacci node 232, a Fast Fourier Transform (FFT) node 234, an Ethernet MAC (EMAC) node 236, a prime factor node 238, a zip/unzip node 240, a math node 242, and a hash node 244. As was described in relation to FIG. 1, the offload manager 110 configures the various connections between the offload nodes 218 in FIG. 2. In particular, the offload manager 110 may set the different types of connections, such as data flow, memory access loop back flow, and back door connect types of connections. The different types of connections may be set up by the offload manager 110 based on the time at which the offload nodes are to be used. For example, the crypto node 230 and the prime factor node 238 may use the back door connections so that the nodes 230 and 238 may communicate with each other for testing and debugging purposes.
  • In FIG. 2, each offload node 230-244 has access to the memory 116 over memory access connections, but not all the offload nodes 218 are used to process data. Data is passed from the offload driver 112 to the crypto node 230, to the prime factor node 238, and finally to the hash node 244. The Fibonacci node 232 has a loop back flow connection, meaning that the Fibonacci node 232 has an input from its own output port. Moreover, the connections between crypto node 230 and the prime factor node 238 are backdoor connections. The crypto node 230 may perform encryption and/or decryption of the incoming data using a public key and/or a private key. The prime factor node 238 may be configured to generate public/private key pairs for the crypto node 230 to use. In this manner, the back door connections between the crypto node 230 and the prime factor node 238 may be used for testing and debugging. The offload manager 110 keeps track of these different types of connections and which nodes should be connected or piped together in what manner.
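  • Written down with the invented topology structs from the FIG. 1 discussion, the FIG. 2 wiring could be captured as data like this; the encoding is an assumption, but the connections match the figure as described.

```c
/* The FIG. 2 configuration as data: crypto -> prime factor -> hash for
 * the data flow, a loopback on the Fibonacci node, and a back door
 * between the crypto and prime factor nodes (invented types above). */
static const topology_t fig2_topology = {
    .conns = {
        { NODE_CRYPTO,       NODE_PRIME_FACTOR, CONN_DATA_FLOW },
        { NODE_PRIME_FACTOR, NODE_HASH,         CONN_DATA_FLOW },
        { NODE_FIBONACCI,    NODE_FIBONACCI,    CONN_LOOP_BACK },
        { NODE_CRYPTO,       NODE_PRIME_FACTOR, CONN_BACK_DOOR },
    },
    .n_conns = 4,
};
```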
  • In some implementations the offload nodes 218 include multiple instances of an identical computing node. For example, the FFT node 234 may be replicated multiple times so as to provide parallel computing. FIG. 2 provides an exemplary block diagram of the various connections that may be configured between the offload nodes 218. The connections in FIG. 2 may be dynamically configured and represent a general example of modular offloading for computationally intensive tasks.
  • FIG. 3 is a block diagram of a system 300 having an example configuration of a cluster of offload nodes 318 for processing security content. The offload nodes 318 are the same as the offload nodes 218 shown in FIG. 2 (i.e., including a crypto node 330, a Fibonacci node 332, an FFT node 334, an EMAC node 336, a prime factor node 338, a zip/unzip node 340, a math node 342, and a hash node 344), but the configuration of the offload nodes 318 of FIG. 3 is different from the configuration of the offload nodes 218 of FIG. 2. In particular, the same set of offload nodes may be used for both sets 218 and 318, but depending on the desired functionality, the same set of offload nodes may be connected in different manners so as to execute different tasks. In FIG. 2, data flows from the offload driver 112 to the crypto node 230, to the prime factor node 238, and then to the hash node 244. In contrast, the data flow is configured differently in FIG. 3: data flows from the offload driver 112 to the crypto node 330, to the zip/unzip node 340, to the hash node 344, to the EMAC node 336, and on to the network 346.
  • One application of the example configuration in FIG. 3 may involve the core CPU 108 assigning the processing of security content to the FPGA domain 104. The core CPU 108 may assign such a task to the FPGA domain 104 to free up the HPS domain 102 to handle other tasks, such as operating system tasks. In particular, it may be desirable to offload the processing of security content to the FPGA domain 104 to prevent hacker attacks. Using the offload nodes 318 to process the security content provides a tamper-resistant system that helps prevent hacker attacks. In particular, in the FPGA domain 104, the offload nodes 318 are implemented at the level of hardware gates, which is more difficult to attack than the software that may be implemented in the HPS domain 102. To attack the software, a hacker may simply use a powerful debugger to trace or step through the software functions, while attacking the hardware implementation in the FPGA domain 104 is more complex.
  • Any or all of the offload nodes 318 of FIG. 3 may be implemented as hard IP blocks. In the example shown in FIG. 3, the crypto node 330 may include a hardware crypto engine that accelerates applications that need cryptographic functions. After the data has been compressed or uncompressed by the zip/unzip node 340, the hash node 344 computes a hash function on the data. In some implementations, the hash node 344 performs a cipher process, such as the Data Encryption Standard/Advanced Encryption Standard (DES/AES), Kasumi, SNOW 3G, MD5 (e.g., an md5sum), SHA-1 (e.g., a sha1sum), SHA-2, or any other process to calculate and verify a hash of the data. Then, the EMAC node 336 sends data for publishing to a network 346, which may correspond to the World Wide Web, or any other suitable network.
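  • The FIG. 3 flow maps naturally onto the pipeline analogy sketched earlier. In the toy version below, the crypto and unzip stages are empty placeholders and the “hash” is a trivial checksum rather than one of the real ciphers listed above; it only shows the crypto → zip/unzip → hash ordering.

```c
/* Toy stand-ins for the FIG. 3 stages, wired with the run_pipeline()
 * sketch above. The checksum is a placeholder, not a real digest. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void stage_crypto(void *buf, size_t len) { (void)buf; (void)len; /* decrypt in place */ }
static void stage_unzip(void *buf, size_t len)  { (void)buf; (void)len; /* decompress in place */ }

static void stage_hash(void *buf, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++)
        h = h * 31u + ((const uint8_t *)buf)[i];
    printf("digest: %08lx\n", (unsigned long)h);  /* then on to the EMAC node */
}

int main(void)
{
    char data[] = "security content";
    stage_fn fig3[] = { stage_crypto, stage_unzip, stage_hash };
    run_pipeline(fig3, 3, data, strlen(data));
    return 0;
}
```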
  • FIG. 4 is a block diagram of a system 400 having an example configuration of a cluster of offload nodes 418 for performing mathematical operations. The offload nodes 418 include a prime factor node 450, a Fibonacci node 452, an FFT node 454, an EMAC node 456, a Heaviside node 458, a zip/unzip node 460, a math node 462, and a hash node 464. As described above, one or more of these nodes may be implemented as a hard IP block that is configurable at execution time.
  • For example, the core CPU 108 determines that certain tasks that are CPU-intensive should be passed over to the FPGA domain 104. Examples of CPU-intensive tasks include, but are not limited to, prime factoring of a large integer, the mathematical arctan and Heaviside step functions, and FFT computations. In general, any task that is computationally expensive or slow for the HPS domain 102 to handle on its own may be passed over to the FPGA domain 104.
  • In the example shown in FIG. 4, there are three parallel data flow paths from the offload driver 112 to the prime factor node 450, the Heaviside node 458, and the math node 462, which may be a hard IP block configured to perform an arctan computation. Data flows out of each of these three nodes 450, 458, and 462 to the FFT node 454. As is shown in FIG. 4, the prime factor node 450, the Heaviside node 458, and the math node 462 may perform prime factor, Heaviside, and arctan computations in parallel because these computations are independent of one another. By allowing for parallel computations, the set of offload nodes 418 saves significant time.
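  • Because the three computations are independent, they can be expressed as concurrent tasks. The sketch below uses pthreads as a software stand-in for the three hard IP blocks running in parallel before their results fan in to the FFT stage; the input values are arbitrary and purely illustrative.

```c
/* Software analogy for the FIG. 4 fan-out/fan-in: three independent
 * computations run concurrently, then their results feed the FFT stage. */
#include <math.h>
#include <pthread.h>
#include <stdio.h>

static double in_val = 0.75, r_heaviside, r_arctan;
static unsigned long n_val = 1000003UL, r_factor;

static void *heaviside(void *arg)    { (void)arg; r_heaviside = (in_val >= 0.0) ? 1.0 : 0.0; return NULL; }
static void *arctan_task(void *arg)  { (void)arg; r_arctan = atan(in_val); return NULL; }
static void *prime_factor(void *arg) { (void)arg; unsigned long d = 2; while (n_val % d) d++; r_factor = d; return NULL; }

int main(void)
{
    pthread_t t[3];
    pthread_create(&t[0], NULL, heaviside, NULL);
    pthread_create(&t[1], NULL, arctan_task, NULL);
    pthread_create(&t[2], NULL, prime_factor, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);   /* fan-in: results now go to the FFT node */
    printf("H=%g atan=%g smallest factor=%lu\n", r_heaviside, r_arctan, r_factor);
    return 0;
}
```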
  • In some implementations, partial reconfiguration is used to add, replace, or modify any of the offload nodes 118, 218, 318, or 418. In particular, one or more hard IP blocks may be added to a set of existing offload nodes, or any of the existing hard IP blocks in the offload nodes may be modified or replaced. As used herein, partial reconfiguration refers to the ability to reconfigure the logic in a region of a chip on the fly. In this way, partial reconfiguration allows the set of offload nodes to be modified without necessarily requiring downtime from other components of the chip. Partial reconfiguration is especially useful if there is a limited amount of FPGA resource space in the FPGA domain 104. However, one potential disadvantage of using partial reconfiguration to address FPGA resource limitations is that there may be some penalty in the form of a wait time delay if the IP blocks are being reconfigured during run time.
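  • A purely hypothetical control flow for swapping one node via partial reconfiguration, included only to show where the run-time wait penalty is paid. pr_load_bitstream() is an invented stub, not a real vendor driver call; actual flows go through vendor-specific PR controllers and drivers.

```c
/* Invented placeholder for a partial-reconfiguration flow; the stub only
 * marks where the wait-time penalty mentioned above appears. */
#include <stdbool.h>
#include <stddef.h>

static bool pr_load_bitstream(int region, const void *bits, size_t len)
{
    (void)region; (void)bits; (void)len;
    return true;                  /* pretend the region was reprogrammed */
}

static bool replace_offload_node(int region, const void *bits, size_t len)
{
    /* Tasks routed through this region must stall until the new hard IP
     * block is loaded -- the run-time reconfiguration penalty. */
    return pr_load_bitstream(region, bits, len);
}
```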
  • FIG. 5 shows an illustrative flow diagram of a process 500 for configuring a set of offload nodes, according to an illustrative embodiment. The process 500 may be performed on an integrated circuit device, such as an FPGA device, an ASIC device, an ASSP device, or a PLD. The process 500 may be used for servers and cloud data centers to dynamically customize applications and usage models, and to provide hardware acceleration for computing processes. The process 500 allows for the topology and node configuration of a heterogeneous system to be configurable at execution time. Such a configurable system allows for parallel processing and flexible pipe staging of various tasks.
  • At 502, a processor in a hard processor region of a programmable integrated circuit device identifies one or more tasks for assigning to an offload region of the programmable integrated circuit device. In particular, some tasks that are assigned to be performed by a system with a hard processor region (e.g., the HPS domain 102) may be computationally intensive for the hard processor region to handle by itself. In this case, the hard processor region may offload certain tasks to an offload region (e.g., the FPGA domain 104).
  • One example of a task that may be assigned from the hard processor region to the offload region is processing of security content. It may be desirable to offload processing of secure material so as to reduce a likelihood of a hacker attack on the integrated circuit device.
  • At 504, the processor in the hard processor region transmits an instruction to the offload region. As was described in relation to FIG. 1, the instruction may be transmitted from the offload driver 112 to the processor 114, and may include one or more pointers to memory locations in the shared memory 106. In particular, the processor 114 may be instructed by the offload driver 112 to load the appropriate data from the memory 106 into the memory 116. In some implementations, the memory 116 is partitioned in accordance with the instruction such that the offload nodes 118 access the desired data. Moreover, the offload driver 112 may transmit instruction data to the set of offload nodes 118 to configure the offload nodes 118 so that the desired connections are formed.
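One plausible shape for such an instruction is a small fixed-layout record that carries offsets into the shared memory rather than the payload itself. The C struct below is an illustrative assumption; neither the field names nor the encoding come from this disclosure.

```c
#include <stdint.h>

/* Hypothetical layout for the instruction passed from the offload driver 112
 * to the processor 114. It carries offsets into the shared memory 106 rather
 * than the data itself; every field name here is an illustrative assumption. */
struct offload_instruction {
    uint32_t task_id;      /* identifies the offloaded task */
    uint32_t node_mask;    /* which offload nodes participate */
    uint64_t src_offset;   /* input: offset into the shared memory 106 */
    uint64_t src_len;      /* input length in bytes */
    uint64_t dst_offset;   /* output: where results are written */
    uint64_t dst_len;      /* capacity reserved for results */
};
```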
  • At 506, a plurality of offload nodes in the offload region are configured to perform the one or more tasks. Configuring the offload nodes includes configuring the data flow paths through at least a subset of the offload nodes. As used herein, an offload node may include a hard intellectual property (IP) block that is configurable at execution time. As was described in relation to FIGS. 1-4, the topology of the offload nodes 118, 218, 318, and 418 may be configured at execution time. The topology defines the various connections between pairs of the offload nodes and defines the inputs and outputs of those connections. By allowing the topology to be configured at execution time, the same set of offload nodes may be configured in multiple ways depending on the desired functionality. In some implementations, the offload nodes are implemented as one or more hard IP blocks. These hard IP blocks may include a layout of reusable hardware having a specified application function. As was described in relation to FIGS. 1-4, examples of such specified application functions include cryptographic functions, fast Fourier transform (FFT) functions, prime factorization functions, compression or decompression functions, mathematical functions, hash functions, and/or Ethernet functions.
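A topology of this kind can be expressed compactly as an edge list naming producer/consumer pairs of nodes, as in the C sketch below. The node identifiers and the edge-list encoding are illustrative assumptions; the chain shown is one possible wiring of the FIG. 3 nodes.

```c
#include <stdint.h>

/* Illustrative node identifiers; the actual on-chip configuration format
 * is not specified in this disclosure. */
enum node_id { NODE_CRYPTO, NODE_ZIP, NODE_HASH, NODE_EMAC };

struct edge { uint8_t from; uint8_t to; };

/* One possible chaining of the FIG. 3 nodes: crypto -> zip -> hash -> EMAC.
 * Reconfiguring the topology at execution time then amounts to writing a
 * new edge list over the same fixed set of nodes. */
static const struct edge pipeline[] = {
    { NODE_CRYPTO, NODE_ZIP  },
    { NODE_ZIP,    NODE_HASH },
    { NODE_HASH,   NODE_EMAC },
};
```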
  • In some implementations, the processor in the hard processor region and the processor in the offload region are configured to asynchronously access a memory in the hard processor region. As was described in relation to FIG. 1, the core CPU 108 and the processor 114 both have access to the same shared memory 106 in the HPS domain 102. Because of this, there is no need to pass physical copies of data stored in the memory 106 between the two domains. Instead, memory pointers may be passed between the two domains, thereby saving on transmission costs. In some implementations, a mutex locking mechanism may be used so that the two processors cannot concurrently access the memory 106. In this manner, the “zero copy” mechanism of the shared memory 106 avoids the need for computing nodes to pass physical copies back and forth in the pipeline, thereby improving overall system performance.
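The mutex-plus-pointers discipline can be illustrated with POSIX threads: both sides share a single buffer, and only (offset, length) pairs cross between producer and consumer. This is a minimal sketch of the zero-copy idea, not the actual arbitration used between the two hardware domains.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* A mutex serializes access so the two sides never touch the shared region
 * concurrently; the buffer stands in for the shared memory 106. */
static pthread_mutex_t mem_lock = PTHREAD_MUTEX_INITIALIZER;
static uint8_t shared_mem[4096];

/* Producer side (e.g., the offload driver 112) fills a region in place. */
static void produce(size_t off, const uint8_t *data, size_t len) {
    pthread_mutex_lock(&mem_lock);
    for (size_t i = 0; i < len; i++)
        shared_mem[off + i] = data[i];
    pthread_mutex_unlock(&mem_lock);
}

/* Consumer side (e.g., the processor 114) reads the same bytes in place,
 * given only the offset and length -- no copy crosses the domains. */
static uint64_t consume(size_t off, size_t len) {
    uint64_t sum = 0;
    pthread_mutex_lock(&mem_lock);
    for (size_t i = 0; i < len; i++)
        sum += shared_mem[off + i];
    pthread_mutex_unlock(&mem_lock);
    return sum;
}

int main(void) {
    const uint8_t msg[] = { 1, 2, 3 };
    produce(64, msg, sizeof(msg));
    return (int)consume(64, sizeof(msg)); /* exit status 6 */
}
```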
  • FIG. 6 is a simplified block diagram of an illustrative system employing a programmable logic device (PLD) 140 incorporating the present disclosure. A PLD 140 programmed according to the present disclosure may be used in many kinds of electronic devices. One possible use is in a data processing system 1400 shown in FIG. 6. Data processing system 1400 may include one or more of the following components: a processor 1401; memory 1402; I/O circuitry 1403; and peripheral devices 1404. These components are coupled together by a system bus 1405 and are populated on a circuit board 1406 which is contained in an end-user system 1407.
  • PLD 140 can be used in a wide variety of applications, such as computer networking, data networking, instrumentation, video processing, digital signal processing, or any other application where the advantage of using programmable or reprogrammable logic is desirable. PLD 140 can be used to perform a variety of different logic functions. For example, PLD 140 can be configured as a processor or controller that works in cooperation with processor 1401. PLD 140 may also be used as an arbiter for arbitrating access to a shared resource in the system. In yet another example, PLD 140 can be configured as an interface between processor 1401 and one of the other components in the system. It should be noted that the system shown in FIG. 6 is only exemplary, and that the true scope and spirit of the invention should be indicated by the following claims.
  • Various technologies can be used to implement PLDs 140 as described above and incorporating this invention.
  • The systems and methods of the present disclosure provide several benefits compared to existing systems. First, the present disclosure provides effective use of multi-core and many-core processors, which extends the usage of FPGAs in heterogeneous environments, in both personal and cloud computing applications. Second, the dynamic runtime configuration of the modular offload nodes described herein allows the main application CPU (i.e., the core CPU 108) to offload its computationally intensive tasks. This provides the flexibility needed to satisfy a wide variety of computing needs. Third, the hardware acceleration of pipelined tasks using the offload nodes significantly improves computational efficiency.
  • The foregoing is merely illustrative of the principles of the embodiments, and various modifications can be made by those skilled in the art without departing from the scope and spirit of the embodiments disclosed herein. The above-described embodiments of the present disclosure are presented for purposes of illustration and not of limitation, and the present invention is limited only by the claims which follow.

Claims (1)

What is claimed is:
1. A method of configuring a programmable integrated circuit device, the method comprising:
identifying, by a processor in a hard processor region of the programmable integrated circuit device, one or more tasks for assigning to an offload region of the programmable integrated circuit device;
transmitting, by the processor in the hard processor region, an instruction to the offload region; and
configuring a plurality of offload nodes in the offload region to perform the one or more tasks.
US15/912,307 2015-02-18 2018-03-05 Modular offloading for computationally intensive tasks Abandoned US20180196698A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/912,307 US20180196698A1 (en) 2015-02-18 2018-03-05 Modular offloading for computationally intensive tasks

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US14/624,951 US9910705B1 (en) 2015-02-18 2015-02-18 Modular offloading for computationally intensive tasks
US15/912,307 US20180196698A1 (en) 2015-02-18 2018-03-05 Modular offloading for computationally intensive tasks

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US14/624,951 Continuation US9910705B1 (en) 2015-02-18 2015-02-18 Modular offloading for computationally intensive tasks

Publications (1)

Publication Number Publication Date
US20180196698A1 true US20180196698A1 (en) 2018-07-12

Family

ID=61257264

Family Applications (2)

Application Number Title Priority Date Filing Date
US14/624,951 Active 2035-03-11 US9910705B1 (en) 2015-02-18 2015-02-18 Modular offloading for computationally intensive tasks
US15/912,307 Abandoned US20180196698A1 (en) 2015-02-18 2018-03-05 Modular offloading for computationally intensive tasks

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US14/624,951 Active 2035-03-11 US9910705B1 (en) 2015-02-18 2015-02-18 Modular offloading for computationally intensive tasks

Country Status (1)

Country Link
US (2) US9910705B1 (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2015408850B2 (en) * 2015-09-14 2020-06-25 Teleste Oyj A method for wireless data offload
US10489195B2 (en) * 2017-07-20 2019-11-26 Cisco Technology, Inc. FPGA acceleration for serverless computing
WO2020000136A1 (en) 2018-06-25 2020-01-02 Alibaba Group Holding Limited System and method for managing resources of a storage device and quantifying the cost of i/o requests
US10516649B1 (en) * 2018-06-27 2019-12-24 Valtix, Inc. High-performance computer security gateway for cloud computing platform
US11012475B2 (en) 2018-10-26 2021-05-18 Valtix, Inc. Managing computer security services for cloud computing platforms
US11061735B2 (en) 2019-01-02 2021-07-13 Alibaba Group Holding Limited System and method for offloading computation to storage nodes in distributed system
US11617282B2 (en) 2019-10-01 2023-03-28 Alibaba Group Holding Limited System and method for reshaping power budget of cabinet to facilitate improved deployment density of servers
US11461262B2 (en) * 2020-05-13 2022-10-04 Alibaba Group Holding Limited Method and system for facilitating a converged computation and storage node in a distributed storage system
US11556277B2 (en) 2020-05-19 2023-01-17 Alibaba Group Holding Limited System and method for facilitating improved performance in ordering key-value storage with input/output stack simplification
US11507499B2 (en) 2020-05-19 2022-11-22 Alibaba Group Holding Limited System and method for facilitating mitigation of read/write amplification in data compression
US11487465B2 (en) 2020-12-11 2022-11-01 Alibaba Group Holding Limited Method and system for a local storage engine collaborating with a solid state drive controller
US11734115B2 (en) 2020-12-28 2023-08-22 Alibaba Group Holding Limited Method and system for facilitating write latency reduction in a queue depth of one scenario
US11726699B2 (en) 2021-03-30 2023-08-15 Alibaba Singapore Holding Private Limited Method and system for facilitating multi-stream sequential read performance improvement with reduced read amplification

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030112467A1 (en) * 2001-12-17 2003-06-19 Mccollum Tim Apparatus and method for multimedia navigation
US7546441B1 (en) * 2004-08-06 2009-06-09 Xilinx, Inc. Coprocessor interface controller
US20070283193A1 (en) * 2006-04-21 2007-12-06 Altera Corporation Soft error location and sensitivity detection for programmable devices
US20080065835A1 (en) * 2006-09-11 2008-03-13 Sun Microsystems, Inc. Offloading operations for maintaining data coherence across a plurality of nodes
US20110167250A1 (en) * 2006-10-24 2011-07-07 Dicks Kent E Methods for remote provisioning of eletronic devices
US8630829B1 (en) * 2007-07-19 2014-01-14 The Mathworks, Inc. Computer aided design environment with electrical and electronic features
US20090089794A1 (en) * 2007-09-27 2009-04-02 Hilton Ronald N Apparatus, system, and method for cross-system proxy-based task offloading
US20100241758A1 (en) * 2008-10-17 2010-09-23 John Oddie System and method for hardware accelerated multi-channel distributed content-based data routing and filtering
US20110029691A1 (en) * 2009-08-03 2011-02-03 Rafael Castro Scorsi Processing System and Method
US20130089109A1 (en) * 2010-05-18 2013-04-11 Lsi Corporation Thread Synchronization in a Multi-Thread, Multi-Flow Network Communications Processor Architecture
US20130086332A1 (en) * 2010-05-18 2013-04-04 Lsi Corporation Task Queuing in a Multi-Flow Network Processor Architecture
US20140059390A1 (en) * 2010-10-20 2014-02-27 Netapp, Inc. Use of service processor to retrieve hardware information
US20120246052A1 (en) * 2010-12-09 2012-09-27 Exegy Incorporated Method and Apparatus for Managing Orders in Financial Markets
US20130117486A1 (en) * 2011-11-04 2013-05-09 David A. Daniel I/o virtualization via a converged transport and related technology
US20130343407A1 (en) * 2012-06-21 2013-12-26 Jonathan Stroud High-speed cld-based tcp assembly offload
US20140176187A1 (en) * 2012-12-23 2014-06-26 Advanced Micro Devices, Inc. Die-stacked memory device with reconfigurable logic
US20140298061A1 (en) * 2013-04-01 2014-10-02 Cleversafe, Inc. Power control in a dispersed storage network
US20150046679A1 (en) * 2013-08-07 2015-02-12 Qualcomm Incorporated Energy-Efficient Run-Time Offloading of Dynamically Generated Code in Heterogenuous Multiprocessor Systems
US20150046478A1 (en) * 2013-08-07 2015-02-12 International Business Machines Corporation Hardware implementation of a tournament tree sort algorithm
US20160210167A1 (en) * 2013-09-24 2016-07-21 University Of Ottawa Virtualization of hardware accelerator
US20150234698A1 (en) * 2014-02-18 2015-08-20 Netapp, Inc. Methods for diagnosing hardware component failure and devices thereof
US20160094619A1 (en) * 2014-09-26 2016-03-31 Jawad B. Khan Technologies for accelerating compute intensive operations using solid state drives

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10970119B2 (en) * 2017-03-28 2021-04-06 Intel Corporation Technologies for hybrid field-programmable gate array application-specific integrated circuit code acceleration
US11372684B2 (en) 2017-03-28 2022-06-28 Intel Corporation Technologies for hybrid field-programmable gate array application-specific integrated circuit code acceleration
US11687375B2 (en) 2017-03-28 2023-06-27 Intel Corporation Technologies for hybrid field-programmable gate array application-specific integrated circuit code acceleration

Also Published As

Publication number Publication date
US9910705B1 (en) 2018-03-06

Similar Documents

Publication Publication Date Title
US20180196698A1 (en) Modular offloading for computationally intensive tasks
Shantharama et al. Hardware-accelerated platforms and infrastructures for network functions: A survey of enabling technologies and research studies
CN110088742B (en) Logical repository service using encrypted configuration data
Cerović et al. Fast packet processing: A survey
Zhang et al. {G-NET}: Effective {GPU} Sharing in {NFV} Systems
Kim et al. NBA (network balancing act) a high-performance packet processing framework for heterogeneous processors
EP3329413A1 (en) Techniques to secure computation data in a computing environment
Wassel et al. Networks on chip with provable security properties
US20180217823A1 (en) Tightly integrated accelerator functions
Al-Aghbari et al. Cloud-based FPGA custom computing machines for streaming applications
Ghasemi et al. Accelerating apache spark with fpgas
Dosanjh et al. Tail queues: a multi‐threaded matching architecture
Sklyarov et al. Fast regular circuits for network-based parallel data processing
Kosciuszkiewicz et al. Run-time management of reconfigurable hardware tasks using embedded linux
Bergmann et al. A process model for hardware modules in reconfigurable system-on-chip
Vu et al. Efficient hardware task migration for heterogeneous FPGA computing using HDL-based checkpointing
Rajan et al. Trojan aware network-on-chip routing
Li et al. FPGA overlays: hardware-based computing for the masses
Mentone et al. CUDA virtualization and remoting for GPGPU based acceleration offloading at the edge
Ghasemi A scalable heterogeneous dataflow architecture for big data analytics using fpgas
Jiang et al. Properties of self-timed ring architectures for deadlock-free and consistent configuration reaching maximum throughput
Liu et al. Lightweight secure processor prototype on FPGA
Huang et al. Virtualizable hardware/software design infrastructure for dynamically partially reconfigurable systems
Behera et al. An enhanced approach towards improving the performance of embedding memory management units into Network-on-Chip
Bai et al. A hybrid ARM‐FPGA cluster for cryptographic algorithm acceleration

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: ADVISORY ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCV Information on status: appeal procedure

Free format text: NOTICE OF APPEAL FILED

STCV Information on status: appeal procedure

Free format text: EXAMINER'S ANSWER TO APPEAL BRIEF MAILED

STCV Information on status: appeal procedure

Free format text: ON APPEAL -- AWAITING DECISION BY THE BOARD OF APPEALS

STCV Information on status: appeal procedure

Free format text: BOARD OF APPEALS DECISION RENDERED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION