US20180196698A1 - Modular offloading for computationally intensive tasks - Google Patents
Modular offloading for computationally intensive tasks
- Publication number
- US20180196698A1 (U.S. patent application Ser. No. 15/912,307)
- Authority
- United States
- Prior art keywords
- offload
- nodes
- node
- processor
- region
- Prior art date
- 2015-02-18
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/509—Offload
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
Description
- This is a continuation of U.S. patent application Ser. No. 14/624,951, entitled “Modular Offloading for Computationally Intensive Tasks,” filed Feb. 18, 2015, the contents of which are incorporated by reference in their entirety for all purposes.
- This disclosure relates to integrated circuit devices, such as field programmable gate array (FPGA) devices, and systems and methods for offloading computationally intensive tasks to offload regions on such devices.
- Many-core and multi-core devices provide a way to increase performance of a device without incurring the cost of increasing clock speeds. Many-core devices may include dedicated ASIC blocks for hardware-specific functions that are often referred to as hardware accelerators. Programmable logic devices (PLDs) (e.g., field programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), field programmable systems on a chip (FPSCs), or other types of programmable devices) generally include programmable logic blocks which may be configured to implement various operations. Some PLDs also include configurable embedded hardware to support additional operations. However, conventional approaches to configuring such embedded hardware are often cumbersome and unwieldy.
- One limitation of existing many-core and multi-core systems is that the topology and node configuration of the system is fixed. In these systems, tasks are run separately, and physical copies of data are passed between computing nodes and applications, which is inefficient. Accordingly, there is a need for an improved approach to configuring hardware resources of a PLD.
- In light of the above, the present disclosure relates to a programmable integrated circuit device that includes an offload region with a flexible topology that can be configured at execution time.
- In accordance with embodiments of the present disclosure, there is provided a method of configuring a programmable integrated circuit device. The method includes identifying, by a processor in a hard processor region of the programmable integrated circuit device, one or more tasks for assigning to an offload region of the programmable integrated circuit device. The processor in the hard processor region transmits an instruction to the offload region, and a plurality of offload nodes in the offload region are configured to perform the one or more tasks.
- In some embodiments, the processor in the hard processor region is a first processor, and both the first processor and a second processor in the offload region are configured to asynchronously access a memory in the hard processor region. Configuring the plurality of offload nodes may include configuring one or more data flow paths through at least a subset of the plurality of offload nodes. In some embodiments, the one or more tasks include processing security content, and the processing of security content is assigned to the offload region to reduce the likelihood of a hacker attack on the programmable integrated circuit device. The plurality of offload nodes in the offload region may be modified by adding a new offload node to the plurality of offload nodes, removing an offload node from the plurality of offload nodes, or replacing an offload node in the plurality of offload nodes.
- In accordance with embodiments of the present disclosure, there is provided a programmable integrated circuit device having a hard processor region and an offload region coupled to each other. The hard processor region has a first processor that identifies one or more tasks that are assigned to the offload region and transmits an instruction to the offload region. The offload region includes a plurality of offload nodes that are configured to perform the one or more tasks.
- In some embodiments, the hard processor region comprises a memory, and the processor in the hard processor region and another processor in the offload region are configured to asynchronously access the memory. The instruction may include how to configure one or more data flow paths through at least a subset of the plurality of offload nodes. In an example, the one or more tasks include processing security content, and the processing of security content is assigned to the offload region to reduce the likelihood of a hacker attack on the integrated circuit device.
- In some embodiments, partial reconfiguration of the offload nodes is used, such that the plurality of offload nodes in the offload region are configured to be modified by adding a new offload node to the plurality of offload nodes, removing an offload node from the plurality of offload nodes, or replacing an offload node in the plurality of offload nodes. At least one of the plurality of offload nodes may be implemented as a hard intellectual property block, where the hard intellectual property block includes a layout of reusable hardware having a specified application function. The specified application function may be selected from the group consisting of: a cryptographic function, a frequency transform function, a prime factorization function, a compression or decompression function, a mathematical function, a hash function, and an Ethernet function.
- In some embodiments, at least one field programmable gate array is used to implement the offload region. The offload region may further include a second memory that is accessible to each offload node in the offload region, and the second memory in the offload region may be partitioned in accordance with the instruction.
- Further features of the disclosure, its nature and various advantages will be apparent upon consideration of the following detailed description, taken in conjunction with the accompanying drawings, in which like reference characters refer to like parts throughout, and in which:
- FIG. 1 shows a diagram of a system that assigns computationally intensive tasks to be performed at an offload region, in accordance with some embodiments of the present disclosure;
- FIG. 2 shows a diagram of an offload region having a set of offload nodes to perform assigned tasks, in accordance with some embodiments of the present disclosure;
- FIG. 3 shows a diagram of an offload region having a set of offload nodes to perform processing of security content, in accordance with some embodiments of the present disclosure;
- FIG. 4 shows a diagram of an offload region having a set of offload nodes to perform parallel mathematical processing, in accordance with some embodiments of the present disclosure;
- FIG. 5 shows an illustrative flow diagram of a process for configuring an offload region of a programmable integrated circuit device, in accordance with some embodiments of the present disclosure; and
- FIG. 6 is a simplified block diagram of an illustrative system employing a programmable logic device incorporating the present disclosure.
- To provide an overall understanding of the invention, certain illustrative embodiments will now be described. However, it will be understood by one of ordinary skill in the art that the systems and methods described herein may be adapted and modified as is appropriate for the application being addressed, that they may be employed in other suitable applications, and that such other additions and modifications will not depart from the scope hereof.
- The figures described herein show illustrative embodiments; however, the figures are not necessarily intended to show the exact layout of the hardware components contained in the embodiments. The figures are provided merely to illustrate the high-level conceptual layouts of the embodiments. The embodiments disclosed herein may be implemented with any suitable number of components and any suitable layout of components in accordance with principles known in the art.
- General purpose processors are often poorly suited to special purpose tasks because they use general purpose instruction sets. For example, performing a complicated hash on a large file may take a general purpose processor more than ten seconds. A similar task may be hardware accelerated on a modular offload engine (having a customized instruction set) in an FPGA or ASIC and may take a fraction of a second.
- The present disclosure describes a heterogeneous many-core FPGA solution that may be used for servers and cloud data centers to dynamically customize applications and usage models, and to provide hardware acceleration for computing processes. The present disclosure allows for the topology and node configuration of a heterogeneous system to be configurable at execution time. Such a configurable system allows for parallel processing and flexible pipe staging of various tasks.
- FIG. 1 is a block diagram of a system 100 divided into two components represented by a hard processor system (HPS) domain 102 and an FPGA domain 104. The HPS domain 102 includes a memory 106, a core CPU 108, an offload manager 110, and an offload driver 112. The FPGA domain 104 includes a processor 114, a memory 116, and a cluster of offload nodes 118. Some tasks that are assigned to be performed by the system 100 may be computationally intensive for the HPS domain 102 to handle by itself. In this case, the HPS domain 102 may offload certain tasks to the cluster of offload nodes 118 in the FPGA fabric.
- An offload node may include a hard intellectual property (IP) block that is configurable at execution time. In particular, the topology of the offload nodes 118 may be configured at execution time. The topology defines the various connections between pairs of the offload nodes 118 and defines the inputs and outputs of the connections. By allowing the topology to be configured at execution time, the same set of offload nodes 118 may be configured in multiple ways depending on the desired functionality. In this manner, the set of offload nodes 118 is flexible and can be used in many different situations. Moreover, the FPGA fabric may be hardware accelerated by using one or more hard IP blocks as the offload nodes 118.
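- The patent does not tie this topology to any concrete data structure. Purely as an illustration, a minimal C sketch of how a node cluster and its runtime-configurable connections might be described is shown below; every type and field name here is hypothetical.

```c
/* A minimal C sketch of one way the runtime-configurable topology
 * could be described. All names (offload_node_kind, offload_edge,
 * offload_topology) are hypothetical; the patent specifies no
 * concrete data structures. */
#include <stddef.h>

enum offload_node_kind {
    NODE_CRYPTO, NODE_FIBONACCI, NODE_FFT, NODE_EMAC,
    NODE_PRIME_FACTOR, NODE_ZIP, NODE_MATH, NODE_HASH,
    NODE_HEAVISIDE
};

enum offload_edge_kind {
    EDGE_DATA_FLOW,     /* main pipeline path                  */
    EDGE_LOOP_BACK,     /* a node's output feeds its own input */
    EDGE_BACK_DOOR,     /* side channel for test and debug     */
    EDGE_MEMORY_ACCESS  /* node access to the FPGA-side memory */
};

struct offload_edge {
    int src;   /* index into nodes[]                             */
    int dst;   /* index into nodes[]; equal to src for loop back */
    enum offload_edge_kind kind;
};

struct offload_topology {
    enum offload_node_kind nodes[16];
    size_t node_count;
    struct offload_edge edges[32];
    size_t edge_count;
};
```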
- Although the system 100 is shown and described as having an FPGA domain 104, it should be understood that the system 100 and other systems discussed herein may have other types of integrated circuits (ICs) instead of or in addition to one or more FPGAs. It should also be understood that the systems and methods discussed herein as applying to FPGAs may be equally applicable to ICs of other types, such as application-specific integrated circuits (ASICs), application specific standard products (ASSPs), and other programmable logic devices (PLDs). For example, in some embodiments, the system 100 may include ASIC and/or off-the-shelf ASSP dies. In some embodiments, a combination of FPGA and ASIC/ASSP dies may be used, assuming such FPGA and ASIC/ASSP dies have compatible electrical interfaces.
- One or more components of the FPGA domain 104 may be implemented with hardware IP blocks, which may include a layout of reusable hardware having a specified application function. One or more types of hard IP blocks may be implemented in the system 100. Examples of these hard IP blocks that may be included in the set of offload nodes 118 are described in detail in relation to FIGS. 2-4. The FPGA domain 104 may be implemented using a system-on-chip (SoC) FPGA, whose hard IP may include an embedded multicore processor subsystem.
- In some implementations, the core CPU 108 of the HPS domain 102 includes a number of nodes, each of which may correspond to an instance of an operating system. The processor 114 of the FPGA domain 104 may include any number of embedded processors, such as a NIOS II processor. As shown in FIG. 1, both the core CPU 108 and the processor 114 access the memory unit 106. By allowing both processors in the HPS domain 102 and the FPGA domain 104 to access the same shared memory 106, there is no need to pass physical copies of data stored in the memory 106 between the two domains. Instead, memory pointers may be passed between the two domains, thereby reducing transmission costs associated with passing of data between the HPS domain 102 and the FPGA domain 104. In some embodiments, a mechanism to prevent data contention is used for the CPU 108 and the processor 114. For example, a mutex locking mechanism may be used such that the CPU 108 and the processor 114 are prohibited from concurrently accessing the memory 106. In this manner, the “zero copy” mechanism of the shared memory 106 avoids the need for computing nodes to pass physical copies back and forth in the pipeline, thereby improving overall system performance.
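- For illustration, a rough sketch of this zero-copy hand-off follows, with a POSIX mutex standing in for whatever contention-prevention mechanism the platform actually provides; the shared_region layout is invented, since the patent does not specify a descriptor format.

```c
/* Illustrative zero-copy hand-off between the HPS-side CPU and the
 * FPGA-side processor. A pthread mutex stands in for the platform's
 * actual locking primitive; the descriptor layout is hypothetical. */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

struct shared_region {
    pthread_mutex_t lock;  /* prevents concurrent access           */
    uint8_t *data;         /* payload stays in place in memory 106 */
    size_t len;
    int owner;             /* 0 = HPS side, 1 = FPGA side          */
};

/* HPS side: publish a buffer by pointer instead of copying it. */
static void publish_to_offload(struct shared_region *r,
                               uint8_t *buf, size_t len)
{
    pthread_mutex_lock(&r->lock);
    r->data = buf;   /* pass the pointer, not the bytes         */
    r->len = len;
    r->owner = 1;    /* the FPGA-side processor may now read it */
    pthread_mutex_unlock(&r->lock);
}
```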
- The offload manager 110 receives data from the core CPU 108 and handles construction of pipes and redirectors for forming connections between the various nodes in the cluster of offload nodes 118. The offload manager 110 also ensures that the constructed pipes and redirectors adhere to one or more offload policies. In an example, a Fibonacci computing node in the cluster of offload nodes 118 may be configured to have a loopback connection, but other types of nodes may not be configured to handle loopback connections. By ensuring that appropriate nodes, such as Fibonacci nodes, have the proper types of connections, the offload manager 110 ensures that the connections between nodes in the cluster of offload nodes 118 comply with such policies. The offload manager 110 also constructs the data flow path that defines the manner in which data flows through the offload nodes 118 or a subset thereof.
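- The patent gives only one concrete policy example (loopback is acceptable for a Fibonacci node but not for other node types), so the following check is a guess at how the offload manager's policy enforcement might look, reusing the hypothetical structures from the earlier sketch.

```c
/* Hypothetical policy check for the offload manager: a loop back
 * connection is only legal on node kinds that the policy permits
 * (the patent's example permits it for Fibonacci nodes). */
#include <stdbool.h>
#include <stddef.h>

static bool loopback_allowed(enum offload_node_kind k)
{
    return k == NODE_FIBONACCI;  /* extend per device policy */
}

static bool topology_is_legal(const struct offload_topology *t)
{
    for (size_t i = 0; i < t->edge_count; i++) {
        const struct offload_edge *e = &t->edges[i];
        if (e->kind == EDGE_LOOP_BACK &&
            !loopback_allowed(t->nodes[e->src]))
            return false;  /* pipe construction would violate policy */
    }
    return true;
}
```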
- In some embodiments, the offload manager 110 uses multi-core APIs to configure the connections for at least a subset of the nodes within the cluster of offload nodes 118. The offload manager 110 communicates instructions for configuring these connections to the offload driver 112, which essentially serves as an interface between the offload manager 110 in the HPS domain 102 and the processor 114 in the FPGA domain 104, and also between the offload manager 110 and the cluster of offload nodes 118 in the FPGA domain. This interface may include OpenCL APIs that pass along tasks to the offload nodes 118 in the FPGA domain 104. In this manner, the offload manager 110 constructs the path over which data flows through the cluster of offload nodes 118, while ensuring that the nodes adhere to one or more policies. In FIG. 1, the offload manager 110 and the offload driver 112 are shown as two separate entities within the HPS domain 102, but they may be combined into a single entity without departing from the scope of the present disclosure.
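- The disclosure names OpenCL as one possible interface but shows no host code. The fragment below sketches the general shape of such a hand-off using standard OpenCL 1.x host calls; the kernel name "offload_pipeline" and the pre-existing context, queue, and program objects are assumptions, not anything specified by the patent.

```c
/* Sketch of an OpenCL 1.x host-side hand-off to an FPGA pipeline.
 * ctx, queue, and program are assumed to already exist; the kernel
 * name "offload_pipeline" is invented for illustration. */
#include <CL/cl.h>

int submit_offload_task(cl_context ctx, cl_command_queue queue,
                        cl_program program, void *shared_buf, size_t nbytes)
{
    cl_int err;
    /* CL_MEM_USE_HOST_PTR keeps the payload in place (zero copy). */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                nbytes, shared_buf, &err);
    if (err != CL_SUCCESS)
        return -1;

    cl_kernel k = clCreateKernel(program, "offload_pipeline", &err);
    if (err != CL_SUCCESS) {
        clReleaseMemObject(buf);
        return -1;
    }

    clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
    err = clEnqueueTask(queue, k, 0, NULL, NULL); /* single work-item task */
    clFinish(queue);                              /* wait for completion   */

    clReleaseKernel(k);
    clReleaseMemObject(buf);
    return err == CL_SUCCESS ? 0 : -1;
}
```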
- The offload driver 112 instructs the processor 114 to load the data for each of the offload nodes 118 that will be used in the desired configuration. In particular, the processor 114 may be instructed by the offload driver 112 to load the appropriate data from the shared memory 106 into the memory 116. The instruction received from the offload driver 112 may refer to pointers to addresses in the shared memory 106. By loading data from the shared memory 106 into the memory 116, the processor 114 allows for the appropriate data to be read by the offload nodes 118.
- In some implementations, the memory 116 is partitioned and managed by the processor 114. For example, the processor 114 may partition the memory 116 according to the set of offload nodes 118, or according to those offload nodes 118 that will be used in a particular configuration. The memory 116 may be partitioned to reserve a portion of the memory dedicated to each offload node for its usage, such as configuration and data exchange.
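- As a sketch of what such partitioning might look like (the even per-node slice layout below is an assumption, not something the patent specifies):

```c
/* Hypothetical even partitioning of the FPGA-side memory 116: each
 * offload node in the active configuration receives a dedicated slice
 * for its configuration words and data exchange. */
#include <stddef.h>
#include <stdint.h>

struct node_partition {
    uintptr_t base;  /* start of this node's slice of memory 116 */
    size_t size;
};

static size_t partition_memory(uintptr_t mem_base, size_t mem_size,
                               size_t active_nodes,
                               struct node_partition *out)
{
    if (active_nodes == 0)
        return 0;
    size_t slice = mem_size / active_nodes;
    for (size_t i = 0; i < active_nodes; i++) {
        out[i].base = mem_base + i * slice;  /* reserved per node */
        out[i].size = slice;
    }
    return slice;
}
```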
- The various offload nodes 118 may be chained and connected to efficiently perform a list of complicated tasks. The flow of the tasks is configurable such that the offload nodes 118 may be piped together at execution time. In an example, OpenCL may be used to allow the application layer to use the set of offload nodes 118 in the FPGA fabric for more computing power using task-based parallelism. Example configurations of the offload nodes 118 are described in more detail in relation to FIGS. 2-4.
- FIG. 2 is a block diagram of a system 200 having an example configuration of a cluster of offload nodes 218 that may replace the offload nodes 118 in FIG. 1. As is shown in FIG. 2, a set of eight offload nodes are included in the cluster, and each node performs a specific function. As was described above, the connections between the various nodes of FIG. 2 are set at execution time by the offload manager 110 and the offload driver 112.
- As is shown in FIG. 2, the offload nodes 218 include a crypto node 230, a Fibonacci node 232, a Fast Fourier Transform (FFT) node 234, an Ethernet MAC (EMAC) node 236, a prime factor node 238, a zip/unzip node 240, a math node 242, and a hash node 244. As was described in relation to FIG. 1, the offload manager 110 configures the various connections between the offload nodes 218 in FIG. 2. In particular, the offload manager 110 may set the different types of connections, such as data flow, memory access, loop back flow, and back door connection types. The different types of connections may be set up by the offload manager 110 based on the time at which the offload nodes are to be used. For example, the crypto node 230 and the prime factor node 238 may use back door connections so that the nodes 230 and 238 may communicate with each other for testing and debugging purposes.
- In FIG. 2, each offload node 230-244 has access to the memory 116 over memory access connections, but not all the offload nodes 218 are used to process data. Data is passed from the offload driver 112 to the crypto node 230, to the prime factor node 238, and finally to the hash node 244. The Fibonacci node 232 has a loop back flow connection, meaning that the Fibonacci node 232 takes an input from its own output port. Moreover, the connections between the crypto node 230 and the prime factor node 238 are back door connections. The crypto node 230 may perform encryption and/or decryption of the incoming data using a public key and/or a private key. The prime factor node 238 may be configured to generate public/private key pairs for the crypto node 230 to use. In this manner, the back door connections between the crypto node 230 and the prime factor node 238 may be used for testing and debugging. The offload manager 110 keeps track of these different types of connections and which nodes should be connected or piped together in what manner.
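- Expressed with the hypothetical topology structures sketched earlier, the FIG. 2 configuration might be written as follows; indices refer to positions in the nodes array, and this is an illustration rather than the patent's actual encoding.

```c
/* The FIG. 2 configuration: a crypto -> prime factor -> hash data
 * path, a Fibonacci loop back, and a crypto <-> prime factor back
 * door for testing and debugging. */
static const struct offload_topology fig2_topology = {
    .nodes = { NODE_CRYPTO, NODE_PRIME_FACTOR, NODE_HASH, NODE_FIBONACCI },
    .node_count = 4,
    .edges = {
        { 0, 1, EDGE_DATA_FLOW }, /* crypto node 230  -> prime factor 238 */
        { 1, 2, EDGE_DATA_FLOW }, /* prime factor 238 -> hash node 244    */
        { 3, 3, EDGE_LOOP_BACK }, /* Fibonacci node 232 feeds itself      */
        { 0, 1, EDGE_BACK_DOOR }, /* debug channel between 230 and 238    */
    },
    .edge_count = 4,
};
```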
- In some implementations, the offload nodes 218 include multiple instances of an identical computing node. For example, the FFT node 234 may be replicated multiple times so as to provide parallel computing. FIG. 2 provides an exemplary block diagram of the various connections that may be configured between the offload nodes 218. The connections in FIG. 2 may be dynamically configured and represent a general example of modular offloading for computationally intensive tasks.
- FIG. 3 is a block diagram of a system 300 having an example configuration of a cluster of offload nodes 318 for processing security content. The offload nodes 318 are the same as the offload nodes 218 shown in FIG. 2 (i.e., including a crypto node 330, a Fibonacci node 332, an FFT node 334, an EMAC node 336, a prime factor node 338, a zip/unzip node 340, a math node 342, and a hash node 344), but the configuration of the offload nodes 318 of FIG. 3 is different from the configuration of the offload nodes 218 of FIG. 2. In particular, the same set of offload nodes may be used for both sets 218 and 318, but depending on the desired functionality, the same set of offload nodes may be connected in different manners so as to execute different tasks. In the example of FIG. 2, data flows from the offload driver 112 to the crypto node 230, to the prime factor node 238, and to the hash node 244. In contrast, the data flow is configured differently in FIG. 3, in which data flows from the offload driver 112 to the crypto node 330, to the zip/unzip node 340, to the hash node 344, to the EMAC node 336, and to the network 346.
- One application of the example configuration in FIG. 3 may involve the core CPU 108 assigning the processing of security content to the FPGA domain 104. The core CPU 108 may assign such a task to the FPGA domain 104 to free up the HPS domain 102 to handle other tasks, such as operating system tasks. In particular, it may be desirable to offload the processing of security content to the FPGA domain 104 to prevent hacker attacks. By using the offload nodes 318 to process the security content, a tamper-resistant system is used to prevent hacker attacks. In particular, in the FPGA domain 104, the offload nodes 318 are implemented at the level of hardware gates, which are more difficult to attack than the software that may be implemented in the HPS domain 102. To attack the software, a hacker may simply use a powerful debugger to trace or step through the software functions, while attacking the hardware implementation in the FPGA domain 104 is more complex.
- Any or all of the offload nodes 318 of FIG. 3 may be implemented as hard IP blocks. In the example shown in FIG. 3, the crypto node 330 may include a hardware crypto engine that accelerates applications that need cryptographic functions. After the data has been compressed or uncompressed by the zip/unzip node 340, the hash node 344 computes a hash function on the data. In some implementations, the hash node 344 performs a cipher or hash process, such as Data Encryption Standard/Advanced Encryption Standard, Kasumi, SNOW 3G, MD5 (e.g., an md5sum), SHA-1 (e.g., a sha1sum), SHA-2, or any other process to calculate and verify a hash of the data. Then, the EMAC node 336 sends data for publishing to a network 346, which may correspond to the World Wide Web or any other suitable network.
- FIG. 4 is a block diagram of a system 400 having an example configuration of a cluster of offload nodes 418 for performing mathematical operations. The offload nodes 418 include a prime factor node 450, a Fibonacci node 452, an FFT node 454, an EMAC node 456, a Heaviside node 458, a zip/unzip node 460, a math node 462, and a hash node 464. As described above, one or more of these nodes may be implemented as a hard IP block that is configurable at execution time.
- For example, the core CPU 108 determines that certain tasks that are CPU-intensive should be passed over to the FPGA domain 104. Examples of CPU-intensive tasks include, but are not limited to, prime factoring of a large integer, mathematical arctan and Heaviside step functions, and FFT computations. In general, any task that is computationally expensive or slow for the HPS domain 102 to handle on its own may be passed over to the FPGA domain 104.
- In the example shown in FIG. 4, there are three parallel data flow paths from the offload driver 112 to the prime factor node 450, the Heaviside node 458, and the math node 462, which may be a hard IP block configured to perform an arctan computation. Data flows out of each of these three nodes 450, 458, and 462 to the FFT node 454. As is shown in FIG. 4, the prime factor node 450, the Heaviside node 458, and the math node 462 may perform prime factor, Heaviside, and arctan computations in parallel because these computations are independent of one another. By allowing for parallel computations, the set of offload nodes 418 saves significant time.
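- In the same hypothetical notation used above, the FIG. 4 fan-in might look like the following, with three independent producers feeding one FFT consumer:

```c
/* The FIG. 4 configuration: prime factor, Heaviside, and arctan math
 * computations proceed in parallel, and each result fans in to the
 * FFT node. */
static const struct offload_topology fig4_topology = {
    .nodes = { NODE_PRIME_FACTOR, NODE_HEAVISIDE, NODE_MATH, NODE_FFT },
    .node_count = 4,
    .edges = {
        { 0, 3, EDGE_DATA_FLOW }, /* prime factor node 450 -> FFT 454 */
        { 1, 3, EDGE_DATA_FLOW }, /* Heaviside node 458    -> FFT 454 */
        { 2, 3, EDGE_DATA_FLOW }, /* arctan math node 462  -> FFT 454 */
    },
    .edge_count = 3,
};
```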
offload nodes FPGA domain 104. However, one potential disadvantage of using partial reconfiguration to address FPGA resource limitations is that there may be some penalty in the form of a wait time delay if the IP blocks are being reconfigured during run time. -
- FIG. 5 shows an illustrative flow diagram of a process 500 for configuring a set of offload nodes, according to an illustrative embodiment. The process 500 may be performed on an integrated circuit device, such as an FPGA device, an ASIC device, an ASSP device, or a PLD. The process 500 may be used for servers and cloud data centers to dynamically customize applications and usage models, and to provide hardware acceleration for computing processes. The process 500 allows the topology and node configuration of a heterogeneous system to be configured at execution time. Such a configurable system allows for parallel processing and flexible pipe staging of various tasks.
- At 502, a processor in a hard processor region of a programmable integrated circuit device identifies one or more tasks for assigning to an offload region of the programmable integrated circuit device. In particular, some tasks that are assigned to be performed by a system with a hard processor region (e.g., the HPS domain 102) may be computationally intensive for the hard processor region to handle by itself. In this case, the hard processor region may offload certain tasks to an offload region (e.g., the FPGA domain 104).
- One example of a task that may be assigned from the hard processor region to the offload region is the processing of security content. It may be desirable to offload the processing of secure material so as to reduce the likelihood of a hacker attack on the integrated circuit device.
- At 504, the processor in the hard processor region transmits an instruction to the offload region. As was described in relation to
FIG. 1 , the instruction may be transmitted from theoffload driver 112 to theprocessor 114, and may include one or more pointers to memory locations in the sharedmemory 106. In particular, theprocessor 114 may be instructed by theoffload driver 112 to load the appropriate data from thememory 106 into thememory 116. In some implementations, thememory 116 is partitioned in accordance with the instruction such that theoffload nodes 118 access the desired data. Moreover, theoffload driver 112 may transmit instruction data to the set ofoffload nodes 118 to configure theoffload nodes 118 so that the desired connections are formed. - At 506, a plurality of offload nodes in the offload region are configured to perform the one or more tasks. Configuring the offload nodes includes configuring the data flow paths through at least a subset of the offload nodes. As used herein, an offload node may include a hard intellectual property (IP) block that is configurable at execution time. As was described in relation to
- At 506, a plurality of offload nodes in the offload region are configured to perform the one or more tasks. Configuring the offload nodes includes configuring the data flow paths through at least a subset of the offload nodes. As used herein, an offload node may include a hard intellectual property (IP) block that is configurable at execution time. As was described in relation to FIGS. 1-4, the topology of the offload nodes may be configured such that the offload nodes perform specified application functions. Examples of such specified application functions include cryptographic functions, fast Fourier transform (FFT) functions, prime factorization functions, compression or decompression functions, mathematical functions, hash functions, and/or Ethernet functions.
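Under assumed types, configuring the data flow paths at step 506 might look like the sketch below, in which a function selector and a downstream-node index are written per node to realize a fork/join topology like that of FIG. 4. The enum values and the routing field are invented for illustration.

```c
#include <stdio.h>

/* Step 506 (sketch): each offload node is a configurable block whose
 * function and downstream connection are chosen at execution time. */
typedef enum {
    NODE_FN_CRYPTO, NODE_FN_FFT, NODE_FN_PRIME_FACTOR,
    NODE_FN_COMPRESS, NODE_FN_MATH, NODE_FN_HASH, NODE_FN_ETHERNET
} node_function;

typedef struct {
    node_function fn;  /* which application function this node performs */
    int next_node;     /* index of the downstream node, or -1 for output */
} node_config;

int main(void) {
    /* Two independent nodes (0 and 1) both feed an FFT node (2), mirroring
     * the fork/join data flow of FIG. 4. */
    node_config nodes[3] = {
        { NODE_FN_PRIME_FACTOR, 2 },
        { NODE_FN_MATH,         2 },
        { NODE_FN_FFT,         -1 },
    };
    for (int i = 0; i < 3; i++)
        printf("node %d: fn=%d -> next=%d\n",
               i, nodes[i].fn, nodes[i].next_node);
    return 0;
}
```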
- In some implementations, the processor in the hard processor region and the processor in the offload region are configured to asynchronously access a memory in the hard processor region. As was described in relation to FIG. 1, the core CPU 108 and the processor 114 both have access to the same shared memory 106 in the HPS domain 102. Because of this, there is no need to pass physical copies of data stored in the memory 106 between the two domains. Instead, memory pointers may be passed between the two domains, thereby saving on transmission costs. In some implementations, a mutex locking mechanism may be used such that the two processors cannot concurrently access the memory 106. In this manner, the "zero copy" mechanism of the shared memory 106 avoids requiring computing nodes to pass physical copies back and forth in the pipeline, thereby improving overall system performance.
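A minimal software analogue of this zero-copy hand-off is sketched below, with a POSIX mutex standing in for the disclosed locking mechanism: both sides operate on the same buffer through the same pointer, and the lock keeps their accesses from overlapping.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of the zero-copy hand-off: both processors see the same buffer,
 * so only a pointer is exchanged, and a mutex serializes access. */
typedef struct {
    pthread_mutex_t lock;
    uint8_t *data;    /* the one shared buffer -- never copied */
    size_t   length;
} shared_region;

static void produce(shared_region *r, uint8_t value) {  /* HPS side */
    pthread_mutex_lock(&r->lock);
    for (size_t i = 0; i < r->length; i++)
        r->data[i] = value;           /* written in place, no copy made */
    pthread_mutex_unlock(&r->lock);
}

static uint64_t consume(shared_region *r) {             /* offload side */
    uint64_t sum = 0;
    pthread_mutex_lock(&r->lock);
    for (size_t i = 0; i < r->length; i++)
        sum += r->data[i];
    pthread_mutex_unlock(&r->lock);
    return sum;
}

int main(void) {
    static uint8_t buffer[32];
    shared_region r = { PTHREAD_MUTEX_INITIALIZER, buffer, sizeof buffer };
    produce(&r, 3);
    printf("sum = %llu\n", (unsigned long long)consume(&r));  /* 96 */
    return 0;
}
```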
- FIG. 6 is a simplified block diagram of an illustrative system employing a programmable logic device (PLD) 140 incorporating the present disclosure. A PLD 140 programmed according to the present disclosure may be used in many kinds of electronic devices. One possible use is in a data processing system 1400 shown in FIG. 6. Data processing system 1400 may include one or more of the following components: a processor 1401; memory 1402; I/O circuitry 1403; and peripheral devices 1404. These components are coupled together by a system bus 1405 and are populated on a circuit board 1406 which is contained in an end-user system 1407. -
PLD 140 can be used in a wide variety of applications, such as computer networking, data networking, instrumentation, video processing, digital signal processing, or any other application where the advantage of using programmable or reprogrammable logic is desirable. PLD 140 can be used to perform a variety of different logic functions. For example, PLD 140 can be configured as a processor or controller that works in cooperation with processor 1401. PLD 140 may also be used as an arbiter for arbitrating access to a shared resource in the system. In yet another example, PLD 140 can be configured as an interface between processor 1401 and one of the other components in the system. It should be noted that the system shown in FIG. 6 is only exemplary, and that the true scope and spirit of the invention should be indicated by the following claims. - Various technologies can be used to implement
PLDs 140 as described above and incorporating this invention. - The systems and methods of the present disclosure provide several benefits compared to existing systems. First, the present disclosure provides effective use of multi-core and many-core processors, which extends the usage of FPGAs in heterogeneous environments, in both personal and cloud computing applications. Second, dynamic runtime configuration of the modular offload nodes described herein allows the main application CPU (i.e., the core CPU 108) to offload its computationally intensive tasks. This provides the flexibility needed to satisfy a wide variety of computing needs. Third, the hardware acceleration of pipelined tasks using the offload nodes significantly improves computational efficiency.
- The foregoing is merely illustrative of the principles of the embodiments, and various modifications can be made by those skilled in the art without departing from the scope and spirit of the embodiments disclosed herein. The above-described embodiments of the present disclosure are presented for purposes of illustration and not of limitation, and the present invention is limited only by the claims which follow.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/912,307 US20180196698A1 (en) | 2015-02-18 | 2018-03-05 | Modular offloading for computationally intensive tasks |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/624,951 US9910705B1 (en) | 2015-02-18 | 2015-02-18 | Modular offloading for computationally intensive tasks |
US15/912,307 US20180196698A1 (en) | 2015-02-18 | 2018-03-05 | Modular offloading for computationally intensive tasks |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/624,951 Continuation US9910705B1 (en) | 2015-02-18 | 2015-02-18 | Modular offloading for computationally intensive tasks |
Publications (1)
Publication Number | Publication Date |
---|---|
US20180196698A1 (en) | 2018-07-12 |
Family
ID=61257264
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/624,951 Active 2035-03-11 US9910705B1 (en) | 2015-02-18 | 2015-02-18 | Modular offloading for computationally intensive tasks |
US15/912,307 Abandoned US20180196698A1 (en) | 2015-02-18 | 2018-03-05 | Modular offloading for computationally intensive tasks |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/624,951 Active 2035-03-11 US9910705B1 (en) | 2015-02-18 | 2015-02-18 | Modular offloading for computationally intensive tasks |
Country Status (1)
Country | Link |
---|---|
US (2) | US9910705B1 (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2015408850B2 (en) * | 2015-09-14 | 2020-06-25 | Teleste Oyj | A method for wireless data offload |
US10489195B2 (en) * | 2017-07-20 | 2019-11-26 | Cisco Technology, Inc. | FPGA acceleration for serverless computing |
WO2020000136A1 (en) | 2018-06-25 | 2020-01-02 | Alibaba Group Holding Limited | System and method for managing resources of a storage device and quantifying the cost of i/o requests |
US10516649B1 (en) * | 2018-06-27 | 2019-12-24 | Valtix, Inc. | High-performance computer security gateway for cloud computing platform |
US11012475B2 (en) | 2018-10-26 | 2021-05-18 | Valtix, Inc. | Managing computer security services for cloud computing platforms |
US11061735B2 (en) | 2019-01-02 | 2021-07-13 | Alibaba Group Holding Limited | System and method for offloading computation to storage nodes in distributed system |
US11617282B2 (en) | 2019-10-01 | 2023-03-28 | Alibaba Group Holding Limited | System and method for reshaping power budget of cabinet to facilitate improved deployment density of servers |
US11461262B2 (en) * | 2020-05-13 | 2022-10-04 | Alibaba Group Holding Limited | Method and system for facilitating a converged computation and storage node in a distributed storage system |
US11556277B2 (en) | 2020-05-19 | 2023-01-17 | Alibaba Group Holding Limited | System and method for facilitating improved performance in ordering key-value storage with input/output stack simplification |
US11507499B2 (en) | 2020-05-19 | 2022-11-22 | Alibaba Group Holding Limited | System and method for facilitating mitigation of read/write amplification in data compression |
US11487465B2 (en) | 2020-12-11 | 2022-11-01 | Alibaba Group Holding Limited | Method and system for a local storage engine collaborating with a solid state drive controller |
US11734115B2 (en) | 2020-12-28 | 2023-08-22 | Alibaba Group Holding Limited | Method and system for facilitating write latency reduction in a queue depth of one scenario |
US11726699B2 (en) | 2021-03-30 | 2023-08-15 | Alibaba Singapore Holding Private Limited | Method and system for facilitating multi-stream sequential read performance improvement with reduced read amplification |
Patent Citations (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030112467A1 (en) * | 2001-12-17 | 2003-06-19 | Mccollum Tim | Apparatus and method for multimedia navigation |
US7546441B1 (en) * | 2004-08-06 | 2009-06-09 | Xilinx, Inc. | Coprocessor interface controller |
US20070283193A1 (en) * | 2006-04-21 | 2007-12-06 | Altera Corporation | Soft error location and sensitivity detection for programmable devices |
US20080065835A1 (en) * | 2006-09-11 | 2008-03-13 | Sun Microsystems, Inc. | Offloading operations for maintaining data coherence across a plurality of nodes |
US20110167250A1 (en) * | 2006-10-24 | 2011-07-07 | Dicks Kent E | Methods for remote provisioning of eletronic devices |
US8630829B1 (en) * | 2007-07-19 | 2014-01-14 | The Mathworks, Inc. | Computer aided design environment with electrical and electronic features |
US20090089794A1 (en) * | 2007-09-27 | 2009-04-02 | Hilton Ronald N | Apparatus, system, and method for cross-system proxy-based task offloading |
US20100241758A1 (en) * | 2008-10-17 | 2010-09-23 | John Oddie | System and method for hardware accelerated multi-channel distributed content-based data routing and filtering |
US20110029691A1 (en) * | 2009-08-03 | 2011-02-03 | Rafael Castro Scorsi | Processing System and Method |
US20130089109A1 (en) * | 2010-05-18 | 2013-04-11 | Lsi Corporation | Thread Synchronization in a Multi-Thread, Multi-Flow Network Communications Processor Architecture |
US20130086332A1 (en) * | 2010-05-18 | 2013-04-04 | Lsi Corporation | Task Queuing in a Multi-Flow Network Processor Architecture |
US20140059390A1 (en) * | 2010-10-20 | 2014-02-27 | Netapp, Inc. | Use of service processor to retrieve hardware information |
US20120246052A1 (en) * | 2010-12-09 | 2012-09-27 | Exegy Incorporated | Method and Apparatus for Managing Orders in Financial Markets |
US20130117486A1 (en) * | 2011-11-04 | 2013-05-09 | David A. Daniel | I/o virtualization via a converged transport and related technology |
US20130343407A1 (en) * | 2012-06-21 | 2013-12-26 | Jonathan Stroud | High-speed cld-based tcp assembly offload |
US20140176187A1 (en) * | 2012-12-23 | 2014-06-26 | Advanced Micro Devices, Inc. | Die-stacked memory device with reconfigurable logic |
US20140298061A1 (en) * | 2013-04-01 | 2014-10-02 | Cleversafe, Inc. | Power control in a dispersed storage network |
US20150046679A1 (en) * | 2013-08-07 | 2015-02-12 | Qualcomm Incorporated | Energy-Efficient Run-Time Offloading of Dynamically Generated Code in Heterogenuous Multiprocessor Systems |
US20150046478A1 (en) * | 2013-08-07 | 2015-02-12 | International Business Machines Corporation | Hardware implementation of a tournament tree sort algorithm |
US20160210167A1 (en) * | 2013-09-24 | 2016-07-21 | University Of Ottawa | Virtualization of hardware accelerator |
US20150234698A1 (en) * | 2014-02-18 | 2015-08-20 | Netapp, Inc. | Methods for diagnosing hardware component failure and devices thereof |
US20160094619A1 (en) * | 2014-09-26 | 2016-03-31 | Jawad B. Khan | Technologies for accelerating compute intensive operations using solid state drives |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10970119B2 (en) * | 2017-03-28 | 2021-04-06 | Intel Corporation | Technologies for hybrid field-programmable gate array application-specific integrated circuit code acceleration |
US11372684B2 (en) | 2017-03-28 | 2022-06-28 | Intel Corporation | Technologies for hybrid field-programmable gate array application-specific integrated circuit code acceleration |
US11687375B2 (en) | 2017-03-28 | 2023-06-27 | Intel Corporation | Technologies for hybrid field-programmable gate array application-specific integrated circuit code acceleration |
Also Published As
Publication number | Publication date |
---|---|
US9910705B1 (en) | 2018-03-06 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US20180196698A1 (en) | Modular offloading for computationally intensive tasks | |
Shantharama et al. | Hardware-accelerated platforms and infrastructures for network functions: A survey of enabling technologies and research studies | |
CN110088742B (en) | Logical repository service using encrypted configuration data | |
Cerović et al. | Fast packet processing: A survey | |
Zhang et al. | {G-NET}: Effective {GPU} Sharing in {NFV} Systems | |
Kim et al. | NBA (network balancing act) a high-performance packet processing framework for heterogeneous processors | |
EP3329413A1 (en) | Techniques to secure computation data in a computing environment | |
Wassel et al. | Networks on chip with provable security properties | |
US20180217823A1 (en) | Tightly integrated accelerator functions | |
Al-Aghbari et al. | Cloud-based FPGA custom computing machines for streaming applications | |
Ghasemi et al. | Accelerating apache spark with fpgas | |
Dosanjh et al. | Tail queues: a multi‐threaded matching architecture | |
Sklyarov et al. | Fast regular circuits for network-based parallel data processing | |
Kosciuszkiewicz et al. | Run-time management of reconfigurable hardware tasks using embedded linux | |
Bergmann et al. | A process model for hardware modules in reconfigurable system-on-chip | |
Vu et al. | Efficient hardware task migration for heterogeneous FPGA computing using HDL-based checkpointing | |
Rajan et al. | Trojan aware network-on-chip routing | |
Li et al. | FPGA overlays: hardware-based computing for the masses | |
Mentone et al. | CUDA virtualization and remoting for GPGPU based acceleration offloading at the edge | |
Ghasemi | A scalable heterogeneous dataflow architecture for big data analytics using fpgas | |
Jiang et al. | Properties of self-timed ring architectures for deadlock-free and consistent configuration reaching maximum throughput | |
Liu et al. | Lightweight secure processor prototype on FPGA | |
Huang et al. | Virtualizable hardware/software design infrastructure for dynamically partially reconfigurable systems | |
Behera et al. | An enhanced approach towards improving the performance of embedding memory management units into Network-on-Chip | |
Bai et al. | A hybrid ARM‐FPGA cluster for cryptographic algorithm acceleration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: ADVISORY ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STCV | Information on status: appeal procedure | Free format text: NOTICE OF APPEAL FILED |
| STCV | Information on status: appeal procedure | Free format text: EXAMINER'S ANSWER TO APPEAL BRIEF MAILED |
| STCV | Information on status: appeal procedure | Free format text: ON APPEAL -- AWAITING DECISION BY THE BOARD OF APPEALS |
| STCV | Information on status: appeal procedure | Free format text: BOARD OF APPEALS DECISION RENDERED |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION |