WO2008061162A1 - Hybrid computing platform having FPGA components with embedded processors - Google Patents

Hybrid computing platform having FPGA components with embedded processors

Info

Publication number
WO2008061162A1
Authority
WO
WIPO (PCT)
Prior art keywords
fpga
memory
microprocessor
server
computers
Application number
PCT/US2007/084723
Other languages
French (fr)
Inventor
Kent L. Gilson
James V. Yardley
Original Assignee
Star Bridge Systems, Inc.
Application filed by Star Bridge Systems, Inc.
Publication of WO2008061162A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 - Digital computers in general; Data processing equipment in general
    • G06F 15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/76 - Architectures of general purpose stored program computers
    • G06F 15/78 - Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7867 - Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture

Definitions

  • FPGA: field-programmable gate array
  • ASIC: application-specific integrated circuit
  • DRAM: dynamic random access memory
  • PCI-X: Peripheral Component Interconnect Extended
  • SAN: storage area network
  • SQL: Structured Query Language
  • The processing elements 150, 152, 154, and 156 of the second Level Two processing unit 138 are similarly interconnected through a second external connector 176 for communication external to the Level Two Subsystem 134. More specifically, the Group B connections 178 of processing element 150 interconnect with the first second Level Two processing unit external connection lines (B5). Likewise, the respective Group B connections 179-181 of processing elements 152, 154, and 156 interconnect with the second, third and fourth second Level Two processing unit external connection lines (B6, B7, B8). The first, second, third and fourth second Level Two processing unit external connection lines interconnect with the second external connector 176 to provide communication external to the Level Two Subsystem 134.
  • In another embodiment, an FPGA includes embedded memory and PowerPC processors, as shown in Fig. 10.
  • The FPGA includes dual-ported shared memory 182, 184 and two PowerPC processors 186, 188.
  • The FPGA includes 4 channels, such as Infiniband or Gbit Ethernet, to provide about 10 Gb/sec connection speed.
  • The FPGA can be connected in parallel with other FPGAs.
  • As illustrated in Fig. 11, the system 200 includes 8 FPGA clusters 202-209.
  • The FPGAs 202-209 are interconnected as illustrated in Fig. 9.
  • Each of the FPGAs 202-209 is also connected to eight independent 32-bit external I/O buses 210.
  • PCIe bus DMA channels 212 are also connected to each FPGA to provide 3.2 Gb/sec bandwidth.
  • The system 200 includes 8 independent 64-bit buses 214 to and from the router FPGA to provide 8.5 Gb/sec bandwidth.
  • Eight independent 32-bit buses 216 are connected to the FPGAs to route data to and from the cross point FPGA to provide 3.3 Gb/sec bandwidth.
  • The system 200 includes PCI-X bus DMA channels 218 to each FPGA to provide 250 Mb/sec bandwidth and 144 high speed I/O ports 220.
  • The FPGAs include 36 parallel memory channels having 72 gigabytes of DRAM.
  • The FPGAs also include 12 independent 32-bit buses between each of the FPGAs to provide 6.4 Gb/sec bandwidth.
  • The system can include dual Xeon processors, a 1.5 terabyte hard drive, and 16 gigabytes of DRAM.
  • The FPGAs include 144 high speed I/O channels at 18 to 45 Gb/sec bandwidth and 8 independent 32-bit external I/O buses to provide 3.2 Gb/sec bandwidth.
  • The FPGAs are connected together using dual-channel Gbit Ethernet connections at 250 Mb/sec bandwidth.
  • As shown in Fig. 12, a hybrid computing server 222 includes HC62 computers 224-228, discussed previously, having FPGAs with embedded processors.
  • The server has a multilevel memory model.
  • The computers 224-228 include a memory layer level 1 with 36 gigabytes per HC62 and 36 memory channels to and from the FPGAs.
  • A memory layer level 2 includes 16 gigabytes per HC62.
  • A memory layer level 3 includes 1.5 terabytes of hard drive storage per HC62.
  • A memory layer level 4 includes multiple terabytes per SAN.
  • The HC62 computers 224-228 are connected together with buses each having a 6
  • The computers are also connected to a router 230 using an 18 Gb/sec bandwidth Ethernet connection.
  • The router 230 is connected to the SAN 232 with a 10 Gb/sec bandwidth to complete the server connection.
  • An advantage of a communication and processing unit as disclosed is that there is an approximate balance in which communications capability, compute capability, and granularity of compute resources can all scale linearly, so that an overall system comprising many elements of the general type described above can meet the communications, compute, and granularity demands of increasingly complex algorithms.
  • The ability to stack communication and processing units in three dimensions can reduce the distance between adjacent processing elements. As a result, the time required to communicate information between different communication and processing elements can be reduced. By providing many intra-connection lines and many external connection lines, there can be a relatively high volume of communication between processing elements.
  • This high volume of communication makes possible improved cooperation among processing elements in performing a computation task.
  • The large amount of interconnection resources and other interconnections defined elsewhere herein permits the scaling up of the basic architecture of the communication and processing unit to a much larger scale which, in turn, permits higher granularity (i.e., more bits of information to be processed together) so that more complex operations can be performed efficiently.

Abstract

A system includes at least two microprocessor cores wired in parallel with each other. Each microprocessor core includes an FPGA with an embedded processor. A memory element is electrically connected to the microprocessor core. FPGA fabric is in communication with the FPGA including a portion of a programming code for the system. An operating system is in communication with each of the microprocessor cores to run a remaining portion of the programming code.

Description

HYBRID COMPUTING PLATFORM HAVING FPGA COMPONENTS
WITH EMBEDDED PROCESSORS
BACKGROUND OF THE INVENTION
1. The Field of the Invention
The present invention relates to systems and devices having a field-programmable gate array (FPGA), and more particularly to a hybrid computing platform having FPGA components with embedded processors.
2. The Relevant Technology
An FPGA is an integrated circuit (IC) that can be programmed in the field after it has been manufactured. FPGAs have generally been used in communication applications, such as in mobile phone applications. A configurable computer often is more versatile than a special purpose device, for example, an application-specific integrated circuit (ASIC), which may not be configurable to perform a wide range of tasks. The configurable computer, or perhaps an array of programmable elements, often can be configured to perform specialized functions faster than a general purpose processor. The configurable computer can be optimally configured for specific tasks. A general purpose processor, however, is suited to a wide variety of devices and often may not be optimized for a particular task.
U.S. Pat. Nos. 5,361,373 and 5,600,845, both issued to Gilson, entitled Integrated Circuit Computing Device Comprising Dynamically Configurable Gate Array Having a Microprocessor and Reconfigurable Instruction Execution Means and Method Therefor, disclose an integrated circuit computing device comprised of a dynamically configurable Field Programmable Gate Array (FPGA). This gate array is configured to implement a RISC processor and a Reconfigurable Instruction Execution Unit.
A challenge in developing computer systems in general, and reconfigurable computing systems in particular, is communication among processing elements (e.g., FPGAs) in the system. The ability to reconfigure processing elements to perform different tasks generally requires the ability to also reconfigure communication among processing elements to meet the needs of the task at hand. The following patents illustrate just a few prior solutions to the problem of reconfiguring communication among reconfigurable processing elements.
U.S. Pat. No. 5,020,059, issued to Gorin et al., entitled Reconfigurable Signal Processor, discloses an interconnection scheme among processing elements (PEs) of a multiprocessor computing architecture, and means utilizing the unique interconnections for realizing, through PE reconfiguration, both fault tolerance and a wide variety of different overall topologies, including binary trees and linear systolic arrays. The reconfigurability allows many alternative PE network topologies to be grown or embedded in a PE lattice having identified PE or inter-PE connection faults.
PE configurations assembled as a binary tree, for example, have the advantageous property that if the number of PEs in the array is doubled, the number of layers through which communications must pass increases by only one. This property, known as logarithmic communications radius, is desirable for large-scale PE arrays since it adds the least additional processing time for initiating communications between the Host and PEs.
U.S. Pat. No. 5,661,662 issued to Butts et al., entitled Structures And Methods For Adding Stimulus and Response Functions To a Circuit Design Undergoing Emulation, discloses a plurality of electronically reconfigurable gate array logic chips interconnected via a reconfigurable interconnect, and electronic representations of large digital networks that are converted to take temporary operating hardware form on the interconnected chips. The reconfigurable interconnect permits the digital network realized on the interconnected chips to be changed at will, making the system well suited for a variety of purposes, including simulation, prototyping, executing, and computing.
U.S. Pat. No. 5,684,980 issued to Casselman, entitled FPGA Virtual Computer for Executing a Sequence of Program Instructions by Successively Reconfiguring a Group of FPGA in Response to Those Instructions, discloses an array of FPGAs whose configurations change successively during performance of successive algorithms or instructions, in the manner of a computer executing successive instructions. In one aspect of the Casselman disclosure, adjacent FPGAs in the array are connected through external field programmable interconnection devices or cross-bar switches in order to relieve the internal resources of the FPGAs from any external connection tasks. This solved a perceived problem of having to employ 90% of the internal FPGA resources on external interconnection.
U.S. Pat. No. 5,956,518 issued to DeHon et al., entitled Intermediate-Grain Reconfigurable Processing Device, discloses a programmable integrated circuit which utilizes a large number of intermediate-grain, multibit processing elements arranged in a configurable mesh. Configuration control data defines data paths through the interconnect, which can be address inputs to memories, data inputs to memories and logic units, and instruction inputs to logic units. The interconnect is configurable to define an interdependent functionality of the functional units. Programmable configuration storage stores the reconfiguration data.
Despite advances in reconfigurable communications among processing elements in reconfigurable computer systems, there continues to be a need for improvements in the interplay between reconfigurable processing elements and the reconfigurable communication resources that interconnect such processing elements. There also exists a need to effectively apply the characteristics of fractals, which are ubiquitous in nature, to the design of computer systems. That is, there is a need for an improved computer system which exhibits fractal-like qualities, namely a meaningful degree of self-similarity on reducing scale, like the self-similarity that is manifest in nature.
BRIEF SUMMARY OF THE INVENTION
A system includes at least two microprocessor cores wired in parallel with each other. Each microprocessor core includes an FPGA with an embedded processor. A memory element is electrically connected to the microprocessor core. FPGA fabric is in communication with the FPGA including a portion of a programming code for the system. An operating system is in communication with each of the microprocessor cores to run a remaining portion of the programming code. In another aspect of the invention, the memory is embedded in the microprocessor core or is connected to the processor with a dual port.
In a further aspect of the invention, the operating system further includes built-in logic analyzer functions combined with standard software debug tools to provide mixed mode debugging. In a further aspect of the invention, the system further includes a graphical interface to create a schematic that mimics the FPGA and functional operations. The graphical interface can display the schematic in a relational form of the actual FPGA used in the operating system with the functional operations. The graphical interface can convert the relational form of the schematic into binary code. In a further aspect of the invention, the system can be part of multiple hybrid computers, the hybrid computers being connected together to form a hybrid computing server. The hybrid computing server can be configured to form a multilevel memory model. The multilevel memory model can include a plurality of memory layer levels. In another aspect of the invention, a server system includes at least two computers interconnected through a network connection. Each of the computers includes at least two microprocessor cores wired in parallel with each other. Each microprocessor core includes an FPGA with an embedded processor. A memory element is electrically connected to the microprocessor core. FPGA fabric is in communication with the FPGA including a portion of a programming code for the system. An operating system is in communication with each of the microprocessor cores in each of the computers to run a remaining portion of the programming code.
These and other objects and features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
To further clarify the above and other advantages and features of the present invention, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It is appreciated that these drawings depict only typical embodiments of the invention and are therefore not to be considered limiting of its scope. The invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
Figure 1 is a schematic illustration of a microprocessor core system in accordance with the present invention;
Figure 2 is a schematic illustration of a single FPGA configuration in accordance with the present invention;
Figure 3 is a schematic illustration of a high performance FPGA database appliance in accordance with the present invention;
Figure 4 is a schematic illustration of a high performance FPGA database appliance in accordance with the present invention;
Figure 5 is a schematic illustration of a processing element cluster in accordance with the present invention;
Figure 6 is a schematic illustration of a processing element connected to memory in accordance with the present invention;
Figure 7 is a schematic illustration of FPGA clusters in accordance with the present invention;
Figure 8 is a schematic illustration of a single FPGA configuration in accordance with the present invention;
Figure 9 is a schematic illustration of a Level Two Subsystem in accordance with the present invention;
Figure 10 is a schematic illustration of a processing element in accordance with the present invention;
Figure 11 is a schematic illustration of an HC-62 FPGA system in accordance with the present invention; and
Figure 12 is a schematic illustration of a hybrid computing server in accordance with the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The various embodiments of the present invention include a hybrid computing platform having FPGA components with embedded processors. The computing platform can be part of a system, such as a hybrid computer system. The system integrates cluster computing in the FPGA, instead of arranging FPGAs in the cluster as implemented by others. This arrangement provides FPGAs in a unique architecture that combines cluster computing, such as a Linux cluster, tightly coupled with highly parallel FPGA fabric. This concept achieves a very high level of heterogeneous computing through the tightly coupled integration of serial processors and FPGAs. Software code, such as legacy C or Fortran, can be easily profiled to find the compute-intensive areas of algorithms. Those areas can be programmed through enterprise programming code to run in FPGA fabric to achieve extreme acceleration in compute time. Vector computing, cluster computing, and reconfigurable computing are mixed and matched to create a customized, performance-enhanced hardware/software system.
The system is useful in a variety of applications. For instance, the system can be used to decode DNA or calculate weather patterns. The code does not need to be rewritten to adapt it to the system. Instead, a portion of the existing code is taken from the program and placed in the FPGA. Table lookup functions, for example, can be processed by the FPGA. Moving the code to the FPGA reduces latency.
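As a rough illustration of the offload pattern just described, the following C sketch shows a legacy call site whose hot inner kernel is redirected to the fabric when a hardware path is available. The fpga_run() call is a hypothetical stand-in for whatever runtime the platform actually provides (the patent does not specify one at this level of detail); it is stubbed out here so the example compiles and falls back to the software path.

    /* Sketch of the profile-and-offload flow; fpga_run() is an assumed,
     * illustrative API, stubbed so the example compiles standalone. */
    #include <stdio.h>
    #include <stddef.h>

    static int fpga_run(const char *kernel, const float *a, const float *b,
                        float *out, size_t n)
    {
        (void)kernel; (void)a; (void)b; (void)out; (void)n;
        return -1;                  /* no fabric present: report failure */
    }

    /* The compute-intensive routine identified by profiling: a dot product. */
    static void dot_sw(const float *a, const float *b, float *out, size_t n)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++)
            acc += a[i] * b[i];
        *out = acc;
    }

    /* The legacy call site keeps its signature; only the hot kernel moves
     * into FPGA fabric, while the rest of the C program runs unchanged. */
    static void dot(const float *a, const float *b, float *out, size_t n)
    {
        if (fpga_run("dot", a, b, out, n) != 0)
            dot_sw(a, b, out, n);   /* software fallback */
    }

    int main(void)
    {
        float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, r;
        dot(a, b, &r, 4);
        printf("dot = %.1f\n", r);  /* prints: dot = 20.0 */
        return 0;
    }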
Memory bandwidth problems can be significantly reduced with 36 independent memory channels of 1 gigabyte or 2 gigabytes of memory capacity per channel. I/O bandwidth problems can be solved with 144 high speed I/O interfaces that can be programmed for customized or industry standard I/O. Interfaces such as Infiniband, PCI Express, Fibre Channel, Serial Rapid I/O, 1 Gb/s Ethernet channels, and other industry standard interfaces can be programmed for interfacing with SAN storage or other methods of data storage, high speed I/O, and other peripheral communications.
The FPGA fabric provides over 1.0 billion gates of programmable logic for high speed parallel computing. A high level object oriented programming language for FPGAs can combine C code running on the Linux clusters with effective and efficient use of parallelism in the FPGA. The programming code has been found to be an effective tool for controlling multiple functions spread across multiple FPGAs and multiple processors. It provides an efficient programming model for Linux cluster systems and heterogeneous computing. In addition, mixed mode debugging is facilitated through the use of built-in logic analyzer functions combined with standard software debug tools.
Processors are embedded in the FPGAs to make the hypercomputer powerful. While hypercomputers are discussed here, any general purpose platform can be used. The hardware for parallel computing is robust, and multiple hardware configurations are available. The user can program the system to execute C code without reprogramming the existing system.
Software can be used to program the FPGAs in the platform. The software can use a graphical interface to make the programming easy. The interface is intuitive for the user to program the circuits. The FPGAs are reconfigurable to allow parallel computing. The parallel computing reduces latency in the system by executing certain commands through multiple FPGAs at the same time. Depending on how many FPGAs are used, the parallel architecture can multiply the computing speed many times over.
The hybrid computing includes the flexibility of a microprocessor and the parallel computing of an FPGA. The system uses C code and optimized code acceleration. With this arrangement the execution speed of the application is accelerated. In previous systems, latency has been a big issue and has prevented the model from being successful. There was delay in passing data from the processor to the FPGA, and the data became bottlenecked.
In the present invention, the hybrid computer includes a microprocessor embedded in the FPGA. The microprocessor is tightly coupled to the FPGA fabric. The architecture virtually eliminates the latency in the system. The microprocessor embedded in the FPGA allows code profiling, allows algorithms and/or subroutines to be implemented in the FPGA fabric, and increases efficiency.
A system 10 in accordance with the various embodiments of the invention includes a microprocessor 12 at the core, as illustrated in Fig. 1. The FPGA fabric 14 is connected to the microprocessor 12 through dual port memory 16. The FPGA fabric 14 includes math libraries, acceleration functions, and/or other code. Dynamic random access memory (DRAM) 18 is electrically connected to the microprocessor 12 to provide a memory function for the system 10. A block RAM loader 20 is also connected to the microprocessor 12 to help manage the memory associated with the microprocessor 12. An operating system 22 is tied into the microprocessor 12 to execute the software programs and run the hardware associated with the system 10. A hardware socket layer TCP/IP 24 can be electrically connected to the microprocessor 12 to connect the system 10 to a network.
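The following sketch suggests how a program on the embedded processor might reach the fabric through a dual-port memory such as element 16 of Fig. 1. The mailbox layout, base address, and handshake are illustrative assumptions only, not taken from the patent; a real system would use the memory map of its own platform.

    /* Illustrative processor-side view of a dual-port RAM mailbox.
     * DPRAM_BASE and the register layout are assumptions for the sketch. */
    #include <stdint.h>

    #define DPRAM_BASE 0x80000000u    /* assumed address of dual-port RAM */

    typedef struct {
        volatile uint32_t cmd;        /* processor writes a command code  */
        volatile uint32_t status;     /* fabric writes nonzero when done  */
        volatile uint32_t args[14];   /* operands for the fabric function */
        volatile uint32_t result[16]; /* results written back by fabric   */
    } dpram_mailbox;

    static dpram_mailbox *const mbox = (dpram_mailbox *)DPRAM_BASE;

    /* Ask the fabric to run one accelerated function and wait for it.
     * Both ports see the same memory, so no bus transfer is needed; this
     * tight coupling is the latency advantage described above. */
    uint32_t fabric_call(uint32_t cmd, uint32_t arg0)
    {
        mbox->args[0] = arg0;
        mbox->status  = 0;
        mbox->cmd     = cmd;
        while (mbox->status == 0)
            ;                         /* spin until the fabric responds   */
        return mbox->result[0];
    }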
Fig. 2 illustrates another embodiment of the invention having a single FPGA configuration. A system 30 includes memory 32, such as a 10 port shared memory chip. Microprocessor cores 34, 36, 38, and 40 are connected to the memory 32. DRAM 42, 44, 46, and 48 are connected to each of the microprocessor cores 34, 36, 38, and 40, respectively. The memory 32 is connected to an external bus FIFO 50, a router bus FIFO 52, a cross point FIFO 54, an A bus FIFO 56, a B bus FIFO 58, and a C bus FIFO 60 to route data from the memory 32 to other components associated with the system 30.
The architecture in the system is configured to meet various user needs. Software code, such as C, C++, Fortran, VHDL, and the like, can be adapted to the system. The FPGAs are programmed with a portion of the code to speed up the processing time. The system can communicate with Vector IP, Cluster IP, and switches using high speed storage with lower power requirements.
The cluster is embedded in the FPGA in the system. Previously, the FPGA was embedded in the cluster. This architecture provides many advantages over the related art. The system, for example, includes the flexibility of a microprocessor, the speed of FPGA hardware, and virtually no latency. The microprocessors are placed in parallel with FPGA hardware to accelerate code execution. The programs are written in both environments, enterprise software and the FPGA, and both environments can be executed simultaneously. The system takes advantage of the versatility of the microprocessor. It also takes advantage of the parallel hardware architecture for compute-intensive subroutines. Based on the user's needs and application requirements, the system can be optimized for application performance. The system uses Amdahl's law to speed up the processing of a given code.
Typically, about 90 percent of the processing time can be attributed to 10 percent of the code. This processing-intensive portion of the code is identified and moved to be accelerated through the FPGA hardware. The processing time is then reevaluated and the next largest processing-intensive portion of the code is located. This second portion of the code is executed in parallel on the FPGA hardware. This process can be repeated numerous times to reduce the processing-intensive portion of the code. Multiple portions of the existing code can be accelerated through the FPGA hardware in parallel. In addition, the same algorithm can be run in parallel through multiple processors to speed up the processing until the process reaches the limits of the algorithm. In this manner, the algorithm can be made, for example, 4 times, 8 times, 16 times, 32 times, or another configured number of times faster, within the limits of the algorithm.
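Amdahl's law makes these limits concrete. If a fraction p of the run time moves into fabric and is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The short C program below works this out for the 90/10 split described in the text; the specific acceleration factors are just the example values named above.

    /* Worked Amdahl's-law example for a 90/10 hot-code split. */
    #include <stdio.h>

    static double amdahl(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);   /* overall speedup */
    }

    int main(void)
    {
        const double p = 0.90;              /* 90% of time in 10% of code  */
        const double s[] = {4, 8, 16, 32};  /* fabric acceleration factors */
        for (int i = 0; i < 4; i++)
            printf("fabric %2.0fx -> overall %.2fx\n", s[i], amdahl(p, s[i]));
        /* Prints 3.08x, 4.71x, 6.40x, 7.80x. Even with unlimited
         * acceleration the ceiling is 1/(1-p) = 10x, which is why the
         * process is repeated on the next-hottest region of the code. */
        return 0;
    }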
In Fig. 3, an example of a high performance FPGA database appliance is illustrated. Two HC62 computers 62, 64 are connected together through an 8×32-bit bus 66. Each of the HC62 computers 62, 64 is connected to an HC Host 68, 70 through Peripheral Component Interconnect Extended (PCI-X) bus technology 72, 74. Each of the HC Hosts 68, 70 is connected to a switch 80 through 2×1 Gb TCP/IP connections 76, 78. The HC62 computers are also each connected to the switch 80 through 32×100 Mb TCP/IP connections 82, 84. A storage area network (SAN) 86 interconnects the system to the HC62 computers 62, 64 through TCP/IP 88 to the switch 80.
Another example of a high performance FPGA database appliance is shown in Fig. 4. A SAN 90 is connected to disk cache 92 and DRAM cache 94. The disk cache 92 is connected to an HC host cache controller 96. The DRAM cache 94 is connected to an HC62 cache controller 98. The HC host cache controller 96 and the HC62 cache controller 98 are connected together. The HC host cache controller 96 is also connected to an HC host command router 100 that is connected to a router 102. The HC62 cache controller 98 is connected to an HC62 command router 104. The HC62 command router 104 is also connected to the router 102. A Structured Query Language (SQL) request 106 can then be sent to and from the router 102 with respect to the SAN 90.
An example of an FPGA configuration is shown in Figs. 5 and 6. The processing element 110 includes an embedded microprocessor and is connected to four memory cards, such as high speed 2 Gb memories 112, 114, 116, and 118. The processing element 110 is connected to 4 high speed I/O channels, such as Infiniband, Gbit Ethernet, and the like. The FPGA 110 is also connected to a 74-bit bus to interconnect the FPGA 110 to FPGA fabric. The FPGA includes a cluster of two PowerPC processors.
The processing element 110 includes a multiplicity of external connection pins about its perimeter as shown. Intersecting dashed lines demarcate in conceptual terms four different regions 120, 122, 124, and 126 of the processing element 110. Each of the memory resources 112, 114, 116, and 118 is interconnected with processing element pins adjacent to a corresponding processing element region. For instance, memory resource 112 is interconnected with pins that carry address, data, and control information between memory resource 112 and the processing element 110. The memory resource 112 is interconnected with the processing element 110 through pins adjacent to processing element region 120. The memory resource 114 in a like manner is interconnected to processing element 110 by pins adjacent to processing element region 122. Memory resource 116 is interconnected to processing element 110 by pins adjacent to processing element region 124. Memory resource 118 is interconnected to processing element 110 by pins adjacent to processing element region 126.
A first set of external connection pins generally disposed about a first side of the processing element 110 are grouped together. This first group shall be referred to herein as the Group A Connections 128. A second group of external connection pins generally disposed along a second side of the processing element 110 are grouped together as a second group, which shall be referred to herein as the Group B Connections 130. The Group A Connections 128 and the Group B Connections 130 are generally disposed along opposite sides of the processing element. A third group of external connection pins is distributed about the periphery of the processing element. This third group shall be referred to herein as the Group C Connections 132. The Group A, B and C external pin connections are labeled accordingly in Fig. 9. The large arrows associated with the Group A and B connections are intended to indicate that each of these is generally disposed as a group along opposite sides of the processing element 110. Clock signal external connections CL can be provided to external connection pins disposed near the four corners of the processing element 110.
In a present embodiment of the invention, the processing element 110 comprises a field programmable gate array (FPGA), and the memory resources 112, 114, 116, and 118 comprise dynamic random access memory. More specifically, in a current implementation, the processing element 110 is a Virtex-4 FX FPGA produced by Xilinx, Inc., having offices in San Jose, California. An FPGA device comprises hardware resources that can be configured to perform the functions required by virtually any application. For example, FPGAs produced by Xilinx, Inc. comprise configurable logic block (CLB) resources that can be configured to perform different functions. FPGAs produced by National Semiconductor Corporation, for example, include "cells" that can be configured to perform different functions. Similarly, FPGAs produced by Altera, Inc. include logic array blocks (LABs) that can be configured to perform different functions. These are just a few examples of different types of FPGAs.
Memory can be used to store, or to assign, a value for a single bit of information. A computation unit (e.g., a CLB, cell, or LAB) of an FPGA typically operates on a few bits of information at a time. A processor ALU may operate on more bits at a time. Of course, there is no clear line to be drawn between a memory, an FPGA, or a processor. For instance, an FPGA may employ lookup table memory to implement a compute function, and a processor may be programmed to operate on one or two bit wide data.
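The lookup-table point is worth making concrete, since it is how FPGA fabric blurs the memory/compute line: a LUT computes by storing a function's truth table and reading it back. The sketch below imitates a single 4-input LUT cell in C; the chosen boolean function is arbitrary.

    /* A 16 x 1-bit truth table standing in for one 4-input FPGA LUT. */
    #include <stdio.h>
    #include <stdint.h>

    static uint16_t lut;    /* one truth-table bit per input combination */

    /* "Configuration": store the function (a AND b) XOR (c OR d). */
    static void lut_program(void)
    {
        for (unsigned in = 0; in < 16; in++) {
            unsigned a = in & 1, b = (in >> 1) & 1;
            unsigned c = (in >> 2) & 1, d = (in >> 3) & 1;
            lut |= (uint16_t)((((a & b) ^ (c | d)) & 1u) << in);
        }
    }

    /* "Execution" is a pure memory read; no logic is evaluated at all. */
    static unsigned lut_eval(unsigned in)
    {
        return (lut >> (in & 0xFu)) & 1u;
    }

    int main(void)
    {
        lut_program();
        printf("f(a=1,b=1,c=0,d=0) = %u\n", lut_eval(0x3u));  /* prints 1 */
        return 0;
    }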
A processor, for example, is quite effective at what might be termed sequence division multiplexed operations. Basically, a series of instructions causes the processor to change the state of the processor system from time to time so that the processor compute resources can be re-used by different instructions. Thus, as a sequence of instructions is provided to the processor, the processor's state changes so that the same processor hardware can be re-used to perform different functions. An FPGA-type processing element might, from time to time, be configured to operate as a non-multiplexed device. That is, it may be configured so that the compute resources do not change state. That same processing element later might be reconfigured so as to operate more like a processor, in which compute resources change state so that they can be multiplexed and re-used for a variety of different functions.
As illustrated in Fig. 7, a cluster of 4 processors can be interconnected in the system. In this example, the FPGAs are partitioned into 3 processors each for a total of 12 processors and 8 gigabytes of RAM. The FPGAs can have a soft core and can be configured to act like a processor. Each FPGA can be partitioned to include as many processors as needed for a given system, for example, 4 to 16 processors. The FPGAs can be configured to be homogeneous or heterogeneous. For example, the FPGA can be configured to have 5 to 10 processors that operate at different speeds depending on the application requirements. The example in Fig. 7 includes clusters of 4 FPGAs. Each FPGA includes 4 channels to provide a total of 16 high speed I/O channels. The FPGAs also can include dual ported, shared memory. The memory and the processor are printed on the same chip. When the memory is embedded on the same chip, the FPGA does not require a memory controller. The embedded memory speeds up communication between the processor and the memory.
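As a software stand-in for the distinction drawn above between sequence-multiplexed processor operation and a non-multiplexed fabric configuration, the same 4-tap filter can be written two ways: a loop that re-uses one multiply-accumulate resource across iterations, the way a processor re-uses its ALU under sequenced control, and an unrolled form whose fixed dataflow mirrors how the computation would be laid out spatially in fabric. This is only an analogy in C, not the patent's own example.

    #include <stdio.h>

    static const float h[4] = {0.25f, 0.25f, 0.25f, 0.25f};

    /* Temporal: one compute resource re-used, state changing each step. */
    static float fir_looped(const float x[4])
    {
        float acc = 0.0f;
        for (int i = 0; i < 4; i++)
            acc += h[i] * x[i];     /* same multiplier, different state */
        return acc;
    }

    /* Spatial: a fixed, non-multiplexed dataflow; configured fabric
     * would give each product its own multiplier, all working at once. */
    static float fir_unrolled(const float x[4])
    {
        return h[0]*x[0] + h[1]*x[1] + h[2]*x[2] + h[3]*x[3];
    }

    int main(void)
    {
        const float x[4] = {1, 2, 3, 4};
        printf("%.2f %.2f\n", fir_looped(x), fir_unrolled(x)); /* 2.50 2.50 */
        return 0;
    }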
In Fig. 8, a single FPGA configuration is shown. The configuration includes 4 parallel memory channels with 8 gigabytes of DRAM. The configuration also includes 4 embedded processors and 8 embedded PowerPC processors. The chip has shared memory access for communications and 16 high speed I/O channels. The chip includes a 32-bit A-bus, a 32-bit B-bus, and a 32-bit C-bus to the FPGAs in the quad. The configuration also includes a 32-bit external bus for expansion, a 25-bit bus to the cross point FPGA, and a 64-bit bus to the router FPGA.
Fig. 9 shows a block diagram of a Level Two Subsystem 134 in accordance with an embodiment of the invention. The Level Two Subsystem 134 includes a first Level Two processing unit 136, a second Level Two processing unit 138, and a Level Two communication and processing unit 140. The first Level Two processing unit 136 comprises a network of processing elements like processing element 110 of Figs. 5 and 6. More specifically, the first Level Two processing unit 136 includes processing elements 142, 144, 146, and 148. Each of these processing elements is interconnected with memory resources like the interconnection of memory resources 112, 114, 116, and 118 with processing element 110 as shown in Figs. 5 and 6. The processing elements 142, 144, 146, and 148 can also include embedded memory in each of the processing elements. Similarly, the second Level Two processing unit 138 comprises a network of processing elements 150, 152, 154, and 156. The Group A external connections of each of the processing elements 142, 144, 146, and 148 of the first Level Two processing unit 136 are interconnected with first Level Two intra-connection lines (A1) 166, which interconnect processing elements 142, 144, 146, and 148 and the communication and processing unit 140. More particularly, processing element 142 includes Group A external connections 158 that are interconnected with the first Level Two intra-connection lines 166. Similarly, processing elements 144, 146, and 148 include respective Group A connections 159-161 that are interconnected with the first Level Two intra-connection lines 166. Likewise, second Level Two intra-connection lines (A2) 168 interconnect the Group A external connections of processing elements 150, 152, 154, and 156 with the communication and processing unit 140. More specifically, the respective Group A external connections 162 of processing element 150 are interconnected with the second Level Two intra-connection lines (A2) 168. Similarly, the respective Group A connections 163-165 of respective processing elements 152, 154, and 156 are interconnected with the second Level Two intra-connection lines 168.
The processing elements 142, 144, 146, and 148 of the first Level Two processing unit 136 and the processing elements 150, 152, 154, and 156 of the second Level Two processing unit 138 can communicate external to the Level Two Subsystem 134 through their respective Group B external connections. More specifically, the Group B connections 170 of processing element 142 interconnect with the first of the first Level Two processing unit external connection lines (B1). Similarly, the respective Group B external connections 171-173 of processing elements 144, 146, and 148 interconnect with the second, third, and fourth first Level Two processing unit external connection lines (B2, B3, B4), respectively. Each of the first, second, third, and fourth first Level Two processing unit external connection lines communicates with a first external connector 174, which provides communication external to the Level Two Subsystem 134.
The processing elements 150, 152, 154, and 156 of the second Level Two processing unit 138 are similarly interconnected through a second external connector 176 for communication external to the Level Two Subsystem 134. More specifically, the Group B connections 178 of processing element 150 interconnect with the first of the second Level Two processing unit external connection lines (B5). Likewise, the respective Group B connections 179-181 of processing elements 152, 154, and 156 interconnect with the second, third, and fourth second Level Two processing unit external connection lines (B6, B7, B8), respectively. The first, second, third, and fourth second Level Two processing unit external connection lines interconnect with the second external connector 176 to provide communication external to the Level Two Subsystem 134.
In another embodiment of the invention, an FPGA includes embedded memory and PowerPC processors as shown in Fig. 10. The FPGA includes dual-ported shared memory 182, 184 and two PowerPC processors 186, 188. The FPGA includes 4 channels, such as InfiniBand or Ethernet channels, to provide about 10 Gb/sec connection speed. The FPGA can be connected in parallel with other FPGAs.
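A minimal sketch of how two embedded processors might coordinate through dual-ported shared memory follows. The mailbox layout, function names, and polling protocol are editor assumptions for illustration only; a real design would place the structure in the FPGA's dual-port block RAM and add whatever memory barriers the platform requires.

#include <stdint.h>
#include <stdio.h>

/* One processor writes a word and raises a flag through one port
 * of the shared memory; the other polls the flag through the
 * second port, consumes the word, and clears the flag. */
struct mailbox {
    volatile uint32_t flag;   /* 0 = empty, 1 = full */
    volatile uint32_t data;
};

/* Producer side (e.g., the first embedded processor). */
static void mbox_send(struct mailbox *mb, uint32_t word)
{
    while (mb->flag)          /* spin until the consumer drains it */
        ;
    mb->data = word;
    mb->flag = 1;
}

/* Consumer side (e.g., the second embedded processor). */
static uint32_t mbox_recv(struct mailbox *mb)
{
    while (!mb->flag)         /* spin until the producer posts */
        ;
    uint32_t word = mb->data;
    mb->flag = 0;
    return word;
}

int main(void)
{
    static struct mailbox mb;   /* zero-initialized: starts empty */
    mbox_send(&mb, 0xC0FFEEu);
    printf("received 0x%X\n", mbox_recv(&mb));  /* single-core demo */
    return 0;
}

Because each processor uses its own port, neither needs a memory controller in the path, which is consistent with the embedded-memory advantage described above.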
An example of an HC-62 FPGA system is illustrated in Fig. 11. The system 200 includes 8 FPGA clusters 202-209. The FPGAs 202-209 are interconnected as illustrated in Fig. 9. Each of the FPGAs 202-209 is also connected to eight independent 32-bit external I/O buses 210. PCIe bus DMA channels 212 are also connected to each FPGA to provide 3.2 Gb/sec bandwidth. The system 200 includes 8 independent 64-bit buses 214 to and from the router FPGA to provide 8.5 Gb/sec bandwidth. Eight independent 32-bit buses 216 are connected to the FPGAs to route data to and from the cross point FPGA to provide 3.3 Gb/sec bandwidth. The system 200 includes PCIX bus DMA channels 218 to each FPGA to provide 250 Mb/sec bandwidth and 144 high speed I/O ports 220. The FPGAs include 36 parallel memory channels having 72 gigabytes of DRAM. The FPGAs also include 12 independent 32-bit buses between each of the FPGAs to provide 6.4 Gb/sec bandwidth.
In another embodiment of the invention, the system can include dual Xeon processors, a 1.5-terabyte hard drive, and 16 gigabytes of DRAM. The FPGAs include 144 high speed I/O channels at 18 to 45 Gb/sec bandwidth and 8 independent 32-bit external I/O buses to provide 3.2 Gb/sec bandwidth. The FPGAs are connected together using dual-channel gigabit Ethernet connections at 250 Mb/sec bandwidth.
Multiple hybrid computers can be connected together to create a hybrid computing server as shown in Fig. 12. A hybrid computing server 222 includes HC62 computers 224-228, as previously discussed, having FPGAs with embedded processors. The server has a multilevel memory model. The computers 224-228 include a memory layer level 1 with 36 gigabytes per HC62 and 36 memory channels to and from the FPGAs. A memory layer level 2 includes 16 gigabytes per HC62. A memory layer level 3 includes 1.5 terabytes of hard drive storage per HC62. A memory layer level 4 includes a multi-terabyte SAN. The HC62 computers 224-228 are connected together with buses each having a 6 Gb/sec bandwidth. The computers are also connected to a router 230 using an 18 Gb/sec bandwidth Ethernet connection. The router 230 is connected to the SAN 232 with a 10 Gb/sec bandwidth link to complete the server connection.
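The four memory layers per HC62 node can be summarized in a short C sketch. Capacities for levels 1 through 3 follow the figures above; the SAN capacity shown is an arbitrary stand-in for "multi-terabyte", and the struct and field names are editor assumptions.

#include <stdio.h>

/* Illustrative table of the four-layer memory model per HC62 node. */
struct memory_layer {
    int         level;
    const char *kind;
    double      capacity_gb;   /* per HC62 node */
};

static const struct memory_layer model[] = {
    { 1, "FPGA-attached DRAM (36 channels)",   36.0 },
    { 2, "host DRAM",                          16.0 },
    { 3, "local hard drive",                 1536.0 },   /* 1.5 TB */
    { 4, "SAN",                              4096.0 },   /* assumed stand-in */
};

int main(void)
{
    for (int i = 0; i < 4; i++)
        printf("level %d: %-34s %8.1f GB\n",
               model[i].level, model[i].kind, model[i].capacity_gb);
    return 0;
}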
An advantage of a communication and processing unit as disclosed is the approximate balance among communications capability, compute capability, and granularity of compute resources: linear scaling of each enables an overall system comprising many elements of the general type described above to meet the communications, compute, and granularity demands of increasingly complex algorithms. The ability to stack communication and processing units in three dimensions can reduce the distance between adjacent processing elements. As a result, the time required to communicate information between different communication and processing elements can be reduced. By providing many intra-connection lines and many external connection lines, a relatively high volume of communication between processing elements is possible.
Moreover, this high volume of communication makes possible improved cooperation among processing elements in performing a computation task. As explained more fully herein, the large amount of interconnection resources and other interconnections defined elsewhere herein permits the scaling up of the basic architecture of the communication and processing unit to a much larger scale, which, in turn, permits higher granularity (i.e., more bits of information to be processed together) so that more complex operations can be performed efficiently.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims

What is claimed is:
1. A system comprising:
at least two microprocessor cores wired in parallel with each other, each microprocessor core including an FPGA with an embedded processor;
a memory element electrically connected to the microprocessor core;
FPGA fabric being in communication with the FPGA including a portion of a programming code for the system; and
an operating system being in communication with each of the microprocessor cores to run a remaining portion of the programming code.
2. The system of claim 1, wherein the memory is embedded in the microprocessor core.
3. The system of claim 1, wherein the memory is connected to the processor with a dual port.
4. The system of claim 1, wherein the operating system further comprises built-in logic analyzer functions combined with standard software debug tools to provide mixed-mode debugging.
5. The system of claim 1, further comprising a graphical interface to create a schematic to mimic the FPGA and functional operations.
6. The system of claim 5, wherein the graphical interface displays the schematic in a relational form of the actual FPGA used in the operating system with the functional operations.
7. The system of claim 6, wherein the graphical interface converts the relational form of the schematic into binary code.
8. The system of claim 1, wherein the system is part of multiple hybrid computers, the hybrid computers being connected together to form a hybrid computing server.
9. The system of claim 8, wherein the hybrid computing server is configured to form a multilevel memory model.
10. The system of claim 9, wherein the multilevel memory model includes a plurality of memory layer levels.
11. A server system comprising:
at least two computers interconnected through a network connection, each of the computers including at least two microprocessor cores wired in parallel with each other, each microprocessor core including an FPGA with an embedded processor;
a memory element electrically connected to the microprocessor core;
FPGA fabric being in communication with the FPGA including a portion of a programming code for the system; and
an operating system being in communication with each of the microprocessor cores in each of the computers to run a remaining portion of the programming code.
12. The server system of claim 11, wherein the memory is embedded in the microprocessor core.
13. The server system of claim 11, wherein the memory is connected to the processor with a dual port.
14. The server system of claim 11, wherein the operating system further comprises built-in logic analyzer functions combined with standard software debug tools to provide mixed-mode debugging.
15. The server system of claim 11, further comprising a graphical interface to create a schematic using symbols and connectors to mimic the FPGA and functional operations.
16. The server system of claim 15, wherein the graphical interface displays the schematic in a relational form of the actual FPGA used in the operating system with the functional operations.
17. The server system of claim 16, wherein the graphical interface converts the relational form of the schematic into binary code.
18. The server system of claim 11, wherein the computers are hybrid computers connected together to form a hybrid computing server.
19. The server system of claim 18, wherein the hybrid computing server is configured to form a multilevel memory model.
20. The server system of claim 19, wherein the multilevel memory model includes a plurality of memory layer levels.
PCT/US2007/084723 2006-11-14 2007-11-14 Hybrid computing platform having fpga components with embedded processors WO2008061162A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US86575606P 2006-11-14 2006-11-14
US60/865,756 2006-11-14

Publications (1)

Publication Number Publication Date
WO2008061162A1

Family

ID=39402010

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/084723 WO2008061162A1 (en) 2006-11-14 2007-11-14 Hybrid computing platform having fpga components with embedded processors

Country Status (1)

Country Link
WO (1) WO2008061162A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5892962A (en) * 1996-11-12 1999-04-06 Lucent Technologies Inc. FPGA-based processor
US6339819B1 (en) * 1997-12-17 2002-01-15 Src Computers, Inc. Multiprocessor with each processor element accessing operands in loaded input buffer and forwarding results to FIFO output buffer
US20030212853A1 (en) * 2002-05-09 2003-11-13 Huppenthal Jon M. Adaptive processor architecture incorporating a field programmable gate array control element having at least one embedded microprocessor core

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ARUN PATEL ET AL.: "A Scalable FPGA-based Multiprocessor", FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES, 2006. FCCM '06. 14TH ANNUAL IEEE SYMPOSIUM, April 2006 (2006-04-01), pages 111 - 120, XP031022165 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11150948B1 (en) 2011-11-04 2021-10-19 Throughputer, Inc. Managing programmable logic-based processing unit allocation on a parallel data processing platform
US11928508B2 (en) 2011-11-04 2024-03-12 Throughputer, Inc. Responding to application demand in a system that uses programmable logic components
US11915055B2 (en) 2013-08-23 2024-02-27 Throughputer, Inc. Configurable logic platform with reconfigurable processing circuitry
CN116248705A (en) * 2022-11-29 2023-06-09 宜昌测试技术研究所 Multichannel image transmission and processing system of miniature photoelectric pod
CN116248705B (en) * 2022-11-29 2024-03-15 宜昌测试技术研究所 Multichannel image transmission and processing system of miniature photoelectric pod

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07864420

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07864420

Country of ref document: EP

Kind code of ref document: A1