CN112416853A

CN112416853A - Stacked programmable integrated circuit system with intelligent memory

Info

Publication number: CN112416853A
Application number: CN202010579137.2A
Authority: CN
Inventors: S.阿特萨特
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2019-08-20
Filing date: 2020-06-23
Publication date: 2021-02-26
Also published as: US20220231689A1; US10749528B2; US20190379380A1; US11296705B2; US20210058086A1; EP3783649A1

Abstract

Circuitry is provided that includes a programmable fabric with fine-grained routing wires and a separate programmable coarse-grained routing network that provides increased bandwidth, low latency, and deterministic routing behavior. The programmable fabric may be implemented on a top die stacked on an active interposer die. The programmable coarse-grained routing network and smart memory circuitry may be implemented on the active interposer die. The smart memory circuitry may be configured to perform higher-level functions than simple read and write operations. The smart memory circuitry may implement command-based low cycle count operations using a state machine without requiring execution of program code, complex microcontroller-based multi-cycle operations, and other non-generic microcontroller-based smart RAM functions.

Description

Stacked programmable integrated circuit system with intelligent memory

Background

The present invention relates generally to integrated circuits, and more particularly to programmable integrated circuits.

A programmable integrated circuit is one type of integrated circuit that can be programmed by a user to implement desired custom logic functions. In a typical scenario, a logic designer uses computer-aided design tools to design a custom logic circuit. When the design process is complete, the computer aided design tool generates configuration data. The configuration data is then loaded into memory elements on the programmable integrated circuit device to configure the device to perform the functions of the custom logic circuit. Such types of programmable integrated circuits are sometimes referred to as Field Programmable Gate Arrays (FPGAs).

Multi-chip integrated circuit packages typically include an FPGA die mounted on top of an active interposer. The active interposer may contain memory. The bandwidth and latency of the interface connecting the FPGA die to the interposer memory are limited by the number of connections available between the FPGA die and the active interposer. Existing interposer memories have limited usage models and can support only a small range of applications.

The embodiments described herein are presented within this context.

Drawings

Fig. 1 is a diagram of an illustrative programmable integrated circuit system (circuit) in accordance with an embodiment.

Fig. 2 is a cross-sectional side view of an illustrative 3-dimensional (3D) stacked multi-chip package, in accordance with an embodiment.

Fig. 3A is a perspective view of an array of logical fabric segments mounted over an array of smart memory segments, in accordance with an embodiment.

Fig. 3B is a perspective view illustrating how input-output drivers on a logical fabric segment may be aligned with input-output drivers on a corresponding smart memory segment, according to an embodiment.

FIG. 4 is a diagram of an illustrative logical fabric segment, according to an embodiment.

FIG. 5 is a diagram of an illustrative smart memory segment, according to an embodiment.

FIG. 6 is a diagram of an illustrative intelligent memory group, according to an embodiment.

FIG. 7 is a diagram illustrating how specialized functional blocks may be embedded within an array of smart memory blocks, according to an embodiment.

FIG. 8 is a diagram illustrating how a programmable coarse-grained routing network may be provided with multiple n-bit lanes, according to an embodiment.

Fig. 9A is a circuit diagram of an illustrative programmable 4-port switchbox circuit, according to an embodiment.

Fig. 9B is a circuit diagram of an illustrative programmable 3-port junction box circuit, according to an embodiment.

FIG. 10 is a diagram of an illustrative smart memory block, according to an embodiment.

FIG. 11 is a diagram illustrating various modes that can be supported by the smart memory block of FIG. 10, according to an embodiment.

Detailed Description

The present embodiments relate to programmable integrated circuits, and in particular to programmable integrated circuits (e.g., field programmable gate arrays) stacked on active interposers containing distributed smart memory arrays. The term "smart" refers to the ability of a memory to perform higher-level functions than simple read and write operations and to perform sequences of operations not typically supported by a typical microcontroller.

Smart memories may use a built-in state machine to perform higher-level, low cycle count operations (e.g., in-memory updates, in-memory comparisons, simple linked list traversals, content addressable memory operations, cache operations, etc.), or may behave as microcontrollers that perform complex multi-cycle data movement patterns and operations (e.g., complex data placement operations, complex linked list traversals, direct memory access controller operations, FPGA logic controller operations, etc.) and other smart memory functions that are not typically optimized in a typical microcontroller. State machines may be faster and more specialized than microcontrollers, while microcontrollers are relatively slower and more general. The smart memory allows the IC package to keep operations within the active interposer for as long as possible without having to cross over to the FPGA die, which further improves computer performance while consuming less power.
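As an illustration of the command-based, low cycle count operations described above, the following behavioral sketch models a memory block that decodes commands itself instead of exposing only read and write. The command names, the `SmartRam` class, and its interface are hypothetical and are used here only for illustration; they are not taken from the embodiments.

```python
from enum import Enum, auto


class Command(Enum):
    """Hypothetical command set; the names are illustrative only."""
    READ = auto()
    WRITE = auto()
    UPDATE_ADD = auto()   # in-memory update: mem[addr] += operand
    COMPARE = auto()      # in-memory comparison: mem[addr] == operand
    CAM_LOOKUP = auto()   # content-addressable lookup of operand


class SmartRam:
    """Toy behavioral model of a RAM block that decodes commands itself."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def execute(self, command, addr=0, operand=0):
        if command is Command.READ:
            return self.mem[addr]
        if command is Command.WRITE:
            self.mem[addr] = operand
            return None
        if command is Command.UPDATE_ADD:
            # The read-modify-write happens inside the memory block, so the
            # requester issues one command instead of a read and a write.
            self.mem[addr] += operand
            return self.mem[addr]
        if command is Command.COMPARE:
            return self.mem[addr] == operand
        if command is Command.CAM_LOOKUP:
            # Return the first matching address, or None when nothing matches.
            for index, value in enumerate(self.mem):
                if value == operand:
                    return index
            return None
        raise ValueError(f"unsupported command: {command}")


ram = SmartRam(depth=8)
ram.execute(Command.WRITE, addr=3, operand=10)
updated = ram.execute(Command.UPDATE_ADD, addr=3, operand=5)
matches = ram.execute(Command.COMPARE, addr=3, operand=15)
found_at = ram.execute(Command.CAM_LOOKUP, operand=15)
```

In this sketch each command completes in a single call, which loosely mirrors a state machine that finishes an operation in a small, fixed number of cycles without fetching program code.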

Distributed smart memory arrays may be interconnected using a configurable coarse-grained routing network that provides deterministic pre-wired routing interconnects, giving guaranteed timing closure and register pipelining at fixed locations to meet target maximum operating frequencies in a wide range of computing applications. The use of distributed smart memory arrays and programmable coarse-grained routing networks within active interposers provides a tangible improvement to computer technology by providing more flexible and efficient utilization of interposer memory, by enabling smart memory to support a wide variety of complex use cases via an evolvable Intellectual Property (IP) library model, and by increasing the effective memory bandwidth by a factor of 2-4X.

It will be recognized by one skilled in the art that the present exemplary embodiments may be practiced without some or all of these specific details. In other instances, well known operations have not been described in detail so as not to unnecessarily obscure the present embodiments.

An illustrative embodiment of a programmable integrated circuit system 100, such as a Programmable Logic Device (PLD) or a Field Programmable Gate Array (FPGA), that may be configured to implement a circuit design is shown in fig. 1. As shown in fig. 1, circuitry 100 may include a two-dimensional array of functional blocks including, for example, Logic Array Block (LAB) 110 and other functional blocks such as Random Access Memory (RAM) block 130 and Digital Signal Processing (DSP) block 120.

Functional blocks such as LABs 110 can include smaller programmable regions (e.g., logic elements, configurable logic blocks, or adaptive logic modules) that receive input signals and perform custom functions on the input signals to produce output signals. LABs 110 can also be grouped into larger programmable regions (sometimes referred to as logic segments) that are individually managed and configured by corresponding logic segment managers. The grouping of programmable logic resources on device 100 into logic segments, logic array blocks, logic elements, or adaptive logic modules is merely illustrative. In general, circuitry 100 may include any suitable size and type of functional logic blocks, which may be organized according to any suitable hierarchy of logical resources.

Circuitry 100 may include programmable memory elements. These memory elements may be loaded with configuration data (also referred to as programming data). Once loaded, the memory elements each provide corresponding static control signals that control the operation of an associated functional block (e.g., LAB 110, DSP 120, RAM 130, etc.). In a typical scenario, the output of the loaded memory element is applied to the gates of the metal-oxide-semiconductor transistors in the functional block, thereby turning certain transistors on or off and thereby configuring the logic (including the wiring paths) in the functional block. Programmable logic circuit elements that can be controlled in this manner include portions of multiplexers (e.g., multiplexers used to form wiring paths in interconnect circuitry), look-up tables, logic arrays, AND, OR, NAND, and NOR logic gates, pass gates, and so forth.
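As a simplified illustration of how loaded configuration bits determine the function of a programmable element, the following sketch models a look-up table whose truth table is held in configuration memory. The helper name `make_lut` and the bit ordering (input 0 as the least significant index bit) are assumptions made for this example.

```python
def make_lut(config_bits):
    """Return a function that evaluates the truth table in config_bits.

    config_bits[i] is the output for the input combination whose binary
    value is i, with input 0 as the least significant bit (an assumed
    ordering chosen for this sketch).
    """
    num_inputs = (len(config_bits) - 1).bit_length()
    if len(config_bits) != 1 << num_inputs:
        raise ValueError("config_bits length must be a power of two")

    def lut(*inputs):
        if len(inputs) != num_inputs:
            raise ValueError(f"expected {num_inputs} inputs")
        # The inputs act as the select lines of a multiplexer over the bits.
        index = sum(bit << position for position, bit in enumerate(inputs))
        return config_bits[index]

    return lut


# The same structure implements different gates depending on the loaded bits.
nand2 = make_lut([1, 1, 1, 0])
xor2 = make_lut([0, 1, 1, 0])
```

Loading a different bit pattern changes the implemented logic function without changing the underlying circuit structure, which is the sense in which the memory elements configure the functional block.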

The memory elements may use any suitable volatile and/or non-volatile memory structures, such as Random Access Memory (RAM) cells, fuses, antifuses, programmable read-only memory (PROM) cells, mask-programmed and laser-programmed structures, combinations of these structures, and so forth. Since the memory elements are loaded with configuration data during programming, the memory elements are sometimes referred to as configuration memory, Configuration Random Access Memory (CRAM), or programmable memory elements. Circuitry 100 may be configured to implement a custom circuit design. For example, the configuration RAM can be programmed such that LABs 110, DSPs 120, and RAM 130, as well as programmable interconnect circuitry (i.e., vertical routing channels 140 and horizontal routing channels 150), form a circuit design implementation.

In addition, the programmable logic device may also include input-output (I/O) elements (not shown) for driving signals off of circuitry 100 and for receiving signals from other devices. The input-output elements may include parallel input-output circuitry, serial data transceiver circuitry, differential receiver and transmitter circuitry, or other circuitry for connecting one integrated circuit device to another integrated circuit device.

As described above, circuitry 100 may also include programmable interconnect circuitry in the form of vertical routing channels 140 and horizontal routing channels 150, each routing channel including at least one track that routes at least one wire. If desired, the interconnect circuitry may include pipeline elements, and the content stored in these pipeline elements may be accessed during operation.

Note that other routing topologies besides the topology of the interconnect circuitry depicted in fig. 1 are intended to be included within the scope of the present invention. For example, a routing topology may include wires that run diagonally or that run horizontally and vertically along different parts of their extent, as well as wires that are perpendicular to the plane of the device in the case of a three-dimensional integrated circuit. The driver of a wire may also be located at a point other than one end of the wire. The routing topology may include global wires that span substantially all of circuitry 100, partial global wires (such as wires that span portions of circuitry 100), staggered wires of a particular length, smaller local wires, or any other suitable arrangement of interconnection resources.

As described above in connection with fig. 1, circuitry 100 may implement a programmable integrated circuit, such as a Field Programmable Gate Array (FPGA). Configurations in which an FPGA is coupled to a distributed intelligent memory array via a dedicated programmable coarse-grained routing network may sometimes be described herein as examples. However, this is merely illustrative. In general, the structures, methods, and techniques described herein may be extended to other suitable types of integrated circuits.

Vertical routing wires 140 and horizontal routing wires 150 used to interconnect various functional blocks within the FPGA are sometimes referred to as "fine-grained" routing wires. Fine-grained routing wires may be programmed with bit-level granularity. However, because the speed of external input-output interface protocols continues to double every two to three years, performance improvements of fine-grained FPGA routing interconnects are limited by semiconductor parasitics (i.e., parasitic capacitance and resistance) and by metal width and spacing requirements, all of which limit maximum frequency (Fmax) gains. Likewise, since fine-grained routing is used to distribute both local and global wires, packing large coherent bus networks together will reduce the number of routing wires available for connectivity between logic elements of conventional FPGA logic.

FPGAs can also be provided with a dedicated fixed-function network on chip (NoC) fabric, which can give higher bandwidth capacity, but is subject to additional overhead and trade-offs. For example, the NoC fabric interconnect includes additional overhead required to implement credit throttling, backpressure, and the required bridging of NoC-based protocols, such as the AXI NoC interface protocol. Other problems associated with NoC-based fabrics are that their routing may be non-deterministic, and that bandwidth allocation is inflexible and complex.

As integrated circuit technology scales towards smaller device sizes, device performance continues to improve at the expense of increased power consumption. In an effort to reduce power consumption, more than one die may be placed within a single integrated circuit package (i.e., a multi-chip package). Since different types of devices are required to meet different types of applications, in some systems more dies may be required to meet the requirements of high performance applications. Thus, to achieve better performance and higher density, an integrated circuit package may include multiple dies arranged laterally along the same plane, or may include multiple dies stacked on top of each other (sometimes referred to as a 3-dimensional or "3D die stack").

Technologies such as 3D stacking have helped FPGAs keep pace and scale with external IO interface protocols by using one of the stacked dies to expand memory capacity, computing power, and interconnect capacity, which opens a new dimension for building heterogeneous products. FIG. 2 shows a cross-sectional side view of an illustrative multi-chip package 200 that includes a package substrate 206, an active interposer die 204 mounted on package substrate 206, and an integrated circuit (IC) die 201 mounted on top of active interposer 204. As shown in fig. 2, an FPGA fabric 202 (which can include programmable logic 110, DSP blocks 120, RAM blocks 130, and associated CRAM cells) may be formed within the top IC die 201.

Microbumps 212 may be formed between die 201 and die 204 to help couple circuitry on die 201 to circuitry on die 204. Bumps, such as controlled collapse chip connection (C4) bumps 214 (sometimes referred to as flip-chip bumps), may be formed at the bottom surface of the interposer 204. In general, the C4 bumps 214 (e.g., bumps for interfacing with components outside the package) are substantially larger in size than the microbumps 212 (e.g., bumps for interfacing with other dies within the same multi-chip package). The number of microbumps 212 is also typically much larger than the number of flip-chip bumps 214 (e.g., the ratio of the number of microbumps to the number of C4 bumps may be greater than 2:1, 5:1, 10:1, etc.).

In particular, the active interposer 204 may include an embedded coarse-grained routing network, such as programmable coarse-grained routing network 220, and smart memory circuitry 230. The programmable coarse-grained routing network 220 can be used to address the needs of programmable IC designs that require deterministic global routing interconnects and/or NoC-type fabric networks. The fine-grained routing wires that traditionally implement local and global routing within the FPGA fabric 202 have variable, programmable routing lengths and pipelining locations (i.e., the fine-grained routing wires have non-fixed lengths and pipelining locations). A design compiler tool for compiling an FPGA design must attempt to meet the target Fmax requirements without any guarantee. Shorter fine-grained wires are cascaded together to form longer wires and must reach a reasonably close register to meet timing requirements. Furthermore, the bits of a multi-bit bus in a fine-grained routing configuration may all take different routing paths before reaching the same final destination. The paths taken may also vary from one design compilation to another. In other words, fine-grained routing lacks frequency determinism. This is because fine-grained routing is not predefined in terms of how it is routed, so the compiler has multiple degrees of freedom. This results in greater flexibility but less predictable clock frequency for the resulting paths. In contrast, a multi-bit bus in coarse-grained routing network 220 exhibits relatively high frequency determinism in the sense that the coarse-grained routing channels and the smart RAM blocks are all designed to run at a particular frequency, and the bits of the bus may all take the same routing path on the interposer die.

In contrast to fine-grained routing wires, the programmable coarse-grained routing network 220 may be programmed at a byte-level, word-level, or other multi-bit-wide granularity, and has pipelines at fixed locations to meet a target operating frequency. The coarse-grained routing network 220 may also exhibit transport delay determinism, which allows the network 220 to know what state the data is in at each clock cycle, at least at the transport endpoints (such as at the smart RAM blocks or FPGA logic). In other words, the network 220 will be able to determine with certainty at which clock cycle an event will occur. Transport delay determinism is advantageous because it allows each component in the system to be optimized for throughput. The 8-bit granularity provides the lowest common coarse-grained width that supports the needs of different compute variables (8/16/32/64 bits) and the most commonly used memory and IO data path widths. The interconnects within coarse-grained routing network 220 are pre-wired to ensure timing closure (e.g., the routing channels within network 220 are guaranteed in terms of timing and inter-bus skew).
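The notion of transport delay determinism can be illustrated with a short sketch in which a path is described by its hops and each hop either passes through a pipeline register at a fixed location or does not. The function names and the list-of-booleans path representation are hypothetical and are used only for illustration.

```python
def transport_latency_cycles(hops_registered):
    """Return the latency of a path in clock cycles.

    hops_registered holds one boolean per hop: True when the hop ends in a
    pipeline register (adding one cycle), False when the register is bypassed.
    """
    return sum(1 for registered in hops_registered if registered)


def arrival_cycle(launch_cycle, hops_registered):
    """Return the clock cycle at which data launched at launch_cycle arrives."""
    return launch_cycle + transport_latency_cycles(hops_registered)


# A four-hop path in which the second pipeline register is bypassed.
path = [True, False, True, True]
latency = transport_latency_cycles(path)
arrives = arrival_cycle(10, path)
```

Because the register locations are fixed and every bit of a bus follows the same path, the arrival cycle in this model depends only on the chosen path and the launch cycle, not on a particular compilation result.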

By forming coarse-grained routing network 220 as a separate component from top FPGA die 201, any global or long-reach routing paths that span a large number of logic regions (e.g., five or more logic array blocks) can pass down to routing network 220 and then back up to the appropriate remote destination on top die 201. Dedicating coarse-grained routing network 220 to global routing allows the fine-grained routing wires on top die 201 to handle only local or short-reach routing paths. Offloading deterministic, pipelined, coarse-grained routing to active interposer 204 in this way improves integrated circuit performance by: enabling more efficient high-bandwidth data movement within the FPGA circuitry and also on and off the FPGA (since the coarse-grained routing network is designed and optimized to run at the maximum operating frequency Fmax); allowing late binding decisions for FPGA use cases while not precluding higher-level protocol overlays (such as a network on chip); permitting efficient sharing of wires for different independent traffic flows; allowing flexible scalability to achieve the desired parallelism and bandwidth; and providing deterministic data streaming between endpoints using a fixed, pre-wired, pipelined channel structure.
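The division of labor between the two routing resources may be sketched as a simple classification step. The sketch below assumes that the span of a path is measured as the Manhattan distance between source and destination in units of logic array blocks and uses the example threshold of five mentioned above; both the distance measure and the function name are assumptions made for illustration.

```python
GLOBAL_SPAN_THRESHOLD = 5  # example value from the text: five or more LABs


def select_routing_resource(src, dst, threshold=GLOBAL_SPAN_THRESHOLD):
    """Pick a routing resource for a path between two (row, col) positions.

    The span is taken as the Manhattan distance in logic array blocks,
    which is an assumption made for this sketch.
    """
    span = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return "coarse" if span >= threshold else "fine"


local_choice = select_routing_resource((0, 0), (1, 2))
global_choice = select_routing_resource((0, 0), (4, 6))
```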

The example of fig. 2, in which programmable coarse-grained routing network 220 is formed on a separate 3D stacked die relative to FPGA die 201, is merely illustrative. If desired, coarse-grained routing network 220 can be formed in different areas on the same die as fabric 202, can be overlaid on top of fabric 202 in different metal routing layers of the same die, can be formed on a separate 2D side-mounted die, can be formed as part of a different IC package, and so forth. If desired, one or more transceiver dies, high bandwidth memory dies, and other suitable components may optionally be mounted on the active interposer 204 or on the package substrate 206 within the multi-chip package 200.

According to an embodiment, the programmable coarse-grained routing network 220 is coupled to smart memory circuitry 230 within the active interposer 204 (e.g., the smart memory circuitry 230 may communicate with the FPGA fabric 202 via the coarse-grained routing network 220 and the microbumps 212). With this arrangement, the FPGA fabric 202 may be formed over the smart memory circuitry 230. FIG. 3A is a perspective view illustrating an array of logical fabric segments mounted above an array of smart memory segments. As shown in fig. 3A, the FPGA logic fabric 202 may include an array of logical fabric segments 300, while the smart memory circuitry 230 may include an array of smart memory segments 302. As indicated by communication path 304, each logical fabric segment 300 may be coupled to and communicate with a corresponding smart memory segment 302 below it.

Fig. 3B illustrates how there may be many distributed connections between each logical fabric segment 300 and the corresponding smart memory segment 302. The connections may be formed using microbump driver/receiver pairs coupled via microbumps between the top FPGA die and the active interposer. In the example of fig. 3B, the drivers and receivers may be evenly distributed within each segment, or may be grouped together into rows or columns (see, e.g., input-output circuit column 350 in segment 300 and input-output circuit column 352 in segment 302). As long as the microbumps themselves are aligned, the microbump driver/receiver locations on segment 300 and segment 302 may or may not be aligned. There may be 2000-4000 connections between each fabric segment 300 and each smart memory segment 302 (as an example). In other suitable embodiments, there may be at least 1000 connections, more than 4000 connections, five hundred to ten thousand connections, or any suitable number of connections linking segment 300 to segment 302. The number of connections may be adjusted according to the technology deployed to implement a particular application.

FIG. 4 is a diagram of an illustrative logical fabric segment 300, in accordance with an embodiment. As shown in fig. 4, the logical fabric segment 300 may include logic circuitry 400 (e.g., logic array blocks with microbump drivers and receivers distributed throughout or grouped into rows or columns), Random Access Memory (RAM) blocks 402, and DSP blocks 404. The RAM block 402 may or may not correspond to the RAM block 130 shown in fig. 1, while the DSP block 404 may or may not correspond to the DSP block 120 of fig. 1. In one suitable arrangement, the logical fabric segment 300 may include multiple stripes of DSP blocks 404, multiple stripes of RAM blocks 402, and stripes of logic, with the microbump drivers distributed among the stripes of DSP and RAM blocks. As described above in connection with fig. 3B, the location of the microbump drivers/receivers is not critical as long as the microbumps can be connected to the appropriate drivers/receivers via on-chip wiring. If desired, proper alignment of the microbump drivers/receivers can help reduce signal latency and metal usage between driver-receiver pairs.

FIG. 5 is a diagram of an illustrative smart memory segment 302. As shown in fig. 5, smart memory segment 302 may include a plurality of smart memory groups 500. FIG. 6 further illustrates a logical layout in which each smart memory group 500 may include a 4x4 array of smart RAM blocks 600. This configuration is merely illustrative. In general, each smart memory group 500 may include more or fewer smart RAM blocks 600 than shown, arranged in a square footprint, a rectangular footprint, or an irregularly shaped footprint.

As shown in FIG. 6, the smart RAM blocks 600 within a smart memory segment may be interconnected using a regular grid of coarse-grained routing paths 220', which are part of the programmable coarse-grained routing network 220 described in connection with FIG. 2. The coarse-grained routing paths 220' may be bundled into groups of wires, which are then switched together using switch boxes (SB) 290 and connection boxes (CB) 292. Switch box 290 may be configured to statically route signals throughout the coarse-grained routing network and to optionally pipeline the bundles of wires. Connection box 292 may act as a local switch that connects the coarse-grained routing network to the respective smart RAM blocks 600. Both switch box 290 and connection box 292 (and likewise the local group of smart RAM blocks 600) can be statically configured per usage model, and can also be reconfigured quickly and dynamically when switching between different use cases.

FIG. 7 is a diagram that logically illustrates how other specialized functional blocks may be embedded within the array of smart memory blocks 600. As shown in fig. 7, other special-function intellectual property (IP) blocks, such as blocks 702, 704, 706, and 708, may be inserted in place of the smart RAM blocks. These specialized functional blocks may be hardened for improved efficiency.

For example, block 702 may provide protocol bridging and global routing control, while block 704 may provide a global routing buffer that supports a protocol-based network on chip (NoC) overlaid on top of a coarse-grained routing network. Block 706 may be a Direct Memory Access (DMA) controller that generates address and command signals for orchestrating data movement between the various intelligent RAM blocks. Block 708 may be a general purpose microcontroller operable to handle thermal management functions and/or other more complex/advanced or specialized functions.

If desired, the functionality of one or more of hardened IP blocks 702, 704, 706, and 708 may be fully implementable by smart memory block 600 itself. The exemplary dedicated function IP blocks 702, 704, 706, and 708 of fig. 7 are merely illustrative and are not intended to limit the scope of the present embodiments. In general, other types of hardened IP blocks may also be included in the array of smart RAM blocks 600 to provide the desired embedded functionality.

FIG. 8 is a diagram showing how a programmable coarse-grained routing network may be provided with multiple n-bit lanes. As shown in fig. 8, each switch box circuit 290 from fig. 7 may include m instances of individual switch boxes 290'. Each individual switch box 290' may be coupled to each of four adjacent switch boxes 290' via a set of incoming and outgoing n-bit buses. An exemplary value for the number of lanes may be 8 (e.g., m = 8), and the width of the lanes may be 32 bits (e.g., n = 32). This is merely illustrative. The actual values of m and n may be determined and adjusted on a per-implementation basis according to the wire allocation for the smart memory functionality. The coarse-grained routing network may also have multiple different channel widths (e.g., some channels may convey n1 bits, while other channels may convey n2 bits, etc.) to accommodate efficient mapping of certain classes of smart memory interface types.

In the example of fig. 8, a first switch box 290' in switch box circuitry 290-1 may be coupled to a first switch box 290' in switch box circuitry 290-2 via a first channel 802-1; a second switch box 290' in switch box circuitry 290-1 may be coupled to a second switch box 290' in switch box circuitry 290-2 via a second channel 802-2; and so on, such that the mth switch box 290' in switch box circuitry 290-1 may be coupled to the mth switch box 290' in switch box circuitry 290-2 via the mth channel 802-m. The channels are wired both in the horizontal direction (linking switch boxes arranged along the same row) and in the vertical direction (linking switch boxes arranged along the same column).

In some embodiments, the channel routing may be granular at the byte level and may be combinable into multiple groups. In one suitable arrangement, the coarse-grained routing interconnects may be divided into four independent groups: (1) a first group of 16 channels; (2) a second group of 8 channels; (3) a third group of 4 channels; and (4) a fourth group of 4 channels. Assuming each channel carries 8 bits in either direction, this configuration provides four independent networks of 16 GBps, 8 GBps, 4 GBps, and 4 GBps, respectively. Different user designs may select different channel assignments based on their unique requirements.

In another suitable arrangement, the coarse-grained routing interconnects may be divided into two independent groups: (1) a first group of 16 channels, and (2) a second group of 16 channels. This configuration provides two independent networks each providing 16 GBps. In yet another suitable arrangement, the coarse-grained routing interconnects may be divided into three independent groups: (1) a first group of 16 channels; (2) a second group of 12 channels; and (3) a third group of 4 channels. This configuration provides three independent networks of 16 GBps, 12 GBps, and 4 GBps, respectively.
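The bandwidth figures for these channel groupings can be reproduced with a short calculation. The sketch below assumes that each 8-bit channel transfers one byte per clock cycle at a 1 GHz clock, which yields 1 GBps per channel; the 1 GHz clock is an assumption made here because it matches the stated figures and is not specified in the text. The function names are hypothetical.

```python
def group_bandwidth_gbps(channels, bits_per_channel=8, clock_ghz=1.0):
    """Return the bandwidth of one channel group in gigabytes per second.

    clock_ghz = 1.0 is an assumed value chosen to match the stated figures.
    """
    return channels * bits_per_channel * clock_ghz / 8


def partition_bandwidths(groups, total_channels=32):
    """Return the bandwidth of each independent network in a partition.

    Each example partition in the text allocates 32 channels in total.
    """
    if sum(groups) != total_channels:
        raise ValueError("groups must use every channel exactly once")
    return [group_bandwidth_gbps(channels) for channels in groups]


four_networks = partition_bandwidths([16, 8, 4, 4])
two_networks = partition_bandwidths([16, 16])
three_networks = partition_bandwidths([16, 12, 4])
```

Under these assumptions every partition totals 32 GBps, so the grouping only changes how the same aggregate bandwidth is divided among independent networks.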

These channel assignments are merely illustrative. In general, m can be any preselected integer, and the m channels can be divided into any suitable number of groups depending on the needs of the application. The example of allocating channels among 8-bit buses is also merely illustrative. If desired, each bus may carry 4 bits (sometimes referred to as a "nibble"), 2 bits, 2-8 bits, more than 8 bits, 16 bits, 8-16 bits, more than 16 bits, 32 bits, 16-32 bits, more than 32 bits, 64 bits, 32-64 bits, more than 64 bits, or another suitable number of bits.

Fig. 9A is a circuit diagram of an illustrative programmable 4-port switch box circuit 290' in accordance with an embodiment. Each switch box 290' that is not located at the edge of the coarse-grained routing network 220 may include up to four data path routing multiplexers 902 that receive and transmit routing channels in each direction (e.g., north to south, south to north, west to east, and east to west). As shown in fig. 9A, the first data path routing multiplexer 902W may have a first ("0") input connected to the horizontal interconnect from the west (W), a second ("1") input coupled to node FN (i.e., the output of multiplexer 902N), a third ("2") input coupled to node FS (i.e., the output of multiplexer 902S), a fourth ("3") input receiving a signal from the FPGA fabric in the top die, and an output driving node FW. The output of data path routing multiplexer 902W may be latched using a corresponding pipeline register 950. Depending on the distance between adjacent switch boxes 290', pipeline register 950 may be statically bypassed.

The data path routing multiplexer 902W may be controlled using a selector multiplexer 904W. The selector multiplexer 904W may have: a first ("0") input configured to receive static control bits from an associated configuration unit or register embedded locally in the active interposer; and a second ("1") input configured to receive a control signal from the FPGA fabric in the top die. The static control bits stored in each configuration unit may be configurable at run time. Arranged in this manner, data path routing multiplexer 902W may select its "0" input to continue existing signal routing from the west, select between two vertically oriented routing channels (i.e., by picking the "1" or "2" input), or select data from the FPGA fabric (i.e., by picking the "3" input).

Each of the four directions may be arranged in a similar manner using the second data path routing multiplexer 902N to drive the node FN from north, the third data path routing multiplexer 902E to drive the node FE from east, and the fourth data path routing multiplexer 902S to drive the node FS from south. Multiplexer 902N may be controlled by selector multiplexer 904N. Multiplexer 902E may be controlled by selector multiplexer 904E. The multiplexer 902S may be controlled by a selector multiplexer 904S. The detailed wiring and connections are shown in fig. 9A. Connected as such, the FPGA may provide both data inputs for each of the data path routing multiplexers 902 and control inputs for the selector multiplexers 904. This allows the logic fabric in the top die FPGA to act as a dynamic router.
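
As a purely illustrative aid, the selection behavior of one direction of the switch box can be sketched as a small behavioral model. This is a software sketch under stated assumptions and not the disclosed circuit: the function names, the argument names, and the way the select input of multiplexer 904 is driven (`sel_904`) are hypothetical.

```python
# Behavioral sketch of the west-facing path of the 4-port switch box (Fig. 9A).
# All function and parameter names are hypothetical illustrations.

def select_control(sel_904: int, static_bits: int, fabric_ctrl: int) -> int:
    """Selector multiplexer 904W: input 0 carries the static configuration
    bits, input 1 carries the control signal from the FPGA fabric."""
    return static_bits if sel_904 == 0 else fabric_ctrl


def route_west(ctrl: int, w_in: int, fn: int, fs: int, fabric_data: int) -> int:
    """Data path routing multiplexer 902W driving node FW.
    0 = continue from west, 1 = node FN, 2 = node FS, 3 = FPGA fabric data."""
    return (w_in, fn, fs, fabric_data)[ctrl & 0b11]


# Static configuration: keep forwarding the incoming west channel.
fw_static = route_west(select_control(0, static_bits=0, fabric_ctrl=3),
                       w_in=0xA1, fn=0xB2, fs=0xC3, fabric_data=0xD4)

# Dynamic routing: the fabric takes over the selection and injects its own data.
fw_dynamic = route_west(select_control(1, static_bits=0, fabric_ctrl=3),
                        w_in=0xA1, fn=0xB2, fs=0xC3, fabric_data=0xD4)
```

In this sketch the same data path multiplexer serves both the statically configured case and the case in which the fabric acts as a dynamic router; only the source of the 2-bit control value changes.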

The various multiplexers 902 and 904 in FIG. 9A may be statically configured per use case and can be quickly updated at runtime to implement multiple use cases in a time domain multiplexed manner. For example, the active interposer may be configured in a first mode during a first period to maximize bandwidth when populating a smart memory block with data from a Double Data Rate (DDR) memory outside of the package; may be configured in a second mode during a second period to maximize bandwidth when sorting or rearranging data among the array of smart memory blocks; and may be configured in a third mode during a third period to maximize bandwidth when the coarse-grained routing network is fed with control signals from the FPGA logic fabric. Since the multiplexers 904 receive inputs from the FPGA fabric, routing can be dynamically configured using the FPGA logic fabric itself, without requiring complete or partial reconfiguration of the device.

The FPGA fabric in the top die and the smart memory circuitry in the active interposer may share a common clock input, but this sharing is not required. The common clock signal may allow for fully deterministic behavior between the smart memory array and the logic fabric array. At power-up or system reset, the default connectivity scheme may allow the switch box 290' closest to the system controller (e.g., the secure device manager on the FPGA) to be reached at a given control address and then be switched to reach its neighbors. This process may be performed iteratively to traverse the entire coarse-grained routing network.
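
The iterative bring-up described above resembles a breadth-first expansion over the grid of switch boxes. The following is a minimal sketch of that idea only; the grid coordinates, the function name `bring_up_order`, and the assumption that the controller sits next to the box at position (0, 0) are hypothetical and are not taken from the embodiments.

```python
from collections import deque

def bring_up_order(rows, cols, start=(0, 0)):
    """Return the order in which switch boxes are reached, beginning with the
    box nearest the system controller and expanding to its neighbors."""
    reached = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        # Once a box is configured, its north/south/west/east neighbors
        # become reachable through it.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (r + dr, c + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and nxt not in seen:
                seen.add(nxt)
                reached.append(nxt)
                queue.append(nxt)
    return reached

order = bring_up_order(2, 3)
```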

Fig. 9B is a circuit diagram of an illustrative programmable connection box circuit 292' (see also fig. 6). The connection box 292' may be a 3-port version of the switch box 290', in which additional channel multiplexing allows greater configurability in selecting which coarse-grained routing network channel is connected to the corresponding smart RAM block 600. In particular, the connection box 292' should provide the following capability: in response to the local smart RAM block detecting its address, data from the associated smart RAM block 600 is placed onto the coarse-grained routing network channel. This allows the smart memory to be arranged in columns, creating deeper direct access memory.

As shown in fig. 9B, data path routing multiplexers 902N and 902S are connected in the same manner as already described in connection with fig. 9A. In contrast, however, multiplexer 902' has only a first ("0") data input coupled to node FN (i.e., the output of multiplexer 902N) and a second ("1") data input coupled to node FS (i.e., the output of multiplexer 902S). In contrast to switch box 290' of fig. 9A, connection box 292' also completely lacks multiplexer 902E. The east port of connection box 292' is coupled to an associated smart RAM block. Arranged in this manner, the 3-port connection box 292' allows each associated smart RAM block to be coupled to the coarse-grained routing network. The connection box 292' may also include a control circuit 950 that receives a valid signal, which enables the smart RAM block to provide data onto the n-bit lanes of the coarse-grained routing network. Operating in this manner, different smart RAM blocks are able to provide data over the same coarse-grained routing wires in different clock cycles.
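
The valid-gated sharing of one routing lane by several smart RAM blocks can be illustrated with a short behavioral sketch. It assumes, purely for illustration, that each connection box either forwards the upstream lane value or replaces it with its local RAM data when valid is asserted; the names `connection_box` and `column_output` are hypothetical.

```python
def connection_box(upstream, ram_data, valid):
    """Forward the upstream lane value unless the local smart RAM block
    asserts valid, in which case its data is placed on the lane."""
    return ram_data if valid else upstream


def column_output(ram_outputs, valid_flags, incoming=None):
    """Chain the connection boxes of one column along a single lane."""
    lane = incoming
    for data, valid in zip(ram_outputs, valid_flags):
        lane = connection_box(lane, data, valid)
    return lane


ram_outputs = [0x11, 0x22, 0x33]
# Different blocks drive the same wires in different clock cycles.
cycle0 = column_output(ram_outputs, [True, False, False])
cycle1 = column_output(ram_outputs, [False, True, False])
# With no block asserting valid, the incoming channel passes through.
cycle2 = column_output(ram_outputs, [False, False, False], incoming=0x99)
```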

FIG. 10 is a diagram of an illustrative smart memory block 600, according to an embodiment. As shown in fig. 10, smart RAM block 600 may include an X by Y RAM array (i.e., a memory array that is X elements wide and Y elements deep), power management circuitry (such as power manager 1002), comparison circuitry (such as comparator 1004), addressing circuitry (such as address register 1006), counting circuitry (such as counter 1008), state machine circuitry (such as state machine 1010), priority encoding circuitry (such as priority encoder 1012), program counter 1014, registers (such as X/Y/link register 1016), instruction decoder 1018, and Arithmetic Logic Unit (ALU) 1020.

The RAM array 1000 may be, for example, a standard single port random access memory with address, data in, data out, write enable, and byte enable terminals. The RAM 1000 can exhibit a word width that is selectively sized to match a coarse-grained routing (CGR) network channel size, or an integer multiple of the CGR channel size. The RAM array 1000 may also support Error Correction Codes (ECC) that are capable of detecting and correcting various kinds of internal data corruption. The RAM array 1000 may be a dual or multi-port memory with additional memory control capabilities, if desired.

Power manager 1002 may be configured to manage the power states of smart RAM block 600. For example, if smart RAM block 600 is not used for a configurable period of time, power management circuitry 1002 may power down smart RAM block 600 statically or dynamically. Powering down one or more of the smart RAM blocks 600 can help reduce power consumption at the expense of increased latency.

The address register 1006 may be configured to store an address of the smart RAM block 600. This allows each smart RAM block to respond when it determines that its address is asserted on the input address lines. Thus, the value stored in the address register 1006 may sometimes be referred to as "my address". Operating in this manner, one or more smart RAM blocks may be used to support various data widths and depths at configuration time (e.g., a smart RAM block may be divided into multiple smaller memory sub-blocks, or multiple smart RAM blocks may be configured as a larger memory block). Compare circuitry 1004 (sometimes referred to as an equality block) may be used to compare the stored "my address" register value with the value provided on the address input. Priority encoder 1012 may be used to support Content Addressable Memory (CAM) operations to extract the address value of a matching data word.

The counter 1008 may be configured to support programmable burst lengths in response to commands requiring a streaming response. The state machine 1010 may be configured to sequence command responses. Having an embedded state machine 1010 allows the smart RAM block 600 to perform low cycle count operations without executing program code of the kind typically required by microcontrollers.

In addition to operating as a command-based state machine, the smart RAM block 600 may be further configured as a microcontroller that performs more complex multi-cycle operations with a higher cycle count than state machine driven operations. In the example of fig. 10, smart RAM block 600 includes microcontroller circuitry 1050, which microcontroller circuitry 1050 has program counter 1014, X/Y/link register 1016, instruction decoder 1018, and ALU 1020.

Program counter 1014 may be used to provide the address/location of the instruction currently being executed. The instruction decoder 1018 may be configured to interpret the instruction and to set in motion the corresponding task associated with the instruction. ALU 1020 may be a digital circuit configured to perform arithmetic and logical operations. The registers 1016 may represent one or more registers used by the state machine or the microcontroller to hold operands for the ALU, return values for jump commands, and so forth.

Still referring to fig. 10, smart RAM block 600 may be configured to receive and output a plurality of interface signals. For example, smart RAM block 600 may be provided with a Command (CMD) input port (e.g., 8-bit input terminal) that receives a command. The received commands may be interpreted by the state machine 1010 or by the microcontroller circuitry 1050 to implement one or more use cases described below in connection with fig. 11. The block 600 may have an address input port (e.g., 11-bit input terminal) for detecting whether the received address signal is equal to the "my address" stored in the local address register 1006.

The smart RAM block 600 may also include a data input port configured to receive write data or other input data for performing a compare operation. In the example of fig. 10, the data input port is configured to receive 36 bits of data. This is merely illustrative. In general, the intelligent RAM block data input port may be configured to receive 4-bit wide data, 8-bit wide data, 16-bit wide data, 2 to 36 bits of data, 32 to 64 bits of data, more than 64 bits of data, or any suitable bit wide data. Data output or read from smart memory RAM array 1000 may be provided on a data output port. The data output port may be the same or different bit width than the data input port.

Smart RAM block 600 may have a valid input port configured to receive a valid signal indicating whether the signals at the data input port and/or other input terminals are valid. The smart RAM block 600 may also have a valid output port configured to generate a valid signal indicating whether the smart memory block is presenting valid information at its data output port and/or other output terminals. A ready input port is configured to receive a signal indicating whether a corresponding destination smart memory block is able to accept data, and a ready output port is configured to output a signal indicating whether the smart RAM block is able to accept data input.

The smart RAM block 600 may further include: a start of packet (SOP) input port configured to receive a signal indicating the start of a streaming transfer of information; and an SOP output port on which a signal is asserted when the smart memory block initiates a packet transfer. A byte enable input port may receive bits for selectively writing or masking the bits arriving at the data input port.
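
The effect of the byte enable bits on a write can be shown with a small worked sketch. The 32-bit word, the convention that byte enable bit i controls byte i (least significant byte first), and the function name `masked_write` are assumptions made for illustration only.

```python
def masked_write(old_word, new_word, byte_enable, num_bytes=4):
    """Return the stored word after a write in which only the bytes whose
    enable bit is set are replaced by the incoming data."""
    result = old_word
    for i in range(num_bytes):
        if (byte_enable >> i) & 1:
            mask = 0xFF << (8 * i)
            # Clear byte i of the stored word, then merge in the new byte.
            result = (result & ~mask) | (new_word & mask)
    return result

# Only bytes 0 and 2 of the incoming word are written.
updated = masked_write(0x11223344, 0xAABBCCDD, 0b0101)
```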

An Error Correction Code (ECC) status output port may be used to indicate the status of a RAM ECC event (e.g., when one or more erroneous bits have been detected and/or corrected). An operation (op) state input port may be used to indicate when the smart RAM block is communicating with another smart RAM block via the coarse-grained routing network. Further, the op state inputs may be used to synchronize and extend ALU operations across multiple smart RAM blocks (e.g., to extend carry, match, priority encoding, and other suitable operations across a target number of smart memory blocks). Conversely, an operation (op) state output port may be used to indicate the state of the last operation, which may or may not include a signal to inform the associated connection box that the particular smart RAM block should be multiplexed onto the coarse-grained routing network.

The various interface signals described above with respect to the smart RAM block 600 are merely illustrative and are not intended to limit the scope of the present embodiment. If desired, the smart RAM block 600 may include fewer input ports, fewer output ports, more input ports, more output ports, and/or other suitable input-output ports capable of implementing the desired functionality of the state machine 1010 and the microcontroller circuitry 1050.

FIG. 11 is a diagram illustrating a number of different memory operation types that can be supported by a smart RAM block 600 of the type shown in FIG. 10. As shown in fig. 11, the smart RAM block 600 may be operable in at least four modes: (i) simple memory access mode 1100; (ii) state machine driven command-based mode of operation 1102; (iii) a microcontroller mode 1104; and (iv) idle mode 1106. These modes are merely illustrative. If desired, the smart RAM block 600 may be configured to support all of these modes, any subset of these modes, or other suitable modes not typically supported by conventional RAM blocks or general microcontrollers.

In the simple memory access mode 1100, the smart RAM block may be configured to perform direct memory access and streaming memory access. During direct memory access operations, the native protocol of the RAM may be used to carry out normal read and write operations (i.e., one read or write access per memory cycle). This can be done as follows: multiple smart RAM blocks 600 are configured and connected to the required source and destination using the coarse-grained routing network, with the corresponding "my address" fields set appropriately. If desired, smart RAM block 600 may be configured to broadcast to multiple target smart RAM blocks, allowing for variable data widths or memory depths. For different memory depths, each smart RAM block within a particular column may be configured to respond to an offset address according to the depth of each smart RAM block in the column and to provide an op state output to its associated 3-port connection box to insert its result into a coarse-grained routing network channel.
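
The composition of a deeper memory from a column of blocks may be sketched as follows. This is only an illustrative model: it assumes that the "my address" value acts as a base offset for the block and that a block signals a hit through a boolean standing in for the op state output. The class `SmartRamBlock` and the function `column_read` are hypothetical names.

```python
class SmartRamBlock:
    def __init__(self, my_address, depth):
        # Assumption: "my address" is the base offset of this block.
        self.my_address = my_address
        self.mem = [0] * depth

    def hit(self, address):
        """Compare the incoming address against this block's address range."""
        return self.my_address <= address < self.my_address + len(self.mem)

    def write(self, address, value):
        # Every block sees the broadcast write; only the matching one stores it.
        if self.hit(address):
            self.mem[address - self.my_address] = value

    def read(self, address):
        """Return (op_state, data); op_state tells the connection box to
        place this block's data on the routing channel."""
        if self.hit(address):
            return True, self.mem[address - self.my_address]
        return False, None


def column_read(column, address):
    for block in column:
        op_state, data = block.read(address)
        if op_state:
            return data
    return None


# Three 4-deep blocks behave as one 12-deep memory.
column = [SmartRamBlock(i * 4, 4) for i in range(3)]
for block in column:
    block.write(6, 0xBEEF)
value = column_read(column, 6)
```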

During a streaming access operation, bursts of data may be streamed to and from a given smart RAM block. As an example, 256 bytes of data may be streamed per memory cycle. This can be done by using the Command (CMD) interface to request a burst read or write. Similar to direct access, the data width can vary, and bursts longer than the depth of one RAM can be performed via an op state input-output handshake connection with the coarse-grained routing network. The source of the command and the source/destination of the data may be set by the configuration of the coarse-grained routing network, and may be another smart RAM block, the top-die FPGA logic fabric, or some other dedicated function IP block (see fig. 7). Data movement between source and destination can be managed using the SOP, ready, and valid input/output signals. Further, the op state input/output signals may be used to synchronize multiple smart memory blocks.
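
A burst read governed by a programmable counter and a ready signal from the destination can be sketched as below. The sketch assumes one beat is transferred in each cycle in which the destination is ready and that SOP marks the first beat only; the function name `stream_burst` and the per-cycle `ready_pattern` list are hypothetical.

```python
def stream_burst(mem, start, burst_length, ready_pattern):
    """Return the list of (sop, data) beats transferred for one burst read.
    ready_pattern holds the destination's ready signal for each cycle."""
    counter = burst_length      # programmable burst length
    address = start
    beats = []
    for ready in ready_pattern:
        if counter == 0:
            break               # burst complete
        if ready:
            # The source holds valid high; a beat moves only when ready is high.
            beats.append((counter == burst_length, mem[address]))
            address += 1
            counter -= 1
    return beats

mem = list(range(100, 108))
beats = stream_burst(mem, start=2, burst_length=4,
                     ready_pattern=[True, False, True, True, False, True, True])
```

In this sketch a deasserted ready simply stalls the burst for that cycle, so the number of beats is set by the counter rather than by the number of elapsed cycles.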

In the state machine driven mode 1102, the smart RAM block 600 may be configured to perform data updates in memory, data comparisons in memory, simple linked list traversals, Content Addressable Memory (CAM) operations, memory cache operations, and the like. The use of an embedded state machine (e.g., state machine 1010 in fig. 10) enables the smart memory block to carry out these low cycle count operations without the need to execute microcontroller program code.

During a data update operation, the values stored in the smart RAM block may be updated (e.g., once every two memory cycles). Example operations that can be performed during a data update include addition, subtraction, other simple arithmetic operations, logical AND, logical OR, logical NAND, logical NOR, logical XOR, logical XNOR, other simple logic functions, and/or other suitable low cycle count operations. The destination and width of the updated data may be set by the value stored in the "my address" register. Any carry-in data from, or carry-out data to, an adjacent smart RAM block may be handled by issuing appropriate control signals at the op state input-output ports.
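
An in-memory update with carry propagation between adjacent blocks might look like the following sketch. The command names, the 36-bit default width, and the function `data_update` are illustrative assumptions; the carry value stands in for the signaling on the op state ports.

```python
import operator

# Hypothetical command set for the sketch.
UPDATE_OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "AND": operator.and_,
    "OR": operator.or_,
    "XOR": operator.xor,
}

def data_update(mem, address, command, operand, carry_in=0, width=36):
    """Read-modify-write one word in place and return the carry-out bit."""
    mask = (1 << width) - 1
    raw = UPDATE_OPS[command](mem[address], operand)
    if command == "ADD":
        raw += carry_in
    mem[address] = raw & mask   # truncate to the word width
    return (raw >> width) & 1 if command == "ADD" else 0

# Two blocks form one wider word: the low block's carry-out feeds the high block.
low, high = [(1 << 36) - 1], [5]
carry = data_update(low, 0, "ADD", 1)
data_update(high, 0, "ADD", 0, carry_in=carry)
```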

During a data comparison operation, the values stored in the smart RAM block may be compared to a provided value to determine whether a match exists. Example operations that can be performed during the data comparison include a simple comparison or a mask and compare. The values for comparison may be provided at the data input port. The smart RAM block may include additional registers for storing masking bits. Any carry-in data from, or carry-out data to, an adjacent smart RAM block may be handled by issuing appropriate control signals at the op state input-output ports, and the result of the comparison may be provided at the op state output and routed to the desired endpoint via the coarse-grained routing network (as an example).
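
The mask and compare operation can be illustrated with a brief sketch in which only the bit positions selected by the mask take part in the comparison. The function names and the 16-bit example values are hypothetical.

```python
def mask_and_compare(stored, value, mask):
    """Match if the stored word and the provided value agree on every
    bit position selected by the mask."""
    return (stored & mask) == (value & mask)


def find_matches(mem, value, mask):
    """Return the addresses whose contents match under the mask."""
    return [addr for addr, word in enumerate(mem)
            if mask_and_compare(word, value, mask)]


mem = [0x1234, 0x12FF, 0x5634]
upper_byte_matches = find_matches(mem, 0x1200, 0xFF00)   # compare upper byte only
exact_matches = find_matches(mem, 0x1234, 0xFFFF)        # simple comparison
```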

The state machine may also be configured to perform simple linked list traversal by looking up predefined control and next address fields, where the traversed linked list may be contained entirely within a single intelligent RAM block, or may span multiple intelligent RAM blocks. More complex linked-list traversals (e.g., encoded traversals) may be supported only during the microcontroller mode 1104.
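
A simple traversal of this kind can be sketched as follows. The word layout (a last-entry control flag, a next address field, and a payload) is an assumption made for illustration and is not specified by the embodiments; `traverse` and `max_steps` are hypothetical names.

```python
def traverse(mem, head, max_steps=1024):
    """Follow the next address fields from head and collect the payloads.
    Each entry is assumed to be (is_last, next_address, payload)."""
    visited = []
    address = head
    for _ in range(max_steps):      # bound the walk in case of a cycle
        is_last, next_address, payload = mem[address]
        visited.append(payload)
        if is_last:                 # control field marks the end of the list
            break
        address = next_address
    return visited

# Entries stored out of order in the RAM; the list order is a -> b -> c.
mem = {0: (False, 3, "a"), 3: (False, 1, "b"), 1: (True, 0, "c")}
payloads = traverse(mem, head=0)
```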

The state machine may also be configured as a Content Addressable Memory (CAM) where the data match value becomes the address to the RAM array and the compare logic 1004 (fig. 10) determines if the value was found and uses the priority encoder 1012 to identify the bit position where the value was found. The CAM may be configured in a linear mode or a hierarchical mode (as examples). In the linear mode, multiple blocks of smart RAM can be combined together to increase the CAM word size or to extend the bit size. In the hierarchical mode, CAM outputs from one smart RAM block or group of smart RAM blocks can be fed into another smart RAM block or another group of smart RAM blocks, creating a hierarchical CAM. The data width and RAM depth may vary, if desired.
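
One way to picture a RAM-based CAM of this kind is sketched below: the match value addresses the RAM, the word read out holds one bit per CAM entry, and a priority encoder picks the position of a set bit. The choice of the least significant set bit as the highest priority, the 8-bit key width, and the names `RamCam` and `priority_encode` are assumptions for illustration.

```python
def priority_encode(word):
    """Return the position of the least significant set bit, or None."""
    if word == 0:
        return None
    return (word & -word).bit_length() - 1


class RamCam:
    def __init__(self, key_bits=8):
        # One RAM word per possible key; bit i is set if entry i holds the key.
        self.ram = [0] * (1 << key_bits)

    def insert(self, entry, key):
        self.ram[key] |= 1 << entry

    def lookup(self, key):
        """The key is used as the RAM address; return (found, bit position)."""
        word = self.ram[key]
        return word != 0, priority_encode(word)


cam = RamCam()
cam.insert(entry=5, key=0x3C)
cam.insert(entry=2, key=0x3C)
found, position = cam.lookup(0x3C)
miss = cam.lookup(0x10)
```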

The smart RAM state machine may be further configured as a cache memory, where the top-die FPGA is the source of the cache lookup. This may be accomplished by configuring multiple smart memory blocks and the coarse-grained routing network such that results are routed from a smart memory block operating as a tag RAM (e.g., a RAM that holds addresses) to a smart memory block operating as a data RAM. The tag RAMs may use internal comparison functions to determine whether the requested data is currently stored in their local RAM arrays. If there is a match, the associated data may be returned to the FPGA fabric along with the remainder of the tag field.
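
The tag RAM and data RAM arrangement can be sketched with a minimal cache model. A direct-mapped organization is assumed here only to keep the example short; the embodiments do not specify an organization, and the class and method names are hypothetical.

```python
class DirectMappedCache:
    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        lines = 1 << index_bits
        self.tag_ram = [None] * lines    # block acting as tag RAM
        self.data_ram = [None] * lines   # block acting as data RAM

    def _split(self, address):
        """Split an address into (tag, index)."""
        return address >> self.index_bits, address & ((1 << self.index_bits) - 1)

    def fill(self, address, data):
        tag, index = self._split(address)
        self.tag_ram[index] = tag
        self.data_ram[index] = data

    def lookup(self, address):
        """Compare the stored tag with the requested tag; on a hit, the
        data RAM entry is returned to the requester."""
        tag, index = self._split(address)
        hit = self.tag_ram[index] == tag
        return hit, (self.data_ram[index] if hit else None)


cache = DirectMappedCache()
cache.fill(0x123, "line")
hit_result = cache.lookup(0x123)    # same tag and index
miss_result = cache.lookup(0x223)   # same index, different tag
```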

The various state machine driven operations described above are merely illustrative and are not intended to limit the scope of the present embodiments. Other low cycle count operations that are not typically supported by a typical microcontroller and do not require execution of program code may be supported during mode 1102, if desired. For example, one or more smart RAM blocks may be stitched together with the coarse-grained routing network to enable configurable memory width and depth when supporting command-based state machine operations. As another example, one or more smart RAM blocks may be stitched together with the coarse-grained routing network to allow some subset of the smart memory to be configured and integrated with the FPGA fabric to perform unique non-generic microcontroller functions.

In the microcontroller mode 1104, the smart RAM block 600 may be configured to perform complex data rearrangement, Direct Memory Access (DMA) controller functions, complex linked list traversal (as opposed to the "simple" linked list traversal described above in connection with the mode 1102), FPGA logic control, FPGA logic extensions, and so forth.

In a first microcontroller mode, the smart RAM block may act as a DMA controller to rearrange data into the RAM array, enabling efficient access by FPGA logic or efficient access to paged memory in off-package memory (such as DDR memory) (e.g., data may be moved within the smart RAM block, moved across different smart RAM blocks, moved to and from dedicated hard IP blocks within the array of smart RAM blocks, moved to and from external DDR memory, moved to and from top-die FPGA logic, and so on). Exemplary memory accesses include X/Y array swaps, striping fields, fetch fields, sort fields, insert fields, collapse fields, and so forth. These operations may be accomplished by using microcontroller circuitry on the intelligent RAM blocks to generate addresses, or using FPGA logic for a given intelligent RAM block to generate addresses, where coarse-grained routing network channels are used to communicate data from the source intelligent RAM block to the destination intelligent RAM block.
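
As one illustration of address generation for data rearrangement, an X/Y array swap can be treated as a transpose of a row-major array. The sketch below assumes that interpretation; the function names and the flat-list memory model are hypothetical.

```python
def xy_swap_addresses(rows, cols):
    """Yield (source, destination) address pairs that transpose a
    row-major rows x cols array into a row-major cols x rows array."""
    for r in range(rows):
        for c in range(cols):
            yield r * cols + c, c * rows + r


def dma_xy_swap(src, rows, cols):
    """Move every element from the source block to the destination block
    using the generated address pairs."""
    dst = [None] * (rows * cols)
    for s, d in xy_swap_addresses(rows, cols):
        dst[d] = src[s]
    return dst


# A 2 x 3 array [[1, 2, 3], [4, 5, 6]] becomes 3 x 2 [[1, 4], [2, 5], [3, 6]].
swapped = dma_xy_swap([1, 2, 3, 4, 5, 6], rows=2, cols=3)
```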

In a second microcontroller mode, the smart RAM block may be configured to perform complex linked list traversal. In this mode, microcontroller circuitry within the smart RAM block may be used to perform more advanced linked list traversals, such as encoded traversals. In another microcontroller mode, microcontroller circuitry within the smart RAM block may be used to generate control words for FPGA logic configuration. In yet another microcontroller mode, the smart RAM block may be extended by FPGA logic. For example, the FPGA logic fabric may couple custom instructions to one or more smart RAM blocks.

The various microcontroller operations described above are merely illustrative and are not intended to limit the scope of the present embodiments. These microcontroller functions may be secondary to the optimization of the smart RAM block around its other modes, namely its use as composable RAM and the state machine driven functions of mode 1102. A typical microcontroller will not be able to support the smart RAM functionality described in connection with the simple mode 1100 and the state machine driven mode 1102.

Still referring to FIG. 11, the smart RAM block may also be configured in idle mode 1106. Local or short-range wiring may be dominated by FPGA-to-FPGA wiring, while global or long-range wiring may be handled by smart RAM-to-smart RAM wiring, smart RAM-to-peripheral wiring, or network-on-chip wiring. In idle mode, the smart memory is prevented from doing useful work due to the lack of routability in this mode.

The embodiments thus far have been described with respect to integrated circuits. The methods and apparatus described herein may be incorporated into any suitable circuitry. For example, the methods and apparatus may be incorporated into many types of devices, such as programmable logic devices, Application Specific Standard Products (ASSPs), Application Specific Integrated Circuits (ASICs), microcontrollers, microprocessors, Central Processing Units (CPUs), Graphics Processing Units (GPUs), and so forth. Examples of programmable logic devices include Programmable Array Logic (PAL), Programmable Logic Arrays (PLA), Field Programmable Logic Arrays (FPLA), Electrically Programmable Logic Devices (EPLD), Electrically Erasable Programmable Logic Devices (EEPLD), Logic Cell Arrays (LCA), Complex Programmable Logic Devices (CPLD), and Field Programmable Gate Arrays (FPGA), to name a few.

The programmable logic devices described in one or more embodiments herein may be part of a data processing system that includes one or more of the following components: a processor; memory; IO circuitry; and peripheral devices. The data processing system can be used in a wide variety of applications, such as computer networking, data networking, instrumentation, video processing, digital signal processing, or any other suitable application where the advantages of using programmable or reprogrammable logic are desirable. Programmable logic devices can be used to perform a wide variety of different logic functions. For example, the programmable logic device can be configured as a processor or controller that works in cooperation with the system processor. The programmable logic device may also serve as an arbiter for arbitrating access to a shared resource in the data processing system. In yet another example, the programmable logic device can be configured as an interface between the processor and one of the other components in the system.

Although the methods of operation are described in a particular order, it is to be understood that other operations may be performed between the described operations, that the described operations may be adjusted so that they occur at slightly different times, or that the described operations may be distributed in a system that allows the processing operations to occur at various intervals associated with the processing, so long as the processing of the overlay operations is carried out in a desired manner.

Examples:

the examples below relate to further embodiments.

Example 1 is a multi-chip package, the multi-chip package comprising: a package substrate; an active interposer mounted on the package substrate; and an integrated circuit mounted on the active interposer, wherein the active interposer comprises: a programmable coarse-grained routing network having a plurality of channels forming deterministic routing paths with guaranteed timing closure; and smart memory circuitry configured to perform a plurality of different memory operation types including higher level of functionality than simple read and write memory accesses.

Example 2 is the multi-chip package of example 1, wherein the smart memory circuitry optionally includes a state machine configured to, without executing program code, carry out a sequence of command-based operations.

Example 3 is the multi-chip package of example 2, wherein the smart memory circuitry optionally includes microcontroller circuitry configured to perform more complex operations than the command-based operations associated with the state machine.

Example 4 is the multi-chip package of any of examples 2-3, wherein the command-based operations carried out by the state machine optionally include operations selected from the group consisting of: data update, data comparison, and linked list traversal.

Example 5 is the multi-chip package of any of examples 2-4, wherein the smart memory circuitry is optionally implemented as a Content Addressable Memory (CAM) using the state machine.

Example 6 is the multi-chip package of any of examples 2-5, wherein the smart memory circuitry is optionally implemented as a cache memory using the state machine.

Example 7 is the multi-chip package of example 3, wherein the complex operations carried out by the smart memory circuitry optionally include operations selected from the group consisting of: data placement and linked list traversal.

Example 8 is the multi-chip package of any one of examples 3 and 7, wherein the smart memory circuitry is optionally implemented as a Direct Memory Access (DMA) controller using the microcontroller circuitry.

Example 9 is the multi-chip package of any one of example 3, example 7, and example 8, wherein the integrated circuit die optionally includes logic fabric circuitry, and wherein the microcontroller circuitry is optionally configured to generate control signals for the logic fabric circuitry on the integrated circuit die.

Example 10 is the multi-chip package of any of examples 1-9, wherein the smart memory circuitry optionally includes a plurality of Random Access Memory (RAM) blocks configurable into a variable width and depth memory.

Example 11 is the multi-chip package of any of examples 1-10, wherein the integrated circuit die optionally includes an array of logical fabric segments, and wherein the smart memory circuitry optionally includes an array of smart memory segments spatially corresponding to the array of logical fabric segments.

Example 12 is the multi-chip package of example 11, wherein the array of logical fabric segments optionally includes a first group of input-output driver circuits, and wherein the array of smart memory segments optionally includes a second group of input-output driver circuits aligned with the first group of input-output driver circuits.

Example 13 is the multi-chip package of any one of examples 11-12, wherein each smart memory segment of the array of smart memory segments optionally includes a plurality of smart Random Access Memory (RAM) blocks interconnected by the programmable coarse-grained routing network.

Example 14 is the multi-chip package of example 13, wherein the smart RAM blocks are optionally interconnected using an array of configurable 4-port switch box circuits.

Example 15 is the multi-chip package of example 14, wherein the smart RAM blocks are optionally connected to the programmable coarse-grained routing network via a plurality of configurable 3-port connection box circuits.

Example 16 is the multi-chip package of any one of examples 13-15, wherein the active interposer optionally further comprises at least one dedicated-function Intellectual Property (IP) block embedded within the plurality of smart RAM blocks.

Example 17 is the multi-chip package of example 16, wherein the dedicated-function IP block optionally comprises a hardened block selected from the group consisting of: a protocol bridge and global routing control block, a global routing buffer block, a direct memory access block, and a microcontroller.

Example 18 is circuitry, the circuitry comprising: a plurality of programmable logic fabric segments; and a plurality of smart memory segments formed directly below the plurality of programmable logic fabric segments, wherein each smart memory segment of the plurality of smart memory segments comprises an array of smart Random Access Memory (RAM) blocks, and at least one smart RAM block of the array of smart RAM blocks comprises: a state machine configured to carry out operations at a first speed; and microcontroller circuitry configured to carry out operations at a second speed that is slower than the first speed.

Example 19 is the circuitry of example 18, wherein the microcontroller circuitry optionally includes a program counter, a link register, an instruction decoder, and an arithmetic logic unit.

Example 20 is the circuitry of any one of examples 18-19, wherein the at least one intelligent RAM block optionally further comprises: an address register configured to store a local address; an address input configured to receive an address signal; and a comparison circuit configured to compare a value of the address signal with a stored local address.

Example 21 is the circuitry of any one of examples 18-20, wherein the at least one intelligent RAM block optionally further comprises a counter configured to support a programmable burst length in response to a command requiring a streaming response.

Example 22 is the circuitry of any one of examples 18-21, wherein the at least one smart RAM block optionally further comprises a priority encoder configured to support a Content Addressable Memory (CAM) operation to extract an address value for a matching data word.

Example 23 is the circuitry of any one of examples 18-22, wherein the at least one intelligent RAM block optionally further comprises a power manager configured to manage a power state of the at least one intelligent RAM block.

Example 24 is an apparatus, comprising: an active interposer; and a Field Programmable Gate Array (FPGA) die mounted on the active interposer, wherein the active interposer comprises: smart memory circuitry, the smart memory circuitry comprising: a Random Access Memory (RAM) block configurable to different widths and depths, and a state machine configured to drive a sequence of operations without having to execute microcontroller program code.

For example, all optional features of the apparatus described above may also be implemented in relation to the methods or processes described herein. The foregoing merely illustrates the principles of the disclosure and various modifications can be made by those skilled in the art. The foregoing embodiments may be implemented individually or in any combination.

Example implementation

Example 1. a multi-chip package, comprising:

a package substrate;

an active interposer mounted on the package substrate; and

an integrated circuit die mounted on the active interposer, wherein the active interposer comprises:

a programmable coarse-grained routing network having a plurality of channels forming deterministic routing paths with guaranteed timing closure; and

smart memory circuitry configured to perform a plurality of different memory operation types including a higher level of functionality than simple read and write memory accesses.

Example 2. the multi-chip package of example 1, wherein the smart memory circuitry comprises a state machine configured to, without executing program code, carry out a sequence of command-based operations.

Example 3. the multi-chip package of example 2, wherein the smart memory circuitry includes microcontroller circuitry configured to perform more complex operations than the command-based operations associated with the state machine.

Example 4. the multi-chip package of example 2, wherein the command-based operations carried out by the state machine include operations selected from the group consisting of: data update, data comparison, and linked list traversal.

Example 5. the multi-chip package of example 2, wherein the smart memory circuitry is implemented as a Content Addressable Memory (CAM) using the state machine.

Example 6. the multi-chip package of example 2, wherein the smart memory circuitry is implemented as a cache memory using the state machine.

Example 7. the multi-chip package of example 3, wherein the complex operations carried out by the smart memory circuitry include operations selected from the group consisting of: data placement and linked list traversal.

Example 8. the multi-chip package of example 3, wherein the smart memory circuitry is implemented as a Direct Memory Access (DMA) controller using the microcontroller circuitry.

Example 9. the multi-chip package of example 3, wherein the integrated circuit die includes logic fabric circuitry, and wherein the microcontroller circuitry is configured to generate control signals for the logic fabric circuitry on the integrated circuit die.

Example 10. the multi-chip package of example 1, wherein the smart memory circuitry includes a plurality of Random Access Memory (RAM) blocks capable of being organized into variable width and depth memory.

Example 11. the multi-chip package of example 1, wherein the integrated circuit die includes an array of logical fabric segments, and wherein the smart memory circuitry includes an array of smart memory segments spatially corresponding to the array of logical fabric segments.

Example 12. the multi-chip package of example 11, wherein the array of logical fabric segments includes a first group of input-output driver circuits, and wherein the array of smart memory segments includes a second group of input-output driver circuits aligned with the first group of input-output driver circuits.

Example 13. the multi-chip package of example 11, wherein each smart memory segment in the array of smart memory segments includes a plurality of smart Random Access Memory (RAM) blocks interconnected by the programmable coarse-grained routing network.

Example 14. the multi-chip package of example 13, wherein the smart RAM blocks are interconnected using an array of configurable 4-port connection box circuits.

Example 15. the multi-chip package of example 14, wherein the smart RAM blocks are connected to the programmable coarse-grained routing network via a plurality of configurable 3-port switch box circuits.

Example 16. the multi-chip package of example 13, wherein the active interposer further comprises at least one special function Intellectual Property (IP) block embedded within the plurality of smart RAM blocks.

Example 17. the multi-chip package of example 16, wherein the special function IP block comprises a hardened block selected from the group consisting of: a protocol bridge and global routing control block, a global routing buffer block, a direct memory access block, and a microcontroller.

Example 18. circuitry, comprising:

a plurality of programmable logic fabric segments; and

a plurality of smart memory segments formed directly below the plurality of programmable logic fabric segments, wherein each smart memory segment of the plurality of smart memory segments comprises an array of smart Random Access Memory (RAM) blocks, and at least one smart RAM block of the array of smart RAM blocks comprises:

a state machine configured to carry out operations at a first speed; and

microcontroller circuitry configured to carry out operations at a second speed that is slower than the first speed.

Example 19. the circuitry of example 18, wherein the microcontroller circuitry comprises a program counter, a link register, an instruction decoder, and an arithmetic logic unit.

Example 20. the circuitry of example 18, wherein the at least one smart RAM block further comprises:

an address register configured to store a local address;

an address input configured to receive an address signal; and

a comparison circuit configured to compare a value of the address signal to a stored local address.

Example 21. the circuitry of example 18, wherein the at least one smart RAM block further comprises a counter configured to support a programmable burst length in response to a command requiring a streaming response.

Example 22. the circuitry of example 18, wherein the at least one smart RAM block further comprises a priority encoder 1012, the priority encoder 1012 configured to support a Content Addressable Memory (CAM) operation to extract an address value for a matching data word.

Example 23. the circuitry of example 18, wherein the at least one smart RAM block further comprises a power manager configured to manage a power state of the at least one smart RAM block.

Example 24. an apparatus, comprising:

an active interposer; and

a Field Programmable Gate Array (FPGA) die mounted on the active interposer, wherein the active interposer comprises:

smart memory circuitry, the smart memory circuitry comprising: a Random Access Memory (RAM) block configurable to different widths and depths, and a state machine configured to drive a sequence of operations without having to execute microcontroller program code.
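By way of a hedged illustration only, the command-based operations attributed to the state machine in Examples 2 and 4 (data update, data comparison, and linked list traversal) can be pictured as a fixed mapping from a command to a short, predetermined sequence of RAM accesses. The sketch below is a software model and not the hardware described herein. The names `Command` and `run_command` are hypothetical, the update is assumed to be an addition, and each list node is assumed to store only the address of the next node.

```python
from enum import Enum, auto
from typing import List


class Command(Enum):
    UPDATE = auto()    # read-modify-write of one word
    COMPARE = auto()   # compare one word against an operand
    TRAVERSE = auto()  # follow next-pointers for a given number of hops


def run_command(ram: List[int], command: Command, address: int, operand: int) -> int:
    """Carry out one command as a fixed sequence of RAM accesses.

    The sequence is selected by the command alone; no stored instruction
    stream is fetched or decoded, which is the contrast with the
    microcontroller path drawn in Example 3.
    """
    if command is Command.UPDATE:
        # Assumed update rule: add the operand to the stored word.
        ram[address] = ram[address] + operand
        return ram[address]
    if command is Command.COMPARE:
        # Returns 1 on a match and 0 otherwise.
        return int(ram[address] == operand)
    if command is Command.TRAVERSE:
        # Assumed node layout: the word at a node's address is the next address.
        node = address
        for _ in range(operand):
            node = ram[node]
        return node
    raise ValueError(f"unsupported command: {command}")
```

Under these assumptions, a traversal of `operand` hops costs exactly `operand` reads, which reflects the low and predictable cycle count that the disclosure associates with state-machine operation.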

Claims

1. A multi-chip package, comprising:

a package substrate;

an active interposer mounted on the package substrate; and

2. The multi-chip package of claim 1, wherein the smart memory circuitry comprises a state machine configured to, without executing program code, carry out a sequence of command-based operations.

3. The multi-chip package of claim 2, wherein the smart memory circuitry includes microcontroller circuitry configured to perform more complex operations than the command-based operations associated with the state machine.

4. The multi-chip package of claim 2, wherein the command-based operations carried out by the state machine include operations selected from the group consisting of: data update, data comparison, and linked list traversal.

5. The multi-chip package of claim 2, wherein the smart memory circuitry is implemented as a Content Addressable Memory (CAM) using the state machine.

6. The multi-chip package of claim 2, wherein the smart memory circuitry is implemented as a cache using the state machine.

7. The multi-chip package of claim 3, wherein the complex operations carried out by the smart memory circuitry comprise operations selected from the group consisting of: data placement and linked list traversal.

8. The multi-chip package of claim 3, wherein the smart memory circuitry is implemented as a Direct Memory Access (DMA) controller using the microcontroller circuitry.

9. The multi-chip package of claim 3, wherein the integrated circuit die includes logic fabric circuitry, and wherein the microcontroller circuitry is configured to generate control signals for the logic fabric circuitry on the integrated circuit die.

10. The multi-chip package of claim 1, wherein the smart memory circuitry comprises a plurality of Random Access Memory (RAM) blocks capable of being organized into variable width and depth memories.

11. The multi-chip package of any of claims 1-10, wherein the integrated circuit die includes an array of logical fabric segments, and wherein the smart memory circuitry includes an array of smart memory segments that spatially correspond to the array of logical fabric segments.

12. The multi-chip package of claim 11, wherein the array of logical fabric segments includes a first group of input-output driver circuits, and wherein the array of smart memory segments includes a second group of input-output driver circuits aligned with the first group of input-output driver circuits.

13. The multi-chip package of claim 11, wherein each smart memory segment in the array of smart memory segments comprises a plurality of smart Random Access Memory (RAM) blocks interconnected by the programmable coarse-grained routing network.

14. The multi-chip package of claim 13, wherein the smart RAM blocks are interconnected using an array of configurable 4-port connection box circuits.

15. The multi-chip package of claim 14, wherein the smart RAM blocks are connected to the programmable coarse-grained routing network via a plurality of configurable 3-port switch box circuits.

16. The multi-chip package of claim 13, wherein the active interposer further comprises at least one special function Intellectual Property (IP) block embedded within the plurality of smart RAM blocks.

17. The multi-chip package of claim 16, wherein the special function IP block comprises a hardened block selected from the group consisting of: a protocol bridge and global routing control block, a global routing buffer block, a direct memory access block, and a microcontroller.

18. Circuitry, comprising:

a plurality of programmable logic fabric segments; and

a state machine configured to carry out operations at a first speed; and

19. The circuitry of claim 18, wherein the microcontroller circuitry comprises a program counter, a link register, an instruction decoder, and an arithmetic logic unit.

20. The circuitry of claim 18, wherein the at least one smart RAM block further comprises:

an address register configured to store a local address;

an address input configured to receive an address signal; and

21. The circuitry of claim 18, wherein the at least one smart RAM block further comprises a counter configured to support a programmable burst length in response to a command requiring a streaming response.

22. The circuitry of claim 18, wherein the at least one smart RAM block further comprises a priority encoder 1012, the priority encoder 1012 configured to support a Content Addressable Memory (CAM) operation to extract an address value for a matching data word.

23. The circuitry of claim 18, wherein the at least one smart RAM block further comprises a power manager configured to manage a power state of the at least one smart RAM block.

24. An apparatus, comprising:

an active interposer; and

25. A multi-chip package, comprising:

a package substrate;

an active interposer mounted on the package substrate; and

means for forming a deterministic routing path with guaranteed timing closure; and

means for performing a plurality of different memory operation types that include a higher level of functionality than simple read and write memory accesses.