WO2015153693A1 - Interface, interface methods, and systems for operating memory bus attached computing elements - Google Patents


Info

Publication number
WO2015153693A1
Authority
WO
WIPO (PCT)
Prior art keywords
computing
ximm
address
memory
computing module
Prior art date
Application number
PCT/US2015/023730
Other languages
French (fr)
Inventor
Stephen Belair
Parin DALAL
Original Assignee
Xockets IP, LLC
Priority date
Filing date
Publication date
Application filed by Xockets IP, LLC filed Critical Xockets IP, LLC
Publication of WO2015153693A1
Priority to US15/283,287 (US20170109299A1)
Priority to US15/396,318 (US20170237672A1)
Priority to US16/129,762 (US11082350B2)
Priority to US18/085,196 (US20230231811A1)

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14: Handling requests for interconnection or transfer
    • G06F 13/16: Handling requests for interconnection or transfer for access to memory bus
    • G06F 13/1668: Details of memory controller

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bus Control (AREA)

Abstract

A system can include at least one computing module comprising a physical interface for connection to a memory bus, a processing section configured to decode at least a predetermined range of physical address signals received over the memory bus into computing instructions for the computing module, and at least one computing element configured to execute the computing instructions; and a controller attached to the memory bus and configured to generate the physical address signals with corresponding control signals.

Description

INTERFACE, INTERFACE METHODS, AND SYSTEMS FOR
OPERATING MEMORY BUS ATTACHED COMPUTING ELEMENTS
TECHNICAL FIELD
The present disclosure relates generally to systems having computing elements attached to a memory bus, and particularly to interfaces for accessing such memory bus attached computing elements.
BRIEF DESCRIPTION OF DRAWINGS
FIG. 1 is a block schematic diagram of a system according to an embodiment.
FIG. 2 is a block schematic diagram of a system according to another embodiment.
FIG. 3 is a block diagram of a memory bus attached computing module that can be included in embodiments.
FIG. 4 is a block diagram of a computing module (XIMM) that can be included in embodiments.
FIG. 5 is a diagram showing XIMM address mapping according to an embodiment.
FIG. 6 is a diagram showing separate read/write address ranges for XIMMs according to an embodiment.
FIG. 7 is a block schematic diagram of a system according to another embodiment.
FIG. 8 is a block schematic diagram of a system according to a further embodiment.
FIG. 9 is a block diagram of XIMM address memory space mapping according to an embodiment.
FIG. 10 is a flow diagram of a XIMM data transfer process according to an embodiment.
FIG. 11 is a flow diagram of a XIMM data transfer process according to another embodiment.
FIG. 12 is a block schematic diagram showing data transfers in a system according to embodiments.
FIG. 13 is a diagram showing a XIMM according to another embodiment.
FIG. 14 is a timing diagram of a conventional memory access.
FIGS. 15A to 15F are timing diagrams showing XIMM accesses according to various embodiments.
FIGS. 16A to 16C are diagrams showing a XIMM clock synchronization according to an embodiment.
FIG. 17 is a flow diagram of a method according to an embodiment.
DETAILED DESCRIPTION OF THE EMBODIMENTS
According to embodiments, systems can include a computing module attached to a memory bus that can execute computing operations according to compute requests included in at least the physical address signals received over the memory bus. Computing modules can include processing sections to decode computing requests from received addresses, as well as computing elements for performing such computing requests. In the various embodiments described, like items are referred to by the same reference character but with the leading digit(s) corresponding to the figure number.
FIG. 1 shows a system 100 according to an embodiment. A system 100 can include one or more memory bus attached computing modules 102, a memory bus 104, and a controller device 106. Each computing module 102 can include a processing section 108 which can decode signals 110 received over the memory bus 104 into computing requests to be performed by a computing module 102. In particular embodiments, processing section 108 can decode all or a portion of a physical address to arrive at computing requests to be performed. Optionally, a system 100 can include one or more conventional memory devices 112 attached to the memory bus 104.
Conventional memory device 112 can have storage locations corresponding to physical addresses received over memory bus 104.
According to embodiments, computing module 102 can be accessible via interfaces and/or protocols generated from other devices and processes, which are encoded into memory bus signals. Such signals appear as memory device requests, but are, in fact, operational requests for execution by a computing module 102. FIG. 2 shows a system 200 according to another embodiment. In the embodiment shown, a control device can include a memory controller 206-0 and a host device 206-1. Further, computing module 202 is referred to as a XIMM. In some embodiments, a XIMM 202 can include a physical interface compatible with a dual-in-line memory module (DIMM) type memory bus. In very particular embodiments, a XIMM 202 can operate according to a double data rate (DDR) type memory interface (e.g., DDR3, DDR4). However, in alternate embodiments, a XIMM 202 can be compatible with any other suitable memory bus. Other memory buses can include, without limitation, memory buses with separate read and write data buses and/or non-multiplexed addresses. In the embodiment shown, a XIMM 202 can include an arbiter 208. An arbiter 208 can decode physical addresses into compute operation requests, in addition to performing various other functions on the XIMM 202.
A memory controller 206-0 can generate memory access signals on memory bus 204 according to requests issued from host device 206-1 (or some other device). As noted, in particular embodiments, a memory controller 206-0 can be a DDR type controller attached to a DIMM type memory bus.
A host device 206-1 can receive and/or generate computing requests based on an application program or the like. A host device 206-1 can include a request encoder 214. A request encoder 214 can encode computing operation requests into memory requests executable by memory controller 206-0. Thus, a request encoder 214 and memory controller 206-0 can be conceptualized as forming a host device-XIMM interface. According to embodiments, a host device-XIMM interface can be a lowest level protocol in a hierarchy of protocols to enable a host device to access a XIMM 202. According to embodiments, a host device-XIMM interface can encapsulate the interface and semantics of accesses used in reads and writes initiated by the host device 206-1 to initiate, control, or configure computing operations of XIMMs 202. At the interface level, XIMMs 202 can appear to a host device 206-1 as memory devices having a base physical address and some memory address range (i.e., the XIMM has some size, but it is understood that the size represents accessible operations rather than storage locations).
Optionally, a system 200 can also include a conventional DIMM 212.
In some embodiments, a system 200 can include memory channels accessible by a memory controller 206-0. Accesses to a XIMM 202 can go through the memory controller 206-0 for the channel that the XIMM 202 resides on. According to embodiments, memory accesses to a XIMM 202 can go through the same operations as those for a conventional memory module 212 residing on the channel (or that could reside on the channel). However, such accesses vary substantially from conventional memory access operations. Based on address information, an arbiter 208 within a XIMM 202 can respond to a host device memory access like a conventional memory module 212, but within the XIMM 202 such an access can identify one or more targeted resources of the XIMM 202 (input/output queues, a scatter-list for DMA, etc.) as well as the device mastering the transaction (e.g., host device, network interface (NIC), or other bus attached device such as a peripheral component interconnect (PCI) type device). Viewed this way, a physical address that is used in an access of a XIMM 202 encodes the semantics of the access.
A host device-XIMM protocol according to embodiments can be in contrast to many conventional communication protocols. In conventional protocols, there can be an outer layer-2 header which expresses the semantics of an access over the physical communication medium. A host device-XIMM interface, according to embodiments, can depart from such traditional communication methods because it occurs over a memory bus, and in particular embodiments, can be mediated by a memory controller (e.g., 206-0). Thus, according to some embodiments, a physical memory address can serve as the equivalent of the L2 header in the communication between the host device 206-1 and a XIMM 202, and the address decode performed by an arbiter 208 within the XIMM 202 can logically perform the same function as an L2 header decode for a particular access (where such decoding can take into account the type of access (read or write)).
FIG. 3 is a block schematic diagram of a XIMM 302 according to an embodiment. A XIMM 302 can be formed on a structure 316 which includes a physical interface 318 for connection to a memory bus. A XIMM 302 can include logic 320 and memory 322. Logic 320 can include circuits for performing functions of a processing section (108 in FIG. 1) and/or arbiter (208 in FIG. 2), including but not limited to processor circuits and logic, including programmable logic. Memory 322 can include any suitable memory, including DRAM, static RAM (SRAM), and nonvolatile memory (e.g., flash electrically erasable and programmable read only memory, EEPROM), as but a few examples. However, as noted above, unlike a conventional memory module, addresses received at physical interface 318 do not directly map to storage locations within memory 322, but rather are decoded into computing operations. Such computing operations may require a persistent state, which can be maintained in memory 322.
FIG. 4 is a diagram of another, very particular example of a XIMM 402. A XIMM 402 can include a printed circuit board 416 that includes a DIMM type physical interface 418. Mounted on the XIMM 402 can be circuit components 436, which in the embodiment shown can include processor cores, programmable logic, a programmable switch (e.g., network switch) and memory (as described for other embodiments herein). In addition, the XIMM 402 of FIG. 4 can further include a network connection 434. A network connection 434 can enable a physical connection to a network. In some embodiments, this can include a wired network connection compatible with IEEE 802 and related standards. However, in other embodiments, a network connection can include a wireless connection.
As disclosed herein, physical memory addresses received by XIMMs can modify XIMM operations. FIG. 5 shows one example of XIMM address encoding according to one particular embodiment. A base portion of the physical address (BASE ADD) can identify a XIMM. A next portion of the address (ADD Ext1) can identify a resource of the XIMM. A next portion of the address (ADD Ext2) can identify a "host" source for the transaction (e.g., host device, NIC or other device, such as a PCI attached device).
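To make the FIG. 5 encoding concrete, the following is a minimal C sketch of composing a XIMM physical address from the three fields described above. The bit positions and field widths are illustrative assumptions; the text fixes only the ordering of BASE ADD, ADD Ext1, and ADD Ext2.

    #include <stdint.h>

    /* Assumed field positions (illustrative only). */
    #define ADD_EXT1_SHIFT 12  /* ADD Ext1: selects a XIMM resource      */
    #define ADD_EXT2_SHIFT  6  /* ADD Ext2: identifies the "host" source */

    /* Compose a physical address: base selects the XIMM, ext1 a resource
     * (e.g., an I/O queue), ext2 the transaction master (host, NIC, PCI). */
    static inline uint64_t ximm_encode_addr(uint64_t base_add,
                                            unsigned ext1, unsigned ext2)
    {
        return base_add | ((uint64_t)ext1 << ADD_EXT1_SHIFT)
                        | ((uint64_t)ext2 << ADD_EXT2_SHIFT);
    }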
Some embodiments can include a conventional memory controller with a global write buffer (GWB) or another similar memory caching structure. Such a memory controller can service read requests from its GWB when the address of a read matches the address of a write in the GWB. Such optimizations may not be suitable for XIMM accesses in some embodiments, since XIMMs are not true memories. For example, a write to a XIMM can update the internal state of the XIMM, and a subsequent read would have to follow after the write has been performed at the XIMM (i.e., such accesses have to be performed at the XIMM, not at the memory controller).
Accordingly, according to some embodiments, XIMMs can have read addresses that are different from their write addresses. More particularly, a same XIMM can have different read and write address ranges. In such an arrangement, reads from a XIMM that has been written to will not return data from the GWB.
FIG. 6 is a table showing memory mapping according to one particular embodiment. Physical memory addresses can include a base portion (BASE ADDn, where n is an integer) and an offset portion (OFFSET(s)). For one XIMM (XIMM1), all read operations will fall within addresses starting with base address BASE ADD0, while all write operations to the same XIMM1 will fall within addresses starting with BASE ADD1.
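A minimal sketch of this split mapping follows, with one read window and one write window per XIMM; the base values and range size are placeholders, not values from the disclosure.

    #include <stdbool.h>
    #include <stdint.h>

    struct ximm_map {
        uint64_t rd_base;  /* e.g., BASE ADD0: all reads land here  */
        uint64_t wr_base;  /* e.g., BASE ADD1: all writes land here */
        uint64_t size;     /* size of each address range            */
    };

    /* Reads never hit the write range, so a controller's global write
     * buffer can never satisfy a XIMM read with stale write data. */
    static bool ximm_addr_is_read(const struct ximm_map *m, uint64_t pa)
    {
        return pa >= m->rd_base && pa < m->rd_base + m->size;
    }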
FIG. 7 shows a system 700 according to a further embodiment. A host device 706-1 can include a driver (XKD) 714. XKD 714 can be a program executed by host device 706-1 which can encode requests into physical addresses, as described herein. A memory controller 706-0 can include a GWB 738. XIMM devices (702-0/1) can have read addresses that are different from write addresses (ADD Read != ADD Write). As shown, a conventional memory device (DIMM) 712 has conventional read/write address mapping, where data written to an address is read back from the same address.
Some conventional host devices (e.g., x86 type processors) can utilize processor speculative reads. Therefore, if a XIMM is viewed as a write-combining or cacheable memory by such a processor, the processor may speculate with reads to the XIMMs. As described herein, reads to XIMMs are not data accesses, but rather encoded operations; thus speculative reads could be destructive to a XIMM state.
Accordingly, according to some embodiments, in systems having speculative reads, XIMM read address ranges can be mapped as uncached. Because uncached reads can incur latencies, in some embodiments, XIMM accesses can vary according to data output size. For encoded read operations that result in smaller data outputs from the XIMMs (e.g., 64 to 128 bytes), such data can be output in a conventional read fashion. However, for larger data sizes, where possible, such accesses can involve direct memory access (DMA) type transfers (or DMA equivalents of other memory bus types).
In systems according to some embodiments, write caching can be employed. While embodiments can include XIMM write addresses that are uncached (as in the case of read addresses), such an arrangement may be less desirable due to the performance hit incurred, particularly if accesses include burst writes of data to XIMMs. Write-back caching can also yield unsuitable results if implemented with XIMMs. Such caching can result in consecutive writes to the same cache line, resulting in write data from a previous access being overwritten. This can essentially destroy any previous write operation to the XIMM address. Write-through caching can incur extra overhead that is unnecessary, particularly since there may never be reads to addresses that are written (when XIMM read addresses are different from their write addresses).
Accordingly, according to some embodiments, a XIMM write address range can be mapped as write-combining. Thus, such writes can be stored and combined in some structure (e.g., a write combine buffer) and then written in order into the XIMM.
FIG. 8 is a block diagram of a system 800 according to a further embodiment. A host processor 806-1 can access an address space having an address mapping 840 that includes physical addresses corresponding to XIMM reads 840-0, XIMM writes 840-1 and conventional memory (e.g., DIMM) read/writes 840-2. Host processor 806-1 can include a request encoder (not shown), which can be a driver, logic or combination thereof, which can encode requests into memory accesses to XIMM address spaces 840-0/1.
The particular system 800 shown can also include a cache controller 842 connected to memory bus 804. A cache controller 842 can have a cache policy 846, which in the embodiment shown, can treat XIMM read addresses as uncached, XIMM write addresses as write-combining, and addresses for conventional memories (e.g., DIMMs) as cacheable. A cache memory 844 can be connected to the cache controller 842. While FIG. 8 shows a lookaside cache, alternate embodiments can include a look-through cache.
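On a Linux host, the cache policy of FIG. 8 might be applied when the driver maps the two ranges; the following is a hedged sketch (the disclosure does not give the XKD driver's actual mapping code, and the names here are illustrative).

    #include <linux/errno.h>
    #include <linux/io.h>

    static void __iomem *ximm_rd;  /* read range: uncached         */
    static void __iomem *ximm_wr;  /* write range: write-combining */

    static int ximm_map_ranges(phys_addr_t rd_base, phys_addr_t wr_base,
                               size_t range)
    {
        ximm_rd = ioremap(rd_base, range);     /* uncached: no speculation */
        ximm_wr = ioremap_wc(wr_base, range);  /* write-combining writes   */
        if (!ximm_rd || !ximm_wr)
            return -ENOMEM;
        return 0;
    }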
According to embodiments, an address that accesses a XIMM can be decomposed into a base physical address and an offset (shown as ADD Ext1, ADD Ext2 in FIG. 5). Thus, in some embodiments, each XIMM can have a base physical address which represents the memory range hosted by the XIMM to the memory controller. In such embodiments, a base physical address can be used to select a XIMM, and the access semantics can be encoded in the offset bits of the address. Accordingly, according to some embodiments, a base address can identify a XIMM to be accessed, and the remaining offset bits can indicate operations that occur in the XIMM. Thus, it is understood that an offset between base addresses will be large enough to accommodate the entire encoded address map. The size of the address map encoded in the offset will become the actual "size" of the XIMM memory device, which is the size of the memory range that will be mapped by the request encoder (e.g., XKD kernel driver) for the memory interface to each XIMM.
As noted above, for systems with memory controllers having a GWB or similar type of caching, XIMMs can have separate read and write address ranges. Furthermore, read address ranges can be mapped as uncached, in order to ensure that no speculative reads are made to a XIMM. Writes can be mapped as write-combining in order to ensure that writes always get performed when they are issued, and with suitable performance (see FIGS. 6-8, for example).
Thus, a XIMM can look like a memory device with separate read and write address ranges, with each separate range having separate mapping policies. A total size of a XIMM memory device can thus include a sum of both its read and write address ranges.
According to embodiments, address ranges for XIMMs can be chosen to be a multiple of the largest page size that can be mapped (e.g., either 2 or 4 Mbytes). Since these page table mappings may not be backed up by RAM pages, but are in fact a device mapping, a host kernel can be configured for as many large pages as it takes to map a maximum number of XIMMs. As but one very particular example, there can be 32 to 64 large pages/XIMM, given that the read and write address ranges must both have their own mappings. FIG. 9 is a diagram showing memory mapping according to an embodiment. A memory space 950 of a system can include pages, and address ranges for XIMMs can be mapped to groups of such pages. For example, address ranges for XIMM0 can be mapped from page 948i (Page i) to page 948k (Page k).

As noted above, according to some embodiments, data transfers between XIMMs and a data source/sink can vary according to size. FIG. 10 is a flow diagram showing data transfer processes that can be included in embodiments. A data transfer process 1001 can include determining that a XIMM data access is to occur (1052). This can include determining if a data write or data read is to occur to a XIMM (note, again, this is not a conventional write operation or read operation). If data is over a certain size (Y from 1054), data can be transferred to/from a XIMM with a DMA (or equivalent) type of data transfer. If data is not over a certain size (N from 1054), data can be transferred to/from a XIMM with a conventional data transfer operation 1058 (e.g., CPU controlled writing). It is noted that a size used in box 1054 can be different for read and write operations.
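A sketch of the FIG. 10 dispatch is shown below; the threshold value and the two transfer helpers are assumptions for illustration (the disclosure only fixes that small transfers use conventional accesses and large ones use DMA or an equivalent).

    #include <stddef.h>
    #include <stdint.h>

    #define XIMM_DMA_THRESHOLD 128  /* assumed cutoff; may differ for reads vs. writes */

    void ximm_pio_xfer(uint64_t pa, void *buf, size_t len);  /* hypothetical */
    void ximm_dma_xfer(uint64_t pa, void *buf, size_t len);  /* hypothetical */

    /* Box 1054: choose the transfer mechanism by data size. */
    void ximm_transfer(uint64_t pa, void *buf, size_t len)
    {
        if (len > XIMM_DMA_THRESHOLD)
            ximm_dma_xfer(pa, buf, len);  /* DMA or equivalent (Y from 1054) */
        else
            ximm_pio_xfer(pa, buf, len);  /* conventional transfer 1058      */
    }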
FIG. 11 is a flow diagram showing another data transfer process 1101. Process 1101 can be like that of FIG. 10, but is directed to write accesses. Further, in the event the amount of write data is not over a certain size (N from 1154), a write can be a write-combining type write 1160.
FIG. 12 is a block schematic diagram showing possible data transfer operations for a system 1200 according to embodiments. A system 1200 can include items like those of FIG. 2; however, a controller can include a memory controller 1206-0, one or more processors 1206-1, a host bridge 1206-2 and another bus attached device 1206-3. Data transfer paths can include a path 1262-0 between processor(s) 1206-1 and a XIMM 1202-0, a path 1262-1 between another bus attached (e.g., PCI) device 1206-3 and a XIMM 1202-0, and a path 1262-2 between one XIMM 1202-0 and another XIMM 1202-1. In some embodiments, such data transfers (1262-0 to -2) can be DMA or equivalent type transfers.
In particular embodiments, an interface can be compatible with DRAM type accesses (e.g., DIMM accesses). In such embodiments, accesses to the XIMM can be via a row address strobe (RAS) and then (in some cases) a column address strobe (CAS) phase of a memory access. As understood from embodiments herein, internally to the XIMM no row and column selection of memory cells need occur. Rather, the physical address provided in the RAS and (optionally) CAS phases can tell an arbiter within a XIMM which resource of the XIMM is the target of the memory operation and identify which device is mastering the transaction (e.g., host device, NIC, or PCI device). While embodiments can utilize any suitable memory interface, as noted herein, particular embodiments can include operations in accordance with a DDR interface.
As noted herein, a XIMM can include an arbiter for handling accesses over a memory bus. In embodiments where address multiplexing is used (i.e., a row address is followed by a column address), an interface/protocol can encode certain operations along address boundaries of the most significant portion of a multiplexed address (most often the row address). Further, such encoding can vary according to access type.
In particular embodiments, how an address is encoded can vary according to the access type. In an embodiment with row and column addresses, an arbiter within a XIMM can be capable of locating the data being accessed for an operation and can return data in a subsequent CAS phase of the access. In such an embodiment, in read accesses, a physical address presented in the RAS phase of the access identifies the data for the arbiter so that the arbiter has a chance to respond in time during the CAS phase. In a very particular embodiment, read addresses for XIMMs are aligned on row address boundaries (e.g., a 4K boundary assuming a 12-bit row address).
While embodiments can include address encoding limitations on read accesses to ensure rapid response, such a limitation may not be included in write accesses, since no data will be returned. For writes, an interface may have a write address (e.g., row address, or both row and column address) completely determine a target within the XIMM to send the write data to. In some systems, a memory controller can be included that utilizes error correction/detection (ECC).
According to some embodiments, in such a system ECC can be disabled, at least for accesses to XIMMs. However, in other embodiments, XIMMs can include the ECC algorithm utilized by the memory controller, and generate the appropriate ECC bits for data transfers.
FIG. 13 shows a XIMM 1302 according to an embodiment. A XIMM 1302 can interface with an in-line module compatible bus (IMM BUS 1304) that can include address and control inputs (ADD/CTRL) as well as data inputs/outputs (DQ). An arbiter (ARBITER) 1308 can decode address and/or control information to derive transaction information, such as a targeted resource, as well as a host (controlling device) for the transaction. XIMM 1302 can include one or more resources, including computing resources (COMP RESOURCES 1364) (e.g., processor cores), one or more input queues 1366 and one or more output queues 1368. Optionally, a XIMM 1302 can include an ECC function 1370 to generate appropriate ECC bits for data transmitted over DQ.
FIG. 14 shows a conventional memory access over a DDR interface; in particular, a conventional RAM read access. A row address (RADD) is applied with a RAS signal (active low), and a column address (CADD) is applied with a CAS signal (active low). It is understood that t0 and t1 can be synchronous with a timing clock (not shown). According to a read latency, output data (Q) can be provided on a data IO (DQ).
FIG. 15A shows a XIMM access over a DDR interface according to one embodiment. FIG. 15A shows a "RAS" only access. In such an access, unlike a conventional access, operations can occur in response to address data available on a RAS strobe. In some embodiments, additional address data can be presented in a CAS strobe to further define an operation. However, in other embodiments, all operations for a XIMM can be dictated within the RAS strobe.
FIG. 15B shows XIMM accesses over a DDR interface according to another embodiment. FIG. 15B shows consecutive "RAS" only accesses. In such accesses, operations within a XIMM or XIMMs can be initiated by RAS strobes only.
FIG. 15C shows a XIMM access over a DDR interface according to a further embodiment. FIG. 15C shows a RAS only access in which data are provided with the address. It is understood that the timing of the write data can vary according to system configuration.
FIG. 15D shows a XIMM access over a DDR interface according to another embodiment. FIG. 15D shows a "RAS CAS" read type access. In such an access, operations can occur like a conventional memory access, supplying a first portion XCOM1 on a RAS strobe and a second portion XCOM2 on a CAS strobe. Together XCOM1/XCOM2 can define a transaction to a XIMM.
FIG. 15E shows a XIMM access over a DDR interface according to another embodiment. FIG. 15E shows a "RAS CAS" write type access. In such an access, operations can occur like a conventional memory access, supplying a first portion XCOM0 on a RAS strobe and a second portion XCOM1 on a CAS strobe. As in the case of FIG. 15C, timing of the write data can vary according to system configuration.
It is noted that FIGS. 15A to 15E show but one very particular example of XIMM access operations on a DRAM DDR compatible bus. However, embodiments can include any suitable memory device bus/interfaces, including but not limited to hybrid memory cube (HMC) and those promulgated by Rambus, to name just two.
FIG. 15F shows XIMM access operations according to a more general embodiment. Memory access signals (ACCESS SIGNALS) can be generated in a memory interface/access structure (MEMORY ACCESS). Such signals can be compatible with signals to access one or more memory devices. However, within such access signals can be XIMM metadata. XIMM metadata can be used by a XIMM to perform any of the various XIMM functions described herein, or equivalents.
In a very particular embodiment, all reads of different resources in a XIMM can fall on a separate range (e.g., 4K) of the address. An address map can divide the address offset into three (or optionally four) fields: Class bits; Selector bits; Additional address metadata; and optionally a Read/write bit. Such fields can have the following features:
Class bits: used to define the type of transaction encoded in the address.
Selector bits: can select a FIFO or a processor (e.g., ARM core) within a particular class, or can specify different control operations.
Additional address metadata: if any, which is relevant for a particular class of transaction involving the compute elements.
Read/write: one (or more) bit can determine whether the access applies to a read or a write. This can be a highest bit of the physical address offset for the XIMM.
Furthermore, an address map should be large enough in range to accommodate transfers to/from any given processor/resource. In some embodiments, such a range can be 256 Kbytes, preferably 512 Kbytes. Since the address mapping may not be backed up by physical pages in the page tables, the only resource that is potentially wasted with this large range is virtual address space, which is not an issue since the processor is running in 64-bit mode.
Input formats according to very particular embodiments will now be described. The description below sets out an arrangement in which three address classes can be encoded in the upper bits of the physical address (optionally allowing for a R/W bit), with a static 512K address range for each processor/resource. The basic address format for a XIMM according to this particular embodiment is as follows:
[Table: basic XIMM address offset format (not reproduced in this text version).]
In such an address mapping, each XIMM can have a mapping up to 128 Mbytes in size, with each read/write address range being 64 Mbytes in size. There can be 16 Mbytes/32 = 512 Kbytes available for data transfer to/from a processor/resource. There can be an additional 4 Mbytes available for large transfers to/from only one processor/resource at a time. In the format above, bits 25, 24 of the address offset can determine the address class. An address class determines the handling and format of the access. In one embodiment, there can be three address classes.
Control: There can be two types of Control inputs - Global Control and Local Control. Control inputs are used for clock synchronization between a request encoder (e.g., XKD) and an Arbiter of the XIMM; metadata reads; assigning physical address ranges to a compute element, and so on. Control inputs may access FIFOs with control data in them or may result in the Arbiter updating its internal state.
APP: Accesses which are of the APP class can target a processor (ARM) core (i.e., computing element) and involve data transfer into/out of a compute element.
DMA: The access can be performed by a DMA device. Optionally, whether it is a read or write can be specified in the R/W bit in the address for the access.
Each address class can determine a different address format. An arbiter within the XIMM can interpret the address based upon the class and whether the access is a read or write. Examples of particular address formats are discussed below.
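As one illustration, a C sketch of an arbiter-style decode of the address offset is given below. It uses the bit positions stated in the text (class in bits 25:24, the Global/Local bit at 23, select bits 22...12, metadata in the low 12 bits); the struct and function names are illustrative.

    #include <stdint.h>

    enum ximm_class { XIMM_CONTROL = 0x0, XIMM_APP = 0x1, XIMM_DMA = 0x2 };

    struct ximm_decoded {
        enum ximm_class cls;   /* bits 25:24                          */
        unsigned global;       /* bit 23 (Control class only)         */
        unsigned select;       /* bits 22...12: target/control select */
        unsigned metadata;     /* bits 11...0: address metadata       */
    };

    static struct ximm_decoded ximm_decode_offset(uint32_t off)
    {
        struct ximm_decoded d;
        d.cls      = (enum ximm_class)((off >> 24) & 0x3);
        d.global   = (off >> 23) & 0x1;
        d.select   = (off >> 12) & 0x7ff;
        d.metadata =  off        & 0xfff;
        /* Note: for APP class inputs the Target Select field instead
         * occupies bits 23...19 (see the APP format below). */
        return d;
    }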
Possible address formats for the different classes are as follows:
A. Control Address Format: class bits 00b: This is the address format for Control Class inputs. Bits 25 and 24 can be 0. Bit 23 can be used to specify whether the Control input is Global or Local: Global control inputs can be for the Arbiter of a XIMM, whereas a Local control input can be for control operations of a particular processor/resource within the XIMM (e.g., computing element, ARM core, etc.). For Global Control inputs, bits 22...12 are available for a Control type, whereas for Local Control, bits 22...12 are available to specify a target resource. An initial data word of 64 bits can be followed by "payload" data words, which can provide for additional decoding or control values.
[Table: Control class address format (not reproduced in this text version).]
Bit 23 = 1 specifies Global Control. XXX can be zero for reads (i.e., the lower 12 bits), but these 12 bits can hold address metadata for writes, which may be used for Local Control inputs. Since Control inputs are not data intensive, not all of the Target/Cntrl Select bits may be used. Assume a 4K max input size for Control inputs. Thus, when the Global bit is 0 (Control inputs destined for an ARM), only the Select bits 16...12 can be set.
B. Application (APP) Address Format: class bits 01b: For APP class inputs, bit 25 = 0, bit 24 = 1. This address format can have the following form (RW may not be included):
[Table: APP class address format (not reproduced in this text version).]
where XXX may encode address metadata on writes but can all be 0's on reads.
It is understood that the largest transfer possible with a fixed format scheme like this can be 512K. Therefore, bits 18...12 can be 0 so that the Target Select bits are aligned on a 512K boundary. The Target Select bits can allow for a 512K byte range for every resource of the XIMM, with an additional 4 Mbytes that can be used for a large transfer.
C. DMA Address Format: class bits 10b: This format can be used for DMA to or from a XIMM. In some embodiments, control signals can indicate read/write. Other embodiments may include bit 26 to determine read/write.
[Table: DMA class address format (not reproduced in this text version).]
In embodiments in which a XIMM can be accessed over a DDR channel, a XIMM can be a slave device. Therefore, when the XIMM Arbiter has output queued up for the host or any other destination, it does not master the DDR transaction and send the data. Instead, such output data is read by the host or a DMA device. According to embodiments, a host and the XIMM/Arbiter have coordinated schedules; thus the host (or other destination) knows the rate of arrival/generation of data at a XIMM and can time its reads accordingly.
In some embodiments, in host writes to a XIMM the lower 12 bits of the address can encode additional state/information for the target (the address metadata). In some embodiments, such address metadata is not interpreted by the Arbiter but is passed through to the target. Thus, address metadata provides further information to assist in determining the context of written payloads. In some embodiments, address metadata is only employed in writes to a XIMM and not in reads. For APP class inputs written to the XIMM, the address metadata can encode a socket id, where host applications are interacting with an XKD through the socket interface. A destination XIMM can be encoded in the Target Select bits of the address, which will uniquely identify a resource (e.g., ARM core) (the XIMM Id will tell the Arbiter which computing element to send the input out on). Likewise, for Local Control inputs written to the XIMM, the address metadata can also encode the 5-bit XIMM Id.
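For example, a host-side encoder might build an APP-class write address as in the sketch below, placing the Target Select value in bits 23...19 (per the APP format above) and the socket id in the low 12 metadata bits; the function name and masks are illustrative.

    #include <stdint.h>

    #define XIMM_CLASS_APP (1u << 24)  /* class bits 01b */

    static uint64_t ximm_app_write_addr(uint64_t wr_base,
                                        unsigned target_select,
                                        unsigned socket_id)
    {
        return wr_base
             | XIMM_CLASS_APP
             | ((uint64_t)(target_select & 0x1f) << 19)  /* Target Select */
             | (socket_id & 0xfff);                      /* addr metadata */
    }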
Embodiments can include other metadata that can be communicated in reads as part of the payload. This metadata may not be part of the address and can be generated by the Arbiter. The purpose of Arbiter metadata in the XKD-Arbiter interface can be to communicate scheduling information so that the XKD can schedule reads in a timely enough manner in order to minimize the latency of XIMM processing, as well as avoiding back-pressure in the XIMMs.
Therefore, the XKD-Arbiter DDR interface can be enhanced by the following mechanisms in order to make it as efficient as possible: XKD can encode metadata in the address of DDR inputs sent to the Arbiter (discussed above); a clock synchronization and adjustment protocol can maintain a clock-synchronous domain of an XKD instance and its DDR-network of XIMMs, with all XIMMs maintaining a clock that is kept in sync with the local XKD clock; inputs XKD sends to the Arbiter can be timestamped; when any data is read from the Arbiter by the host, the Arbiter can write metadata with the data, communicating information about what data is available to read next; and control messages from XKD to an Arbiter can query its output queue(s) and other relevant state.
According to embodiments, XIMMs in a same memory domain can operate in a same clock domain.
XIMMs of a same memory domain can be those that are directly accessible by a host (e.g., an instance of an XKD driver and those XIMMs that are directly accessible via memory bus accesses). Such a common clock domain can enable the organization of scheduled accesses to keep data moving through the XIMMs. According to embodiments, an XKD driver does not have to poll for output or output metadata on its own host schedule, as XIMM operations can be synchronized for deterministic operations on data. An arbiter can communicate a time interval when data will be ready for reading, or an interval of data arrival rate, as the Arbiter and host have agreement on their clock values.
Thus, according to embodiments, each Arbiter of a XIMM can implement a clock that is kept in sync with a host device. When a host (e.g., XKD) discovers a XIMM through a startup operation (e.g., SMBIOS operation) or through a probe read, a host can seek to sync up the Arbiter clock with its own clock, so that subsequent communication is deterministic. From then on, the Arbiter will implement a simple clock synchronization protocol to maintain clock synchronization, if needed (such synchronization may not be needed, or may be needed very infrequently according to the type of clock circuits employed on the XIMM).
According to very particular embodiments, an Arbiter clock can operate with fine granularity (e.g., nanosecond granularity) for accurate timestamping. However, for operations with a host, an Arbiter can sync up with a coarser granularity (e.g., microsecond granularity). In some embodiments, a clock drift of up to one μsec can be allowed.
Clock synchronization can be implemented in any suitable way. Periodic clock values can be transmitted from one device to another (e.g., controller to XIMM or vice versa).
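A minimal sketch of such a scheme is shown below: the arbiter stores the base and rate supplied in a ClockSync input, derives its clock from a local free-running counter, and re-bases when a host timestamp shows drift beyond the allowed bound. All names, and placing the one-microsecond bound here, are illustrative assumptions.

    #include <stdint.h>

    static uint64_t clk_base_ns;  /* host clock value at last sync     */
    static uint64_t tick_base;    /* local counter value at last sync  */
    static uint64_t ns_per_tick;  /* rate supplied in ClockSync input  */

    uint64_t local_ticks(void);   /* hypothetical free-running counter */

    void arbiter_clock_sync(uint64_t host_ns, uint64_t rate_ns_per_tick)
    {
        clk_base_ns = host_ns;
        tick_base   = local_ticks();
        ns_per_tick = rate_ns_per_tick;
    }

    uint64_t arbiter_now_ns(void)
    {
        return clk_base_ns + (local_ticks() - tick_base) * ns_per_tick;
    }

    /* Called for each timestamped input; re-sync on excessive drift. */
    void arbiter_check_timestamp(uint64_t host_ts_ns)
    {
        uint64_t now = arbiter_now_ns();
        uint64_t drift = (now > host_ts_ns) ? now - host_ts_ns
                                            : host_ts_ns - now;
        if (drift > 1000)  /* allowed drift: one microsecond */
            arbiter_clock_sync(host_ts_ns, ns_per_tick);
    }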
FIGS. 16A to 16C show a clock synchronization method according to one very particular embodiment. This method should not be construed as limiting. Referring to FIG. 16A, an XKD 1614 can discover a XIMM 1602 through a system management BIOS operation or through a probe read (see the section on Control inputs).
Referring to FIG. 16B, an XKD 1614 can send to the arbiter 1608 of the XIMM 1602 a Global Control type ClockSync input, which will supply a base clock and the rate that the clock is running at (frequency). Arbiter 1608 can use the clock base it receives in the Control ClockSync input and can start its clock 1670.
Referring to FIG. 16C, for certain inputs (e.g., Global Control inputs) an XKD 1614 can send to the arbiter 1608 a clock timestamp. Such a timestamp can be encoded into address data. A timestamp can be included in every input to a XIMM 1602 or periodically sent. According to embodiments, a timestamp can be taken as late as possible by an XKD 1614, in order to reduce scheduler-induced jitter on the host. For every timestamp received, an arbiter can check its clock and make adjustments.
According to some embodiments, whenever an arbiter responds to a read request from the host, where the read is not a DMA read, an arbiter can include the following metadata: (1) a timestamp of the input when it arrived in the Arbiter FIFO; (2) output queued up from a XIMM, along with length (i.e., source, destination, length). The arbiter metadata can be modified to accommodate a bulk interface. This interface can accommodate up to a maximum number of inputs, with source and length for each input queued. The purpose of this extension is to allow bulk reads of arbiter output and subsequent queuing in memory (e.g., RAM) of XIMM output so that the number of XKD transactions can be reduced.
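The two metadata items might be laid out as in the following sketch; the field widths are assumptions, and the bulk extension would simply carry an array of the output descriptors with a count.

    #include <stdint.h>

    struct arbiter_output_desc {
        uint16_t source;       /* XIMM resource that produced the output */
        uint16_t destination;  /* intended consumer of the output        */
        uint32_t length;       /* bytes queued                           */
    };

    /* Metadata returned with every non-DMA read response. */
    struct arbiter_read_meta {
        uint64_t arrival_ts;              /* (1) input timestamp at the FIFO */
        struct arbiter_output_desc next;  /* (2) output queued, with length  */
    };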
According to some embodiments, a system can include various control messages from an XKD to an arbiter of a XIMM. The control messages listed here can be a subset of the control messages that a host can send to an arbiter according to very particular embodiments. The control messages described here can assist the synchronization between the XKD and the Arbiter and therefore can all be Global Control inputs. All Control inputs, including the input formats for those listed here, are discussed in the next section.
Probe read: These reads are used for XIMM discovery. The Arbiter returns the data synchronously for the reads. The data returned is constant and identifies the memory device as a XIMM. The response can be 64 bytes and includes XIMM model number, XIMM version, operating system (e.g., Linux version running on ARM cores), and other configuration data.
Output snapshot: This read is to get a snapshot of the Arbiter output queues and the lengths of each, along with any state that is of interest for a queue. Since these reads are for the Arbiter, the global bit will be set, as well as bit 21 of the address.
Clock sync: used to set the clock base for the Arbiter clock. There will be a clock value in the data (e.g., 64 bit), and the rest of the input can be padded with 0's. The global bit 23 will be set, along with bit 20. Note that the host may also send a ClockSync input to the Arbiter if a read from a XIMM shows the Arbiter clock to be too far out of sync.
Embodiments herein have described XIMM address classes and formats used in communication with a XIMM. While some semantics are encoded in the address, for some transactions it may not be possible to encode all semantics, nor to include parity on all inputs, or to encode a timestamp, etc. This section discusses the input formats that can be used at the beginning of the data that is sent along with the address of the input. The discussion deals only with Control and APP class inputs. Since these are DDR inputs, there can be data encoded in the address, and the input header can be sent at the head of the data according to the formats specified.
Global Control Inputs:
Address:
• Class = Control = 00b, bit 23 = 1
• Bits 22..12: Control select or 0 (Control select in address might be redundant since Control values are set in the input header)
• Address metadata: all 0
Data:
• Decode = GLOBAL_CNTL
• Control values:
• Reads: Probe, Get Monitor, Output Probe
• Writes: Clock Sync, Set Large Transfer Window Destination, Set Xockets Mapping, Set Monitor
The input format can differ for reads and writes. Note that in the embodiment shown, header decode is constant and set to GLOBAL_CNTL, because the address bits for Control Select specify the input type. In other embodiments, a format can differ if the number of global input types exceeds the number of Control Select bits.
Reads: Data can be returned synchronously for Probe Reads and identifies the memory device as a XIMM.
Address: (XIMM base addr) + (class bits = 00b) + (bit 23 = 1) + (Control select = bit setting for XIMM_PROBE)
[Table: XIMM_PROBE response format (not reproduced in this text version).]
This next input is the response to an OUTPUT_PROBE:
Address: (XIMM base addr) + (class bits = 00b) + (bit 23 = 1) + (Control select = bit setting for APP_SCHEDULING) Note: this format assumes output from a single source. It will be modified to accommodate bulk reads, so that one read can absorb multiple inputs, requiring buffering by XKD.
[Table: APP_SCHEDULING response format (not reproduced in this text version).]
Writes: The following is the CLOCK_SYNC input, sent by the host when it first identifies a XIMM or when it deems the XIMM as being too out of sync with the host.
Address: (XIMM base addr) + (class bits = 00b) + (bit 23 = 1) + (Control select = bit setting for CLOCK_SYNC)
[Table: CLOCK_SYNC input format (not reproduced in this text version).]
This next input can be issued after the XIMM has indicated in its metadata that no output is queued up. When the XKD driver encounters that, it must start polling the Arbiter for output (in some embodiments, this can be at predetermined intervals). Address: (XIMM base addr) + (class bits = 00b) + (bit 23 = 1) + (Control select = bit setting for OUTPUT_PROBE)
[Table: OUTPUT_PROBE input format (not reproduced in this text version).]
The following input can be sent by an XKD to associate a Xocket ID with a compute element of a XIMM (e.g., an ARM core). The Xocket ID from then on can be used in Target Select bits of the address for Local Control or APP inputs. Address: (XIMM base addr) + (class bits = 00b) + (bit 23 = 1) + (Control select = bit setting for SET_XOCKET_MAPPING)
[Table: SET_XOCKET_MAPPING input format (not reproduced in this text version).]
The following input can be used to set the Large Xfer Window mapping. It is presumed that no acknowledgement is required: once this input is sent, the next input using the Large Xfer Window should go to the latest destination. Address: (XIMM base addr) + (class bits = 00b) + (bit 23 = 1) + (Control select = bit setting for SET_LARGE_XFER_WNDW)
[Table: SET_LARGE_XFER_WNDW input format (not reproduced in this text version).]
Local Control Inputs:
Address:
• Class = Control = 00b, bit 23 = 0
• Bits 22..12: Target select (Destination Xocket ID or ARM ID)
• Address metadata = Xocket Id (writes only)
Data:
• Decode = CNTL_TYPE
• Control values:
o Can specify an executable to load, download information, etc. These Control values can help to specify the environment or operation of XIMM resources (e.g., ARM cores). Note that, unlike Global Control, the input header must be included for the parsing and handling of the input: the address cannot specify the control type, since only the Arbiter sees the address.

Parity (8 bits): parity calculated off of this control message.
Decode (8 bits): a single byte distinguishing the control message type.
Control (16 bits): the control action to take, specific to the control channel and decode.
Monitoring (16 bits): local state that the host would like presented after the action (async).
Application Inputs
Address:
• Class = APP = 01b
• Bits 23..19: Target select (Destination Xocket ID or ARM ID)
• Address metadata = Socket Number or Application ID running on the ARM core associated with the Xocket ID in the Target select bits of the address (writes only)
Data:
Writes: Input format for writes to a socket/application on a computing resource (e.g., ARM core). Note that for these types of writes, all writes to the same socket or to the same physical address can be of this message until M/8 bytes of the payload are received, and the remaining bytes to a 64B boundary are zero-filled. If a parity or a zero-fill error is indicated, errors can be posted in the monitoring status (see Reads). That is, writes may be interleaved if the different writes are targeting different destinations within the XIMM. The host drivers can make sure that there is only one write at a time targeting a given computing resource.
[Tables: APP class write input formats (not reproduced in this text version).]
App - Gather RX
Class = APP
Decode = GATHER_RX
[Table: GATHER_RX input format (not reproduced in this text version).]
It is understood that the particular command formats described herein are provided by way of example. It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It is also understood that the embodiments of the invention may be practiced in the absence of an element and/or step not specifically disclosed. That is, an inventive feature of the invention may be elimination of an element.
Accordingly, while the various aspects of the particular embodiments set forth herein have been described in detail, the present invention could be subject to various changes, substitutions, and alterations without departing from the spirit and scope of the invention.

Claims

IN THE CLAIMS
1. A system, comprising:
at least one computing module comprising
a physical interface for connection to a memory bus,
a processing section configured to decode at least a predetermined range of physical address signals received over the memory bus into computing instructions for the computing module, and
at least one computing element configured to execute the computing instructions.
2. The system of claim 1, further including:
a controller attached to the memory bus and configured to generate the physical address signals with corresponding control signals.
3. The system of claim 2, wherein:
the control signals indicate at least a read and write operation.
4. The system of claim 2, wherein:
the control signals include at least a row address strobe (RAS) signal and address signals compatible with dynamic random access memory (DRAM) devices.
5. The system of claim 1, further including:
the controller includes a processor and a memory controller coupled to the processor and the memory bus.
6. The system of claim 1, further including:
a processor coupled to a system bus and a memory controller coupled to the processor and the memory bus; wherein
the controller includes a device coupled to the system bus different from the processor.
7. The system of claim 1, further including:
the processing section is configured to decode a set of read physical addresses and a set of write physical addresses for the same computing module, the read physical addresses being different than the write physical addresses.
8. The system of claim 7, wherein:
the read physical addresses are different than the write physical addresses.
9. A system, comprising:
at least one computing module comprising a physical interface for connection to a memory bus,
a processing section configured to decode at least a predetermined range of physical address signals received over the memory bus into computing instructions for the computing module, and
at least one computing element configured to execute the computing instructions; and
a controller attached to the memory bus and configured to generate the physical address signals with corresponding control signals.
10. The system of claim 9, wherein:
the at least one computing module includes a plurality of computing modules; and
the controller is configured to generate physical addresses for an address space, the address space including different portions corresponding to operations in each computing module.
11. The system of claim 10, wherein:
the address space is divided into pages, and
the different portions each include an integer number of pages.
12. The system of claim 9, wherein:
the processing section is configured to determine a computing resource from a first portion of a received physical address and an identification of a device requesting the computing operation from a second portion of the received physical address.
13. The system of claim 9, wherein:
the controller includes at least a processor and another device, the processor being configured to enable direct memory access (DMA) transfers between the other device and the at least one computing module.
14. The system of claim 9, wherein:
the controller includes a processor, a cache memory and a cache controller; wherein
at least read physical addresses corresponding to the at least one computing module are uncached addresses.
15. The system of claim 9, wherein:
the controller includes a request encoder configured to encode computing requests for the computing module into physical addresses for transmission over the memory bus.
17. A method, comprising:
receiving at least physical address values on a memory bus at a computing module attached to the memory bus;
decoding computing requests from at least the physical address values in the computing module; and
performing the computing requests with computing elements in the computing module.
18. The method of claim 17, wherein:
receiving at least physical address values on a memory bus further includes receiving at least one control signal to indicate at least a read or write operation.
19. The method of claim 17, further including:
determining a type of computing request from a first portion of the physical address and determining a requesting device identification from a second portion of the physical address.
20. The method of claim 17, further including:
encoding computing requests for the computing module into physical addresses for transmission over the memory bus.
21. The method of claim 17, further including:
synchronizing a module clock on the at least one computing module with a host clock in a host device coupled to the memory bus;
determining a time stamp for received computing requests in the computing module; and
generating a timestamp using the module clock for request results output from the computing module.
22. The method of claim 17, further including:
generating a list of data transfer operations within the computing module;
reading the list from the computing module; and
performing the data transfer operations by operation of a host device.
23. The method of claim 22, wherein:
the list of data transfer operations includes a scatter/gather list for a direct memory access or equivalent operation executed by a device different than the computing module.
PCT/US2015/023730 2012-05-22 2015-03-31 Interface, interface methods, and systems for operating memory bus attached computing elements WO2015153693A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US15/283,287 US20170109299A1 (en) 2014-03-31 2016-09-30 Network computing elements, memory interfaces and network connections to such elements, and related systems
US15/396,318 US20170237672A1 (en) 2012-05-22 2016-12-30 Network server systems, architectures, components and related methods
US16/129,762 US11082350B2 (en) 2012-05-22 2018-09-12 Network server systems, architectures, components and related methods
US18/085,196 US20230231811A1 (en) 2012-05-22 2022-12-20 Systems, devices and methods with offload processing devices

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201461973205P 2014-03-31 2014-03-31
US61/973,205 2014-03-31

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/023746 Continuation WO2015153699A1 (en) 2012-05-22 2015-03-31 Computing systems, elements and methods for processing unstructured data

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/283,287 Continuation US20170109299A1 (en) 2012-05-22 2016-09-30 Network computing elements, memory interfaces and network connections to such elements, and related systems

Publications (1)

Publication Number Publication Date
WO2015153693A1

Family

ID=54241239

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/023730 WO2015153693A1 (en) 2012-05-22 2015-03-31 Interface, interface methods, and systems for operating memory bus attached computing elements

Country Status (1)

Country Link
WO (1) WO2015153693A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5343438A (en) * 1992-01-31 1994-08-30 Samsung Electronics Co., Ltd. Semiconductor memory device having a plurality of row address strobe signals
US5410654A (en) * 1991-07-22 1995-04-25 International Business Machines Corporation Interface with address decoder for selectively generating first and second address and control signals respectively in response to received address and control signals
US20050086451A1 (en) * 1999-01-28 2005-04-21 Ati International Srl Table look-up for control of instruction execution
US20070192563A1 (en) * 2006-02-09 2007-08-16 Rajan Suresh N System and method for translating an address associated with a command communicated between a system and memory circuits
US20100117880A1 (en) * 2007-03-22 2010-05-13 Moore Charles H Variable sized aperture window of an analog-to-digital converter



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 15772579; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 15772579; Country of ref document: EP; Kind code of ref document: A1)