WO2022212145A1 - System and method for coalesced multicast data transfers over memory interfaces - Google Patents
- Publication number
- WO2022212145A1 (PCT/US2022/021529)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- memory
- multicast
- submodules
- processor
- coalesced
Classifications
- G06F3/064—Management of blocks
- G06F12/0607—Interleaved addressing
- G06F3/0604—Improving or facilitating administration, e.g. storage management
- G06F3/0644—Management of space entities, e.g. partitions, extents, pools
- G06F3/0659—Command handling arrangements, e.g. command buffers, queues, command scheduling
- G06F3/0679—Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]
- G06F2212/1016—Performance improvement
Definitions
- Cache memories are typically organized into cache lines in which information read from main memory or information to be written to the main memory is stored.
- Memory interfaces are designed to read or write an entire cache line of information at a time, even if only a portion of the cache line is needed or only a portion needs to be updated in the main memory.
- FIG. 1 is a logical representation of a prior-art system implementing a wide access from a single memory bank.
- A memory module 100 has a host interface 102 that couples with a memory interface 106 of a host processor 104 (also referred to as a main processor, which orchestrates the computation performed in the load and store operations of the memory module 100) via a memory bus or memory channel 108 of a specific width, in this case 256 bits or 32 bytes.
- the memory module 100 can only support a single coarse granularity of access (that is, 32 bytes) for any load or store operation.
- the memory module 100 includes a plurality of memory banks or submodules 112, 114, 116, and 118 (which in this example has 16 submodules) each operatively coupled with the host interface 102 via another memory channel 110 (which in this example has a channel width of 256 bits).
- Each memory submodule may include a memory address register (MAR) to store the address of the memory location that is being accessed by the load or store operation, as well as a memory data register (MDR) to store the data on which the operation is being performed.
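The MAR/MDR arrangement described above can be modeled behaviorally. This is an illustrative sketch, not the patented hardware; the class and method names are assumptions chosen for clarity.

```python
# Behavioral model of one memory submodule with its memory address
# register (MAR) and memory data register (MDR).

class MemorySubmodule:
    def __init__(self, size_words):
        self.cells = [0] * size_words  # backing storage, one word per cell
        self.mar = 0                   # address of the location being accessed
        self.mdr = 0                   # data on which the operation is performed

    def load(self, address):
        self.mar = address
        self.mdr = self.cells[self.mar]   # data latched into the MDR
        return self.mdr

    def store(self, address, word):
        self.mar = address
        self.mdr = word
        self.cells[self.mar] = self.mdr   # data written from the MDR

sub = MemorySubmodule(size_words=64)
sub.store(5, 0xBEEF)
assert sub.load(5) == 0xBEEF
```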
- The host processor 104 issues a load or store request when it needs to load data from, or store data into, certain memory submodules.
- In a data transfer between the host processor 104 and the memory module 100, such as the host processor 104 requesting certain bits of data from a specified submodule (in this example, the submodule 114), a wide access is performed and an entire cache line of contiguous data, which includes the requested bits, is transferred from the single specified submodule to the host interface 102.
- Some conventional computing systems use software to statically identify such short accesses with low cache reuse potential in an attempt to address the inefficiency of sub-cache-line memory accesses.
- Some of the approaches involve sub-ranking, which groups the multiple memory chips of a dual in-line memory module (DIMM), or RAM stick, into subsets to allow each subset to provide a data chunk smaller than the size of the original transfers.
- sub-ranking requires separate commands for each sub-rank, resulting in reduced performance due to higher demands on a shared command bus, or increased hardware cost due to the dedicated command path required for each sub-rank.
- Sub-ranking is also impractical when each data access is provided by a single memory module.
- Some approaches involve reducing the height and/or width of DRAM banks, but such approaches involve changing the core DRAM array design and incur high overheads.
- FIG. 1 is a prior art block diagram illustrating one example of a memory module and a host processor of a system in which an entire cache line of contiguous data is transferred therebetween upon the host processor issuing a load or store request;
- FIG. 2 is a block diagram illustrating one example of a memory module and a host processor of a system configured to perform multicast memory coalesce operations in accordance with an embodiment set forth in the disclosure;
- FIG. 3 is a block diagram illustrating one example of a multicast coalesced block data that is formed as a result of the multicast memory coalesce operations carried out by the system shown in FIG. 2;
- FIG. 4 is a diagram illustrating one example of the system according to FIG. 2 with a plurality of memory modules coupled with the host processor;
- FIG. 5 is a flowchart illustrating one example of a method for performing a multicast memory coalesce operation in accordance with one example set forth in the disclosure;
- FIG. 6 is a flowchart illustrating one example of a method for switching between facilitating multicast memory coalesce operation and facilitating contiguous block data transfers in accordance with one example set forth in the disclosure;
- FIG. 7 is a flowchart illustrating one example of a method for loading data from the memory submodules in accordance with one example set forth in the disclosure;
- FIG. 8 is a flowchart illustrating one example of a method for storing data in the memory submodules in accordance with one example set forth in the disclosure;
- FIG. 9 is a diagram illustrating one example of a system in accordance with one example set forth in the disclosure.
- FIG. 10 is a diagram illustrating one example of a memory module in accordance with one example set forth in the disclosure.
- FIG. 11 is a diagram illustrating one example of a system in accordance with one example set forth in the disclosure.
- FIG. 12 is a diagram illustrating one example of a system in accordance with one example set forth in the disclosure.
- FIG. 13 is a diagram illustrating one example of a system in accordance with one example set forth in the disclosure.
- FIG. 14 is a diagram illustrating one example of a system in accordance with one example set forth in the disclosure.
- systems and methods help reduce the data transfer overhead and facilitate fine-grained data transfers by coalescing or aggregating short data words from a plurality of disparate memory submodules and transferring or communicating the coalesced data, referred to herein as multicast coalesced block data, over the memory channel simultaneously in a single block data transfer.
- a short data word is returned or loaded from each of a collection of partitioned memory submodules to a host processor at a unique position within the single block data transfer.
- a short data word is written or stored into each of a collection of partitioned memory submodules from a host processor at a unique position within the single block data transfer.
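The two directions described above (coalesced load and coalesced store) amount to packing one short word per submodule into a single block and unpacking it again. The sketch below is a behavioral illustration only; the 16-submodule, 16-bit-word, 256-bit-channel sizes mirror the example used in this disclosure, and the function names are assumptions.

```python
# Coalesce one short data word from each partitioned submodule into a
# single block transfer, and extract them on the receiving side.

WORD_BITS = 16
NUM_SUBMODULES = 16
CHANNEL_BITS = WORD_BITS * NUM_SUBMODULES  # 256-bit memory channel

def coalesce(words):
    """Pack one short word per submodule into a single block integer."""
    assert len(words) == NUM_SUBMODULES
    block = 0
    for i, w in enumerate(words):
        assert 0 <= w < (1 << WORD_BITS)
        block |= w << (i * WORD_BITS)  # submodule i owns a unique position
    return block

def extract(block):
    """Recover the per-submodule short words from the block."""
    mask = (1 << WORD_BITS) - 1
    return [(block >> (i * WORD_BITS)) & mask for i in range(NUM_SUBMODULES)]

words = list(range(NUM_SUBMODULES))       # one word per submodule
assert extract(coalesce(words)) == words  # round trip is lossless
```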
- The memory submodules have register(s) associated therewith, where the register(s) store the short data word from the corresponding submodule until a multicast memory coalesce operation is performed, or store the short data words that are extracted from the multicast coalesced block data received from the processor.
- the register(s) may be implemented as part of near-memory or in-memory processing technologies.
- A method for controlling digital data transfer via a memory channel between a memory module and a processor, carried out by at least one of the memory module or the processor, coalesces a plurality of short data words into multicast coalesced block data comprising a single data block for transfer via the memory channel.
- Each of the plurality of short data words pertains to one of at least two partitioned memory submodules in the memory module.
- the multicast coalesced block data is communicated over the memory channel.
- the method includes the processor detecting a condition indicative of potential for short data word coalescing and, responsive to the detected condition, switching between a first mode facilitating the transfer of the multicast coalesced block data and a second mode facilitating a contiguous block data transfer between the processor and one of the memory submodules.
- the method includes the processor sending a coalesced load command to the memory module to cause the memory module to retrieve the short data words from the memory submodules and perform a multicast memory coalesce operation to coalesce the short data words into the multicast coalesced block data and, responsive to receiving the multicast coalesced block data, extracting each of the short data words from the multicast coalesced block data.
- the method includes the processor sending one or more location identifiers to the memory module identifying a plurality of locations associated with the short data words within the memory submodules to cause the memory module to retrieve the short data words from the identified locations within the memory submodules.
- the method includes the processor sending one location identifier such that the memory module retrieves the short data words from the same location (e.g., address offset or near-memory register ID) within multiple memory submodules and coalesces the retrieved short data words into a single data block as the multicast coalesced block data.
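The single-location-identifier load just described can be sketched behaviorally: one shared offset is applied to every submodule, and each submodule's word lands in its unique slot. This is an illustration under assumed sizes and names, not the disclosed hardware.

```python
# Coalesced load with ONE location identifier: every submodule returns
# the word at the same address offset, packed into one block.

WORD_BITS = 16

def coalesced_load(submodules, offset):
    """Read the word at `offset` in each submodule and coalesce them."""
    block = 0
    for i, sub in enumerate(submodules):
        word = sub[offset] & ((1 << WORD_BITS) - 1)
        block |= word << (i * WORD_BITS)  # unique slot per submodule
    return block

# 4 submodules, each modeled as a simple list of 16-bit words
subs = [[i * 10 + j for j in range(8)] for i in range(4)]
blk = coalesced_load(subs, offset=3)
# submodule i contributed word i*10+3 at bit position i*16
assert [(blk >> (i * 16)) & 0xFFFF for i in range(4)] == [3, 13, 23, 33]
```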
- the plurality of locations are associated with or identified by a plurality of location identifiers.
- the method includes the processor configuring at least one register associated with each of the memory submodules to store the short data words accessible by the processor to be coalesced into the multicast coalesced block data.
- the method includes the processor generating the multicast coalesced block data to be transferred to the memory module and sending a coalesced store command to the memory module.
- the coalesced store command causes the memory module to perform a multicast memory extract operation to extract the short data words from the multicast coalesced block data, distribute the short data words to the memory submodules, and store the short data words within the memory submodules.
- the method includes the processor sending one or more location identifiers to the memory module identifying a plurality of locations associated with the short data words within the memory submodules to cause the memory module to store the short data words at the identified locations within the memory submodules.
- the method includes the processor sending one location identifier such that the memory module stores the short data words in the coalesced data block to the same location (address offset or near-memory register ID) within multiple memory submodules.
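The coalesced-store counterpart of the single-location-identifier case can be sketched the same way: the module extracts one short word per submodule from the incoming block and writes each word to the same offset in its submodule. Sizes and names below are assumptions for illustration.

```python
# Coalesced store with ONE location identifier: extract each submodule's
# word from the block and write it at the same offset in every submodule.

WORD_BITS = 16

def coalesced_store(submodules, offset, block):
    mask = (1 << WORD_BITS) - 1
    for i, sub in enumerate(submodules):
        sub[offset] = (block >> (i * WORD_BITS)) & mask  # distribute word i

subs = [[0] * 8 for _ in range(4)]
# block carrying words 0x1111, 0x2222, 0x3333, 0x4444 for submodules 0..3
block = 0x4444_3333_2222_1111
coalesced_store(subs, offset=2, block=block)
assert [sub[2] for sub in subs] == [0x1111, 0x2222, 0x3333, 0x4444]
```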
- the method includes the processor generating the multicast coalesced block data to be transferred to the memory module and configuring at least one register associated with the memory submodules to cause the memory module to extract the short data words from the multicast coalesced block data and store the short data words in the at least one register.
- the method includes the processor generating the multicast coalesced block data to be transferred to the memory module and configuring the memory module to cause the memory module to determine one or more location identifiers identifying a plurality of locations associated with the short data words within the memory submodules based on the multicast coalesced block data supplied by the processor or information stored in the memory module.
- a processor includes a memory interface which communicates with a memory module via a memory channel and multicast coalesce logic.
- the multicast coalesce logic performs data transfer between the processor and the memory module via the memory channel by causing a plurality of short data words to be coalesced into multicast coalesced block data comprising a single data block prior to the transfer, and communicates the multicast coalesced block data over the memory channel.
- Each of the plurality of short data words pertains to one of at least two partitioned memory submodules.
- the multicast coalesce logic configures a memory controller associated with the memory submodules to cause the memory controller to detect a condition to switch between a first mode facilitating the transfer of the multicast coalesced block data and a second mode facilitating a contiguous block data transfer between the processor and one of the memory submodules, the condition indicative of potential for short data word coalescing.
- the multicast coalesce logic sends a coalesced load command to the memory module to cause the memory module to retrieve the short data words from the memory submodules and perform a multicast memory coalesce operation to coalesce the short data words into the multicast coalesced block data and, responsive to receiving the multicast coalesced block data, extracts each of the short data words from the multicast coalesced block data.
- the multicast coalesce logic also sends one or more location identifiers to the memory module identifying a plurality of locations associated with the short data words within the memory submodules to cause the memory module to retrieve the short data words from the identified locations within the memory submodules.
- the multicast coalesce logic configures at least one register associated with the memory submodules to cause the memory module to read the short data words from the memory submodules and perform a multicast memory coalesce operation to coalesce the short data words from the register(s) into the multicast coalesced block data.
- the register(s) may be per-submodule near-memory register(s) or per- submodule in-memory register(s), and in performing the multicast memory coalesce operation, the short data words stored in the register(s) from a prior in-memory or near memory operation are retrieved and coalesced into a single data block.
- the multicast coalesce logic generates the multicast coalesced block data to be transferred to the memory module and sends a coalesced store command to the memory module to cause the memory module to perform a multicast memory extract operation to extract the short data words from the multicast coalesced block data, distribute the short data words to the memory submodules, and store the short data words within the memory submodules.
- the multicast coalesce logic also sends one or more location identifiers to the memory module identifying a plurality of locations associated with the short data words within the memory submodules to cause the memory module to store the short data words at the identified locations within the memory submodules.
- the multicast coalesce logic generates the multicast coalesced block data to be transferred to the memory module and configures at least one register associated with the memory submodules to cause the memory module to extract the short data words from the multicast coalesced block data and store the short data words in the register(s).
- the register(s) may be per-submodule near-memory register(s) or per-submodule in-memory register(s).
- a memory module includes a processor interface which communicates with a processor via a memory channel, a plurality of partitioned memory submodules, and multicast coalesce logic.
- the multicast coalesce logic performs data transfer between the processor and the memory module via the memory channel by causing a plurality of short data words to be coalesced into coalesced block data comprising a single data block prior to the transfer, and communicating the multicast coalesced block data over the memory channel.
- Each of the plurality of short data words pertains to one of the memory submodules.
- the memory module includes a mode selection component, including but not limited to tri-state gates or a multiplexer, that switches between a first mode facilitating the transfer of the multicast coalesced data and a second mode facilitating a contiguous block data transfer between the processor and one of the memory submodules.
- the memory module also includes a memory controller associated with the memory submodules, the memory controller which detects a condition indicative of potential for short data word coalescing to switch between the first mode and the second mode.
- the memory module includes at least one near-memory or in memory processing logic which determines one or more location identifiers identifying a plurality of locations associated with the short data words within the memory submodules based on the multicast coalesced block data supplied by the processor or information stored in the memory module.
- the multicast coalesce logic of the memory module includes a plurality of shifter logic components coupled with the memory submodules. Each of the shifter logic components shifts a position of the short data word based on an address offset for the memory submodule. The short data words from the shifter logic components are concatenated to form the multicast coalesced block data.
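The shifter-and-concatenate step described above can be modeled simply: each shifter moves its submodule's word to that submodule's unique bit position, and because the positions are disjoint, OR-combining the shifted values is equivalent to concatenation. This is a behavioral sketch, not the hardware implementation.

```python
# Shifter logic components: place each submodule's short word at its
# unique position, then concatenate into the multicast coalesced block.

WORD_BITS = 16

def shifter(word, submodule_index):
    """One shifter logic component: shift `word` to its unique position."""
    return (word & ((1 << WORD_BITS) - 1)) << (submodule_index * WORD_BITS)

def concatenate(words):
    block = 0
    for i, w in enumerate(words):
        block |= shifter(w, i)  # disjoint bit ranges, so OR == concatenation
    return block

assert concatenate([0xAAAA, 0xBBBB]) == 0xBBBB_AAAA
```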
- The multicast coalesce logic, in response to receiving a coalesced load command from the processor, retrieves the short data words from the memory submodules and performs a multicast memory coalesce operation by shifting the positions of the short data words using the shifter logic components and coalescing the short data words into the multicast coalesced block data.
- the multicast coalesce logic receives one or more location identifiers from the processor identifying a plurality of locations associated with the short data words within the memory submodules such that the short data words are retrieved from the identified locations within the memory submodules.
- the multicast coalesce logic further includes a plurality of selector logic components.
- Each of the selector logic components selects the short data word from a portion of data retrieved from one of the memory submodules based on the address offset for the memory submodule.
- the short data words from the selector logic components are concatenated to form the multicast coalesced block data.
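The selector component can be sketched as well: a submodule's internal read may return a wider row, and the selector picks the short word out of that row using the low bits of the address offset. The 64-bit row width and the offset encoding below are assumptions for illustration.

```python
# Selector logic component: pick one 16-bit short word out of a wider
# row retrieved from a submodule, based on the address offset.

WORD_BITS = 16
ROW_BITS = 64
WORDS_PER_ROW = ROW_BITS // WORD_BITS  # 4 short words per retrieved row

def select_word(row, address_offset):
    """Select one short word from a retrieved row by word offset."""
    word_index = address_offset % WORDS_PER_ROW
    return (row >> (word_index * WORD_BITS)) & ((1 << WORD_BITS) - 1)

row = 0xDDDD_CCCC_BBBB_AAAA
assert select_word(row, 0) == 0xAAAA
assert select_word(row, 2) == 0xCCCC
```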
- the memory module includes at least one register associated with the memory submodules.
- the register stores a short data word associated with the corresponding memory submodule until a multicast memory coalesce operation is performed to coalesce the short data words from the register into the multicast coalesced block data.
- each of the short data words may have been previously loaded from the corresponding memory submodule, or it may have been computed by in-memory or near memory processing logic based on the data stored in the submodule.
- the multicast coalesce logic of the memory module includes a plurality of subset distribute logic components coupled with the memory submodules. Each of the subset distribute logic components extracts one of the short data words from the multicast coalesced block data and distributes the extracted short data word to one of the memory submodules.
- The multicast coalesce logic of the memory module, in response to receiving a coalesced store command from the processor, performs a multicast memory extract operation to store within the memory submodules the short data words distributed to the memory submodules by the subset distribute logic components.
- the multicast coalesce logic further receives one or more location identifiers from the processor identifying a plurality of locations associated with the short data words within the memory submodules such that the short data words are stored at the identified locations within the memory submodules.
- the memory module includes at least one register associated with the memory submodules. The register stores the short data words extracted from the multicast coalesced block data that is received from the processor.
- subsequent memory command(s) may perform a memory operation using the short data words in the registers and/or the short data words in the associated memory submodules.
- the multicast coalesce logic performs a multicast memory extract operation to extract and distribute the short data words from the register to the memory submodules and store the short data words within the memory submodules.
- a system for controlling digital data transfer includes a processor, a memory module including a plurality of partitioned memory submodules, a memory channel between the processor and the memory module, and a multicast coalesce logic.
- the multicast coalesce logic performs data transfer between the processor and the memory module via the memory channel by causing a plurality of short data words to be coalesced into multicast coalesced block data comprising a single data block prior to the transfer, and communicating the multicast coalesced block data over the memory channel.
- Each of the plurality of short data words pertains to one of the memory submodules.
- the system further includes a mode selection component that switches between a first mode facilitating the transfer of the multicast coalesced data and a second mode facilitating a contiguous block data transfer between the processor and one of the memory submodules, and a memory controller associated with the memory submodules.
- the memory controller detects a condition indicative of potential for short data word coalescing and controls the mode selection component to switch between the first mode and the second mode based on the detected condition.
- FIG. 2 illustrates a logical representation of one example of a system, or more specifically a computing system such as a portion of a hardware server, smartphone, wearable, printer, laptop, desktop, or any other suitable computing device that utilizes data transfers between a memory module 100 and a host processor 104.
- a single memory module and a single memory channel are shown for simplicity, but it is to be understood that the disclosure also applies to systems with multiple memory modules and channels.
- the memory module 100 may be the main memory of the host processor 104, which may be a central processing unit (CPU), graphics processing unit (GPU), accelerated processing unit (APU), application specific integrated circuit (ASIC), or any other suitable integrated circuit that performs computations and issues memory requests to the memory module 100.
- CPU central processing unit
- GPU graphics processing unit
- APU accelerated processing unit
- ASIC application specific integrated circuit
- Cache storage may also exist on the host processor, the memory module, or both. It is to be understood that any number of memory modules may be coupled with the host processor 104, as shown in FIG. 4, and the system may also implement a plurality of host processors 104 operatively coupled together.
- the memory submodules may also be referred to herein as memory banks.
- the memory submodules 112, 114, 116, and 118 are disparate and operatively coupled with the host interface 102 via a plurality of memory channels or data links 200, 202, 204, and 206 for short data word transfer.
- there may be 16 short data word links where each short data word link facilitates short data word transfer of 16 bits, for example, and each memory link operates independently from other links.
- the links may each occupy a subsection of a shared memory channel such as a shared memory data bus 207.
- the data links 200, 202, 204, and 206 have smaller channel width than the memory channel 108 such that the total combined width of all the data links 200, 202, 204, and 206 equals the width of the memory channel 108.
- the memory module 100 includes 16 short word links, each having a width of 16 bits, totaling 256 bits, which equals the width of the memory channel 108. It is to be understood that other links may exist to couple the memory submodules with the host interface 102. For example, each of the submodules may also be operatively coupled with the host interface 102 via the memory channel 110 as shown in FIG. 1, in which case the memory channel 110, which has a greater channel width than any of the data links 200, 202, 204, and 206, is used for contiguous block data transfer instead of a multicast coalesced block data transfer, the details of which are further disclosed herein.
- FIG. 3 illustrates one example of multicast coalesced block data 300 that is to be transferred via the memory channel 108 between the memory module 100 and the host processor 104.
- the block data 300 includes data segments 302, 304, 306, and 308 such that each data segment pertains to or is associated with a separate memory submodule. That is, in the case of a load command, each data segment is a short data word retrieved from one of the memory submodules in one-to-one correlation, i.e., each memory submodule may contribute no more than one short data word to the block data 300.
- each data segment is a short data word that is to be stored in a location within one of the memory submodules in one-to-one correlation, i.e., each memory submodule may receive no more than one short data word from the block data 300 to be stored therein.
- Each short data word may have any suitable word size, for example 8 bits, 16 bits, 32 bits, 64 bits, etc., that is smaller than the channel width and the cache line width, depending on the number of memory submodules in the memory module.
- the block data 300 to be transferred includes 16 data segments, each segment having 16 bits, and each segment is assigned to no more than one of the memory submodules.
- the data segment 302 may be assigned to the memory submodule 112, the data segment 304 to the memory submodule 114, the data segment 306 to the memory submodule 116, and the data segment 308 to the memory submodule 118, and so on.
- a coalesced load (CLD) command facilitates each memory submodule (which in some examples may include the PIM unit) to return 16 bits of data, calculated by dividing 256 bits (the data transfer width of the memory interface of a single channel) by 16 memory submodules.
- data from the memory submodule 112 may occupy bits 0 to 15 of the block data 300
- data from the memory submodule 114 may occupy bits 16 to 31 of the block data 300
- data from the memory submodule 116 may occupy bits 32 to 47 of the block data 300
- data from the memory submodule 118 may occupy bits 240 to 255 of the block data 300.
- the block data 300 is then transferred or communicated over the memory channel 108 in a single block data transfer between the memory module 100 and the host processor 104.
- the block data 300 is also referred to as multicast coalesced block data due to the nature of the block data including a plurality of separate short data words that are addressed to a plurality of separate and independently functioning memory submodules, and the separate short data words are transferred simultaneously in a single block data transfer when the multicast coalesced block data is sent via the memory channel.
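The fixed bit-slot layout described above can be sketched in software. In this sketch the function names and the 16-submodule, 16-bit parameters follow the example in the text but are otherwise illustrative assumptions, not the actual hardware interface.

```python
# Sketch: each of 16 memory submodules contributes one 16-bit short data
# word; word i occupies bits 16*i .. 16*i + 15 of the 256-bit block,
# mirroring the layout of block data 300.

WORD_BITS = 16
NUM_SUBMODULES = 16

def coalesce_words(words):
    """Pack one short data word per submodule into a single block value."""
    assert len(words) == NUM_SUBMODULES
    block = 0
    for i, w in enumerate(words):
        assert 0 <= w < (1 << WORD_BITS)
        block |= w << (i * WORD_BITS)  # submodule i's fixed, unique slot
    return block

def extract_word(block, i):
    """Recover the short data word contributed by submodule i."""
    return (block >> (i * WORD_BITS)) & ((1 << WORD_BITS) - 1)
```

With this layout, the word from the first submodule lands in bits 0 to 15 and the word from the last submodule in bits 240 to 255, so the whole block fits a single 256-bit channel transfer.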
- FIG. 4 illustrates one example of a system with a plurality of memory modules 100, 400, 402, and 404, each of which is operatively coupled with the host processor 104 via the memory channel 108.
- Each memory module may include one or more memory dies and one or more logic dies with built-in computation capabilities provided by processing-near-memory (PNM) technologies.
- the computation capabilities of memory module 100 may be implemented on a separate logic die 406 which is 3D-stacked with the memory die(s), and the memory die(s) may be implemented with the memory submodules 112, 114, 116, and 118.
- the memory submodules 112, 114, 116, and 118 may be dynamic random access memory (DRAM) devices in some examples.
- Each of the other memory modules 400, 402, and 404 may be similarly constructed.
- the implementations described herein are also applicable to cases where the computation capabilities (that is, computing units) are incorporated at each memory bank or memory module, as in bank-level processing-in-memory (PIM) systems.
- the computing units may be implemented directly on the memory dies instead of on the separate logic die.
- the stream of commands may be broadcast to multiple PIM units within a memory module.
- An example implementation of such an organization is for each PIM command to be broadcast to all the memory banks associated with a single memory channel.
- the implementations described herein are also applicable in other system configurations that may consist of multiple host processors and memory modules interconnected in various configurations.
- a non-transitory storage medium, such as a memory, includes executable instructions that, when executed by one or more processors, such as the host processor 104 or the memory module 100 with data processing capabilities, cause the one or more processors to perform the methods for controlling digital data transfer via a memory channel as disclosed in FIGs. 5 through 8.
- FIG. 5 illustrates a method 500 of performing the coalesced block data transfer, as performed by the host processor 104, or by the memory module 100 having either logic die(s) with built-in computation capabilities as provided by PNM or memory dies with built-in computation capabilities as provided by PIM where each memory die has its own independent computation capabilities.
- the processor or the memory module coalesces a plurality of short data words into the multicast coalesced block data.
- the multicast coalesced block data includes a single data block for transfer via the memory channel, where each of the short data words pertains to one of at least two partitioned memory submodules in the memory module.
- the short data words are associated with a subset of the memory submodules such that the short data words are either loaded from specific locations within the memory submodules or stored in the specific locations within the memory submodules.
- the single data block containing the plurality of short data words is transferred via the memory channel in a single data transfer. The transfer may be from the processor to the memory module or from the memory module to the processor.
- FIG. 6 illustrates a method 600 of switching between different modes in the system, where one mode facilitates the formation and transfer of the multicast coalesced block data between the processor and at least two of the memory submodules as explained in method 500, and the other mode facilitates the contiguous block data transfer between the processor and a single memory submodule.
- the processor detects a condition indicative of potential for short data word coalescing.
- the memory module or the processor switches between a first mode facilitating the transfer of the multicast coalesced block data and a second mode facilitating a contiguous block data transfer between the processor and one of the memory submodules.
- the processes of detecting the condition and issuing a command to switch between the first and second modes take place entirely on the host side at the memory interface of the host processor.
- the multicast memory coalesce operation to coalesce the short data words associated with the plurality of memory submodules may occur on the memory side at the host interface of the memory module (i.e., when a coalesced load command is issued by the processor)
- the host processor is responsible for detecting the condition and issuing multicast coalescing requests to the memory module, which may be issued as part of the coalesced load command.
- the condition is explicitly triggered by an application running on the processor or instructions issued by the application (e.g., a special memory command to “start coalescing”, or explicit coalesced memory commands issued by the application which trigger coalescing).
- the condition includes an indication of a sparse memory access operation at the memory interface.
- the sparse memory access is defined as accessing a smaller number of bits of data (e.g., a short data word) sparsely at two or more of the memory submodules, as opposed to a contiguous memory access in which a single contiguous section of a larger number of bits of data (e.g., an entire cache line or data block) is to be accessed at a single memory submodule.
- the memory interface may include a memory controller which stores a sequence of memory commands from the processor into queues such that there is a separate queue for each memory submodule.
- the memory controller may detect a hint or indication that a command at the front of each queue is comprised of memory commands for sparse bits or short data words, and in response to the detection, the sparse bits or short data words from the queues are concatenated or coalesced into the multicast coalesced block data.
- the memory controller may store a sequence of memory commands into a single memory queue.
- the condition may be detected by periodically searching the queue for memory commands targeting short data words that map to different memory submodules. Additionally or alternatively, when a memory command targeting a short data word is inserted, the queue may be searched for other commands of the same or similar type targeting short data words in different submodules with which it may be coalesced. A threshold on the number of coalesce-able commands may be implemented to trigger the condition. If coalescing is limited to short data words that share some address offset bits (e.g., short data words that fall in the same DRAM column index), the address bits are also compared when searching for coalescing opportunities.
- the queue entries may also contain information regarding whether the associated memory command targets a short data word, information regarding whether multiple short data words have been coalesced into the queue entry, and offset information regarding the short data word(s) targeted by the queue entry.
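As a rough illustration of the queue-scan detection described above, the sketch below groups same-type short-word commands that map to distinct submodules and fires once a threshold is reached. The command fields, the threshold parameter, and the column check are assumptions made for the sketch, not the memory controller's actual design.

```python
from collections import namedtuple

# Hypothetical queue-entry fields: operation, target submodule, a flag
# marking short-data-word commands, and the DRAM column index accessed.
MemCmd = namedtuple("MemCmd", "op submodule is_short_word column")

def find_coalescable(queue, op, threshold, same_column=False):
    """Scan the queue for same-type short-word commands mapping to
    distinct submodules; return a coalescing group once the threshold
    is reached, otherwise an empty list."""
    group = {}
    for cmd in queue:
        if cmd.op != op or not cmd.is_short_word:
            continue
        if same_column and group:
            ref = next(iter(group.values()))
            if cmd.column != ref.column:  # must share address offset bits
                continue
        group.setdefault(cmd.submodule, cmd)  # at most one word per submodule
    cmds = list(group.values())
    return cmds if len(cmds) >= threshold else []
```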
- FIG. 7 illustrates a method 700 of performing a coalesced memory load operation by the processor.
- the processor sends a coalesced load command to the memory module.
- the command causes the memory module to retrieve the short data words from the memory submodules and perform a multicast memory coalesce operation to coalesce the short data words into the coalesced block data.
- the processor determines whether the coalesced command directly targets the memory submodules, or whether the command targets in-memory or near-memory registers.
- the type of command that is being coalesced specifies whether the memory submodules or the in-memory or near memory registers are targeted.
- the processor communicates one or more location identifiers to the memory module.
- the one or more location identifiers identify the plurality of locations associated with the short data words within the memory submodules.
- the processor may cause the memory module to retrieve the short data words from the identified locations within the memory submodules.
- the short data words are coalesced into the multicast coalesced block data prior to being transferred as a single data block via the memory channel.
- the processor extracts (or de-coalesces) each of the short data words from the multicast coalesced block data, in response to receiving the multicast coalesced block data.
- the processor configures at least one register associated with each of the memory submodules to cause the memory module to store a short data word to be coalesced into the coalesced block data.
- a register is implemented for every memory submodule and can be directly accessed. If the same register and same offset within the register is accessed for every memory submodule, then, according to some examples, it may not be necessary to supply location information for each of the short data words. Placing the short data words in the registers may be orchestrated in advance by the processor by performing memory-local load commands from the memory submodules in some examples.
- the placement of the short data words in the registers may be orchestrated by processing each of the short data words (for example, performing calculations on the short data words) using the near-memory or in-memory processing capabilities of the memory module.
- the multicast memory coalesce operation is performed by the processor to read the short data words from the register(s) and form the multicast coalesced block data, before proceeding to step 708.
- FIG. 8 illustrates a method 800 of performing a coalesced memory store operation by the processor.
- the processor generates the multicast coalesced block data to be transferred to the memory module.
- the processor determines whether the coalesced commands directly target the memory submodules, or whether the commands target the in-memory or near-memory registers.
- the processor sends one or more location identifiers to the memory module.
- the location identifiers identify the locations associated with the short data words within the memory submodules, which causes the memory module to store the short data words at the identified locations within the memory submodules.
- the processor sends a coalesced store command to the memory module.
- the command causes the memory module to perform a multicast memory extract operation to extract the short data words from the multicast coalesced block data, distribute the short data words to the memory submodules, and store the short data words at the identified locations within the memory submodules.
- the steps 806 and 808 may be performed sequentially or simultaneously.
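The store-side multicast memory extract operation described above can be sketched as follows; modeling each memory submodule as a location-to-word mapping and fixing the word size at 16 bits are assumptions made for illustration.

```python
WORD_BITS = 16
MASK = (1 << WORD_BITS) - 1

def extract_and_store(block, location_ids, submodules):
    """Extract each short data word from the multicast coalesced block
    and store it at the identified location within its memory submodule.

    submodules is modeled as a list of dicts (location -> word)."""
    for i, loc in enumerate(location_ids):
        word = (block >> (i * WORD_BITS)) & MASK  # multicast memory extract
        submodules[i][loc] = word                 # distribute and store
```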
- the location identifier(s) may be generated or computed by the in-memory or near-memory processing logic component based on data stored in the per-submodule register or the memory submodule.
- at step 810, the processor configures at least one register associated with the memory submodules to cause the memory module to extract the short data words from the multicast coalesced block data and store the short data words in the at least one register.
- the memory submodule’s near-memory or in-memory computing component may access the register(s) for further computation and/or data movement.
- the location identifier(s) may be communicated by using bits in the coalesced load/store command and/or data bus to transmit per-submodule location bits.
- the location identifier(s) may be obtained by computing per-submodule location information, for example by loading the location identifier(s) and/or computing the location identifier(s) near-memory and subsequently storing it in a near-memory register.
- the location information that is static or common to all memory submodules need not be transmitted separately. For example, additional per-submodule location bits may be added to a common base address, and no additional location information may need to be provided if a coalesced command targets a short data word at the same offset in each memory submodule or near-memory register.
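The common-base-plus-offset scheme can be sketched as follows; the function and parameter names are illustrative, not part of the disclosed interface.

```python
def per_submodule_locations(base, offsets=None, num_submodules=16):
    """Compute the location each submodule accesses for a coalesced command.

    When every submodule targets the same offset (offsets is None), the
    common base address alone suffices and no per-submodule location bits
    need to be transmitted; otherwise each per-submodule offset is added
    to the common base, e.g., by a near-memory integer adder.
    """
    if offsets is None:
        return [base] * num_submodules
    return [base + off for off in offsets]
```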
- any of the logic components as disclosed herein, such as those shown in FIGs. 9 through 14, may be implemented as discrete logic, one or more state machines, a field programmable gate array (FPGA), or any suitable combination of processors/processing logic components executing instructions and other hardware.
- a logic device or component includes devices or components that are able to perform a logical operation, e.g., an apparatus able to perform a Boolean logical operation. It will be recognized that functional blocks are shown and that the various operations may be combined or separated as desired. It will also be recognized that all the functional blocks are not needed in all implementations.
- the arrows shown in the figures define the directions of data transfer between the components during the specified loading or storing operation, which may be implemented using any suitable data path, such as via hardwiring or a data channel such as a data bus. Some implementations may also involve pipelining of transfers/operations in which the data is buffered in one or more registers. Furthermore, for simplicity, the embodiments described herein pertain to a memory module 100 that is high bandwidth memory (HBM) where an entire cache line of data access is provided from a single memory module.
- FIG. 9 illustrates one example of a system, or more specifically a computing system that utilizes data transfers between a memory module 100 and a host processor 104.
- the memory module 100 includes a multicast coalesce logic 900 which includes a plurality of processing logic components 902, 904, 906, 908 with a plurality of registers 912, 914, 916, 918, where each processor and register is associated with one of the memory submodules 112, 114, 116, 118.
- the processing logic 902, 904, 906, 908 are the near memory or in-memory computing components configured to provide computing capabilities for the memory submodules.
- the multicast coalesce logic 900 is coupled with the data links 200, 202, 204, 206 such that the short data words are transferred via the data links 200, 202, 204, 206 through a shared data bus, or a data channel 934.
- the short word data to be received from or sent to the processing logic 902 may occupy bits 0 to 15 of the data channel 934
- the short word data to be received from or sent to the processing logic 904 may occupy bits 16 to 31 of the data channel 934
- the short word data to be received from or sent to the processing logic 906 may occupy bits 32 to 47 of the data channel 934, and so forth, such that the short word data to be received from or sent to the processing logic 908 may occupy bits 240 to 255 of the data channel 934
- the processing logic 902, 904, 906, 908 may include enough processing capabilities to support coalesced load (CLD) and coalesced store (CST) operations and may be dedicated to supporting only such operations.
- the memory module 100 further includes a mode selector 922, a mode selection component which may be implemented as tri-state gates or a multiplexer, for example.
- the mode selector 922 operates as a switch between a first mode facilitating the transfer of the multicast coalesced block data and a second mode facilitating a contiguous block data transfer, as explained.
- the mode selector 922 may be implemented as a programmable logic device such that a control bit is utilized to activate the switching.
- the mode selector 922 is configured to transfer data through the data channel 934 while in the first mode and transfer data to and from a memory submodule selector 924 while in the second mode.
- the memory submodule selector 924 is shared by all the memory submodules and is configured to select which submodule to access based on a memory submodule identifier (ID) 910 provided, when the mode selector 922 is operating in the second mode.
- one or both of the mode selector 922 and the submodule selector 924 may be implemented in a distributed manner such that the mode selector 922 and/or the submodule selector 924 includes a plurality of separately functioning logic components, with at least one logic component disposed near each of the memory submodules to control the access to a single shared data channel or data bus by the memory submodules.
- the logic components may include, but are not limited to, multiplexers or tri-state gates.
- the host processor 104 also includes a multicast coalesce logic 926 separate from the multicast coalesce logic 900 of the memory module 100 and therefore has a different functionality. Examples of performing a multicast coalesced block data transfer are described in detail below, in view of the components mentioned above.
- a PIM register identifier is specified for the PIM unit associated with each memory bank or submodule, utilizing PIM support.
- Each of the PIM units contributes the short data word (e.g., 16 bits) of the identified register to the coalesced output.
- some embodiments may return the lower 16 bits of the register or some other fixed offset of the register.
- the CLD operation may have a parameter that allows the software on the host to specify which 16 bits of the register each PIM unit should return.
- the register can be programmed at the PIM unit a priori before the CLD operation is issued.
- the CLD operation may specify a memory address within each memory bank or submodule such that each memory submodule reads the short word data (e.g., 16 bits) stored at the specified memory location and returns it.
- this is achieved by each memory module receiving a broadcast intra-module memory address as part of the CLD operation and returning the data at that location in each memory submodule.
- this utilizes support for communicating additional address information to each memory submodule. This may be achieved via bank-local address generation or by enhancing the command interface, e.g., by using the data bus to send command information, or alternatively by coalescing commands that share target address bits.
- in response to a command broadcast to a collection of memory submodules, each memory submodule returns to the requesting host a chunk or block of data at a fixed, unique position within the wide load data return of the memory interface. Because each memory submodule returns data at a specific and unique location within the data returned to the host, all the participating memory submodules return their data in a single block data transfer over the memory data bus or channel.
- the host processor 104 specifies the submodule-specific registers 912, 914, 916, 918 associated with the memory submodules 112, 114, 116, 118 to access.
- Each of the in-memory or near-memory processing logic components 902, 904, 906, 908 contributes a short data word of a predetermined word length (for example, 16 bits in the illustrated example) as stored in the corresponding register to be output to the host interface 102 via data channels 934 and 936.
- the lower 16 bits (or the number of bits pertaining to the length of the short data word) are returned, or alternatively other offsets may be employed to retrieve the short data words from different locations in the registers.
- the host processor 104 may issue a command specifying which bits of the register each of processing logic 902, 904, 906, 908 should return.
- the registers 912, 914, 916, 918 may be populated with the short data words by the corresponding processing logic 902, 904, 906, 908 by retrieving the short data words via memory channels 930 a priori before the CLD operation is issued by the host processor 104.
- the memory channels 930 are wider (for example, 256 bits) than the data links 200, 202, 204, 206 because the same channels may be used to transfer block data between the submodules 112, 114, 116, 118 and the memory submodule selector 924 as well.
- the processing logic 902, 904, 906, 908 transfer the short data words via the data channel 934 as multicast coalesced block data.
- the host processor 104 can retrieve the block data via the data channel 934.
- the host processor 104 can retrieve contiguous block data via another data channel 932 from one of the memory submodules 112, 114, 116, or 118.
- the host processor 104 transfers the block data via an internal data channel 938 to the multicast coalesce logic 926 to extract the short data words from the block data to be processed accordingly.
- the multicast coalesce logic 926 therefore, is also capable of operating as an extractor or de-coalescing component, and the multicast coalesce logic 900 in the memory module 100 is also capable of performing such operation, as further explained herein with regard to a coalesced store operation.
- a PIM register identifier is specified for the PIM unit associated with each memory bank or submodule, utilizing PIM support.
- Each of the PIM units writes the short data word received from the coalesced input to the identified register.
- some embodiments may store or perform a PIM operation on the data in the lower 16 bits and zero out the remaining bits, for example via masking.
- Other embodiments may sign extend the 16-bit short data word and store or perform a PIM operation with the extended data targeting the specified PIM register.
- the CST operation may have a parameter that allows the software on the host to specify to which 16 bits of the register that each PIM unit is to write the corresponding data.
- the register can be programmed at the PIM unit a priori before the CST operation is issued.
- the CST operation may specify a memory address within each memory bank or submodule such that each memory submodule writes the short word data (e.g., 16 bits) to the specified memory location.
- this is achieved by each memory module receiving a broadcast intra-module memory address as part of the CST operation and storing the data at that location in each memory submodule.
- this utilizes support for communicating additional address information to each memory submodule. This may be achieved via bank-local address generation or by enhancing the command interface, e.g., by using the data bus to send command information, or alternatively by coalescing commands that share target address bits.
- the host processor 104 transfers data in the opposite direction from the CLD operation such that the multicast coalesce logic 926 performs the coalescing of multiple short data words to be transferred to the memory module 100 as a single block of data (or a single write operation), after which the short data words are extracted and distributed across the memory submodules 112, 114, 116, 118.
- the data distribution may be performed as follows: data from bits 0 to 15 of the host-provided multicast coalesced block data is sent to the submodule 112, data from bits 16 to 31 of the host-provided multicast coalesced block data is sent to the submodule 114, data from bits 32 to 47 of the host-provided multicast coalesced block data is sent to the submodule 116, and so forth, until data from bits 240 to 255 of the host-provided multicast coalesced block data is sent to the submodule 118.
- the CST operation specifies the registers 912, 914, 916, 918 associated with the submodules 112, 114, 116, 118, and the corresponding processing logic 902, 904, 906, 908 write the short data words (for example, 16 bits) extracted from the multicast coalesced block data to be stored into the registers 912, 914, 916, 918.
- the short data word may be stored in the lower 16 bits (or the number of bits pertaining to the length of the short data word) while leaving the remaining bits as 0.
- a sign extension operation is performed on the short data words to extend the data length thereof before storing it or performing a processing operation by the processor, with the extended data targeting the specified register 912, 914, 916, 918.
- the CST operation may have a parameter that allows the host to specify which of the bits that are stored in the registers 912, 914, 916, 918 should be written by the processing logic 902, 904, 906, 908 into the corresponding memory submodules 112, 114, 116, 118.
- the host processor 104 may provide instructions regarding the operation of the mode selector 922 and the submodule selector 924.
- the multicast coalesce logic 926 may initiate a command or instruction 940 to switch the mode selector 922 from the second mode facilitating contiguous block data transfer to the first mode facilitating multicast coalesced block data transfer when providing the CLD or CST operation via a command channel or command bus 928.
- the instruction to switch to the first mode may be as simple as activating a control bit in the mode selector 922, such that the first mode is activated when the control bit is 1, and the second mode is activated when the control bit is 0, for example.
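The single-control-bit switching can be modeled as a simple multiplexer. This sketch abstracts the tri-state gates or multiplexer hardware; the function name and the dict-based submodule model are illustrative assumptions.

```python
def route_load(control_bit, coalesced_block, submodules, submodule_id):
    """Model the mode selector 922 for a load: control bit 1 selects the
    first mode (multicast coalesced block data from the shared data
    channel); control bit 0 selects the second mode (contiguous block
    data from the submodule chosen by the submodule selector 924)."""
    if control_bit:
        return coalesced_block
    return submodules[submodule_id]
```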
- the near-memory or in-memory processing logic components 902, 904, 906, 908 are capable of determining one or more location identifiers (for example, address bits such as column indices of memory arrays) identifying the specific locations associated with the short data words within the memory submodules 112, 114, 116, 118 based on the multicast coalesced block data supplied by the host processor 104 or the information stored in the memory module 100, for example in the registers 912, 914, 916, 918.
- while FIG. 9 illustrates each memory submodule associated with its own near-memory or in-memory processor, in some embodiments the processing capabilities of a single near-memory or in-memory processor may be shared among a plurality of memory submodules.
- the instruction to switch between the two aforementioned modes may be provided by the near-memory or in-memory processor(s), as shown by the transfer of a coalescing configuration bit 944 from a near-memory storage, for example from any of the processing logic components 902, 904, 906, 908.
- the processor may update its near-memory storage value whenever it determines a condition to switch between coalesced mode and contiguous mode.
- the processor(s) may be capable of detecting a condition indicative of potential for a multicast coalesced block data transfer, such as those previously explained herein.
- some implementations may also involve pipelining of transfers/operations in which the data is buffered in one or more intermediate buffer registers, for example the block data register 1102 shown in FIGs. 11 to 14, which may be disposed between the mode selector 922 and the multicast coalesce logic 900.
- FIG. 10 illustrates the different types of registers which may be utilized by the processing logic components 902, 904, 906, 908 as part of the multicast coalesce logic 900.
- an offset register 1000, 1002, 1004, 1006 may be implemented to store the offset information for the short data word that is either loaded from or to be stored in the corresponding submodule.
- the offset register 1000, 1002, 1004, 1006 may store a specific address offset for the short data word when it is being coalesced with the other short data words into the multicast coalesced block data (such as the short data word from the memory submodule 112 occupying bits 0 to 15, the short data word from the memory submodule 114 occupying bits 16 to 31, and so forth).
- Each address offset is different and unique to the corresponding memory submodule in order to avoid a short data word entry overwriting another short data word entry due to an unintended overlap of the occupied bits.
- the address offset may help the processing logic 902, 904, 906, 908 identify which bits of the multicast coalesced block data to retrieve the short data word from in order to store the retrieved short data word in the corresponding memory submodule 112, 114, 116, 118.
- In some embodiments, the offset register 1000, 1002, 1004, 1006 contains a plurality of offset information such that, when the processor supplies a base address, the processing logic 902, 904, 906, 908 can calculate the unique location information associated with each of the plurality of memory submodules based on the stored offset information and the provided base address.
- the offset values stored in the register may be preprogrammed or stored a priori before the CLD or CST operation is issued by the host processor 104.
- a coalesce configuration register 1008, 1010, 1012, 1014 may be implemented to store data regarding the mode to be selected by the mode selector 922, as determined by the processing logic components 902, 904, 906, 908 in response to detecting conditions indicative of a multicast coalesced block data transfer.
- the register may indicate a single bit, where the first mode is activated when the bit is 1, and the second mode is activated when the bit is 0.
- Short data registers 1016, 1018, 1020, 1022 are the registers which store the short data word to be coalesced into the multicast coalesced block data during the CLD operations or to be stored in the corresponding memory submodule 112, 114, 116, 118 during the CST operations.
- the CLD operation reads from the short data registers 1016, 1018, 1020, 1022 (or one or more per-submodule near-memory registers) to obtain the short data words, in which case the short data words are already stored in the registers as a result of a prior near-memory processing operation. Therefore, the short data words may be stored a priori in these short data registers before the operation is issued.
- FIG. 11 illustrates one example of a computing system that utilizes data transfers between a memory module 100 and a host processor 104.
- the host processor 104 provides the instruction to operate the mode selector 922, the address offset for each of the short data words, and the location address for each of the memory submodules 112, 114, 116, 118.
- Address offset and location address are sent as an instruction command 1100 to the multicast coalesce logic 900, coupled with a block data register 1102 to store the multicast coalesced block data, and the memory submodules, respectively, from the host processor 104.
- the address offset defines how much a position of the short data word is shifted during the multicast coalescing operation, and the location address defines the specific location within a specified memory submodule (for example, a column index of a memory array within the memory submodule) that is to be accessed for the CLD or CST operations.
- near-memory integer adders may be implemented to generate the location addresses for the memory submodules.
- FIGs. 12 and 14 illustrate the dataflow within the system shown in FIG. 11 during the CLD operations
- FIG. 13 illustrates the dataflow within the system during the CST operations.
- FIG. 12 illustrates one example of a system operating the CLD operation, where the multicast coalesce logic 900 includes a concatenate logic 1200 coupled with address shift components, which in this example is shifter/selector logic 1202, 1204, 1206, 1208, which may be programmable, configured to receive the data to be loaded from the memory submodules 112, 114, 116, 118.
- the shifter/selector logic 1202, 1204, 1206, 1208 is configured to receive contiguous block data from the corresponding memory submodule 112, 114, 116,
- the shifter/selector logic 1202, 1204, 1206, 1208 is then configured to (1) select the short data word from the contiguous block data to store in the block data register 1102, or (2) shift the short data word by an address offset before storing the short data word in the block data register 1102.
- If the selector logic (1) is implemented, the short data word is selected using the predetermined address offset without performing address shifting. If the shifter logic (2) is used, the short data words are separately shifted by their respective offsets into the first predetermined number of bits, and that predetermined number of bits, starting with the first bit, is selected to obtain the short data words.
- the stored short data words are coalesced using the concatenate logic 1200, which performs string concatenation to join the short data words end-to-end.
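- The shift-then-concatenate datapath can be sketched as follows; the 16-bit word and 64-bit block widths are illustrative assumptions, as the patent does not fix particular widths:

```python
WORD_BITS = 16   # assumed short-data-word width
BLOCK_BITS = 64  # assumed per-submodule contiguous block width

def select_word(block, bit_offset):
    """Shifter/selector: shift the short word down to bit 0, then mask it."""
    return (block >> bit_offset) & ((1 << WORD_BITS) - 1)

def coalesce(blocks, bit_offsets):
    """Concatenate one selected word per submodule end-to-end,
    with submodule 0 occupying the lowest bits of the block data."""
    result = 0
    for i, (block, offset) in enumerate(zip(blocks, bit_offsets)):
        result |= select_word(block, offset) << (i * WORD_BITS)
    return result
```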
- the shifter/selector logic 1202, 1204, 1206, 1208 may include register(s) configured to store the location address (to determine which bits or location within the contiguous block data should be selected as the short data word) and/or the address offset value (to determine the amount of shifting to be performed on the short data word prior to coalescing).
- Short data word extraction logic 1210 may be implemented in the host processor 104 to extract the individual short data words from the multicast coalesced block data after receiving the same via the memory channel 108 in a single block data transfer.
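- On the host side, the extraction step is the inverse of the concatenation. A minimal sketch, assuming fixed-width words packed low-to-high by submodule index:

```python
def extract_words(coalesced_block, num_submodules, word_bits=16):
    """Recover each submodule's short data word from its unique
    position within the multicast coalesced block data."""
    mask = (1 << word_bits) - 1
    return [(coalesced_block >> (i * word_bits)) & mask
            for i in range(num_submodules)]
```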
- Each shifter/selector logic 1202, 1204, 1206, 1208 may receive the address offset information in the instruction command 1100 provided by the host processor 104, where the address offset information defines where the short data word from each memory submodule is to be located in the multicast coalesced block data.
- Each memory submodule 112, 114, 116, 118 may receive the location address information in the instruction command 1100 to store in the memory address register the location from which the short data word is to be retrieved.
- FIG. 13 illustrates one example of a system operating the CST operation, where the host processor 104 includes concatenate logic 1300 to form the multicast coalesced block data including short data words that are to be stored in a plurality of the memory submodules 112, 114, 116, 118.
- the multicast coalesce logic 900 which in this case receives the multicast coalesced block data via a de-coalescing path, includes subset distribution logic 1302, 1304, 1306, 1308 to distribute the short data words to their respective memory submodules as intended by the host processor 104.
- the subset distribution logic further implements additional address offset bits, communicated from the multicast coalesce logic to the memory submodule, to indicate which short data word(s) need to be written and to prevent writing the other bits in the column index of the memory array.
- the block data register 1102 stores the multicast coalesced block data received from the host processor 104, and the stored data is transferred to each subset distribution logic 1302, 1304, 1306, 1308 via a data channel 1310, 1312, 1314, 1316.
- Each subset distribution logic may receive the address offset information in the instruction command 1100 provided by the host processor 104, where the address offset information defines which bits within the multicast coalesced block data are to be distributed to which memory submodule.
- each subset distribution logic may be utilized to drive a subset of the data stored in the block data register 1102.
- Each memory submodule 112, 114, 116, 118 may receive the location address information in the instruction command 1100 to store in the memory address register the location in which the short data word is to be stored.
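- The masked write at each submodule can be sketched as follows, so that only the targeted short data word is written while the other bits of the addressed column are preserved (widths and the read-modify-write formulation are illustrative assumptions):

```python
def store_word(column, word, bit_offset, word_bits=16):
    """Write one short data word into a submodule column at bit_offset,
    leaving all other bits of the column untouched (write mask)."""
    mask = ((1 << word_bits) - 1) << bit_offset
    return (column & ~mask) | ((word << bit_offset) & mask)
```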
- FIG. 14 illustrates one example of a system operating the CLD operation, where the shifter/selector logic 1202, 1204, 1206, 1208 from FIG. 12 is excluded from the multicast coalesce logic 900.
- the address offset of the short data word from the corresponding memory submodule 112, 114, 116, 118 within the block data register 1102 are configurable a priori or predetermined. That is, the offset of the short data word may be a static offset, in which case a data connection to the appropriate bits within the block data register 1102 may be hardwired (thus forming the address shift components) such that the address bits of the short data word from each memory submodule are automatically shifted to be stored in certain predetermined or preconfigured bits within the block data register 1102.
- the concatenate logic 1200 is also excluded from the multicast coalesce logic 900. Instead of using the concatenate logic 1200 to control the coalescing of the short data words into a single block of data to be stored in the block data register 1102 as did FIG. 12, the position-shifted short data words from the memory submodules 112, 114, 116, 118 are concatenated together by joining wires from these memory submodules directly into a data bus to transfer the bits of the short data words to the block data register 1102.
- the CLD and CST operations may be performed by communicating partial command information via the data bus rather than the command bus.
- the short data words combined may occupy a portion of the multicast coalesced block data, such that the remaining portion of the block data is used to store other information such as the address bits for each of the short data words.
- the address bits for each of the short data words may be implemented in a portion of the multicast coalesced block data that is not occupied by the short data words that are to be stored in the memory submodules.
- a non-transitory storage medium such as memory, includes executable instructions, commands, or address information that when executed by one or more processors, such as the host processor 104 or the PNM/PIM device, causes the one or more processors to place the short data words such that sparse accesses can be orchestrated via the CLD and CST operations to the short data word associated with the corresponding locations within each memory submodule.
- This may be applicable in situations with deterministic sparse access patterns, such as accesses along the secondary axes of multidimensional matrices or tensors, or where large table entries are split across memory submodules for efficient use of PNM or PIM operations, including but not limited to applications involving machine-learning-based recommendation systems.
- the coalescing may be performed explicitly via software implementation, the executable instructions for which are stored in the non-transitory storage medium.
- the processor that is running the software may send a pre-coalesced request which bypasses the cache, and the request is handled at a memory controller without the need for any additional hardware for coalescing.
- the executable instructions, when executed by the one or more processors, cause the one or more processors to send independent fine-grained sparse accesses, which are then coalesced via hardware such as discrete logic or state machines.
- the host processor 104 may be capable of dynamically detecting coalescing opportunities based on monitoring the data channels and memory submodules targeted by independent and concurrent accesses.
- the host processor or the PNM/PIM devices may also be capable of merging or coalescing independent requests (that is, CLD or CST command requesting for access to the memory submodules) and splitting the responses during the CLD or CST operations.
- sparse commands are capable of traversing a cache hierarchy; according to some examples, there may be a need to differentiate between sparse accesses and non-sparse data accesses, such as contiguous block data accesses involving a single memory submodule, since these requests are to be handled differently by cache controller(s). In such situations, the differentiation may be facilitated by implementing an additional bit or opcode in the command, for example.
- sparse accesses are handled differently from non-sparse data accesses because the sparse accesses do not access a full cache line of data. As such, sparse accesses may simply bypass the cache (e.g., using existing support for uncached address ranges) according to some examples.
- sparse accesses may traverse the caches but may be prevented from allocating or populating cache blocks on a miss, according to some examples.
- caches are enhanced with a sector mask which tracks the state information at the granularity of the sparse accesses, allowing caches to store partially-valid cache blocks.
- the command from the host processor may require additional address information as compared to a non-sparse access. This may be implemented by splitting the request into more packets, using bit masks or a byte mask to indicate which byte(s) within a cache line are accessed by a sparse command, or by sending some or all of the request along a dedicated data path, such as the command bus 928.
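- One way to carry that extra address information is a per-cache-line byte mask; a hypothetical encoding (one bit per byte touched by the sparse command):

```python
def byte_mask(offset, length):
    """Set one bit per byte of the cache line accessed by a sparse
    command, starting at byte `offset` for `length` bytes."""
    mask = 0
    for b in range(offset, offset + length):
        mask |= 1 << b
    return mask
```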
- the processor may be implemented with a means for merging and/or splitting the requests and responses associated with sparse memory access.
- the coalescing operation occurs in the memory controller, where the requests are already sorted into different queues based on target memory submodule. If sparse memory accesses are detected at the front of multiple memory submodule queues, the memory controller merges the requests together and issues a single CLD or CST operation that includes all the requests associated with those memory submodules.
- sparse memory access requests are placed in separate queues, from which the sparse memory access requests are interleaved with dense memory access requests (that is, requests to access a contiguous block of data from a single memory submodule) when a threshold of coalesce-able sparse memory accesses to different submodules is reached, or when a timing threshold is exceeded.
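- A minimal sketch of such a coalescing queue follows; the class, the per-submodule queue structure, and the threshold values are illustrative assumptions rather than the patent's implementation:

```python
import time

class SparseCoalescer:
    """Merge per-submodule sparse requests into one CLD/CST once a
    count threshold or a timing threshold is reached."""

    def __init__(self, num_submodules, count_threshold=4, timeout_s=0.001):
        self.queues = {i: [] for i in range(num_submodules)}
        self.count_threshold = count_threshold
        self.timeout_s = timeout_s
        self.first_enqueue = None  # age of the oldest pending request

    def enqueue(self, submodule, request):
        if self.first_enqueue is None:
            self.first_enqueue = time.monotonic()
        self.queues[submodule].append(request)

    def try_issue(self):
        """Return one coalesced operation (a list of per-submodule
        requests) if enough distinct submodules have a pending sparse
        request, or if the oldest request has waited too long."""
        ready = [q[0] for q in self.queues.values() if q]
        timed_out = (self.first_enqueue is not None and
                     time.monotonic() - self.first_enqueue > self.timeout_s)
        if len(ready) >= self.count_threshold or (timed_out and ready):
            for q in self.queues.values():
                if q:
                    q.pop(0)
            self.first_enqueue = None
            return ready
        return None
```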
- the memory controller may split the response and return the individual sparse memory access responses (potentially through the cache hierarchy) to the requesting processor(s). This may be accomplished by storing the response metadata (e.g., requestor ID, sparse address offset, or access granularity) for all pending sparse memory accesses in a small structure in the memory controller.
- the memory controller may split the data contents thereof into multiple response packets based on sparsity granularity, appending the stored metadata to these packets. If sparse memory access responses are returned to requestors through the cache hierarchy, caches may not allocate space for this data if the granularity of valid state tracking is greater than the sparsity granularity.
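- Splitting a coalesced response back into per-requestor packets can be sketched as below; the metadata field names (`requestor`, `slot`) and the fixed word width are assumptions for illustration:

```python
def split_response(block_data, pending_metadata, word_bits=16):
    """Split one coalesced response into per-requestor packets using
    the metadata saved when the sparse requests were merged."""
    mask = (1 << word_bits) - 1
    return [
        {"requestor": meta["requestor"],
         "data": (block_data >> meta["slot"] * word_bits) & mask}
        for meta in pending_metadata
    ]
```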
- Advantages of implementing systems with the capability to perform multicast memory coalesce/extract operations as disclosed herein include increased efficiency in reading/loading metadata (e.g., the number of data elements participating in an irregular computation at each PNM/PIM device) to the host processor from a collection of PNM/PIM devices associated with the memory channel in a single load operation.
- Similar gains apply to loading other metadata (e.g., the different loop iteration counts at each device) or a condition code (e.g., which PNM/PIM devices have data that meet a dynamically calculated, data-dependent condition).
- short data words distributed across a collection of memory banks or submodules coupled with the memory channel may be efficiently loaded or stored without the need to wastefully transfer a full cache line of data from each of the memory submodules.
- This improves performance in many application domains such as scientific computing (e.g., high-performance computing, or HPC) and graph analytics, as well as in implementing widely-used data structures such as hash tables and set membership data structures.
- the systems and methods disclosed herein also provide capabilities to improve the performance of near-memory or in-memory processing technologies which often transfer short data words to or from multiple near-memory or in-memory processing units associated with memory banks or submodules.
- a fine-grain operand may be efficiently provided to a plurality of memory submodules concurrently for a load operation that necessitates loading a short data word from different memory submodules and combining them with the supplied operand (e.g., for high-throughput fine-grained atomic accesses).
- the multicast memory coalesce operations as disclosed herein are also effective at improving the efficiency of the memory module (for example, DRAM efficiency) for sparse memory access patterns, such as those that exist in graph analytics, sparse matrix algebra, sparse machine learning models, and so on.
- Examples of non-transitory computer-readable storage media include read-only memory (ROM), random-access memory (RAM), registers, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and digital versatile disks (DVDs).
- systems and methods as disclosed herein help reduce the data transfer overhead by coalescing or aggregating short data words from a plurality of disparate memory submodules and transferring or communicating the multicast coalesced block data over the memory channel simultaneously in a single block data transfer.
- a short data word is returned or loaded from each of a collection of partitioned memory submodules to a host processor at a unique position within the single block data transfer, or is written or stored into each of a collection of partitioned memory submodules from a host processor at a unique position within the single block data transfer.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202280025718.9A CN117099086A (en) | 2021-03-31 | 2022-03-23 | System and method for consolidated multicast data transmission over a memory interface |
KR1020237035505A KR20230162012A (en) | 2021-03-31 | 2022-03-23 | System and method for aggregated multicast data transfers over memory interfaces |
EP22716642.8A EP4315078A1 (en) | 2021-03-31 | 2022-03-23 | System and method for coalesced multicast data transfers over memory interfaces |
JP2023559702A JP2024512086A (en) | 2021-03-31 | 2022-03-23 | System and method for coalesced multicast data transfer via memory interface |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/218,700 | 2021-03-31 | ||
US17/218,700 US11803311B2 (en) | 2021-03-31 | 2021-03-31 | System and method for coalesced multicast data transfers over memory interfaces |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022212145A1 true WO2022212145A1 (en) | 2022-10-06 |