WO2019148129A1 - Memory controller - Google Patents
Memory controller
- Publication number
- WO2019148129A1 (application PCT/US2019/015463)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- memory
- circuit
- data
- request
- control circuit
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0659—Command handling arrangements, e.g. command buffers, queues, command scheduling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0875—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with dedicated cache, e.g. instruction or stack
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/16—Handling requests for interconnection or transfer for access to memory bus
- G06F13/1668—Details of memory controller
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0604—Improving or facilitating administration, e.g. storage management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30076—Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
- G06F9/526—Mutual exclusion algorithms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1008—Correctness of operation, e.g. memory ordering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1024—Latency reduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/45—Caching of specific data in cache memory
- G06F2212/452—Instruction code
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- The present invention relates, in general, to memory controllers and, more particularly, to a memory controller which provides for both predetermined and programmable atomic operations and reduced latency for repetitively accessed memory locations.
- Memory controllers are ubiquitous in computing technologies, and among other things, memory controllers control access to read data from a memory circuit, to write data to a memory circuit, and to refresh the data held in the memory circuit.
- A wide variety of memory controllers are commercially available and are designed to be generally suitable for a wide range of applications, but are not optimized for particular applications, including machine learning and artificial intelligence (“AI”) applications.
- Such a memory controller should provide support for compute intensive kernels or operations which require considerable and highly frequent memory accesses, e.g., applications which have performance that may be limited by how quickly the application can access data stored in memory, such as for performing Fast Fourier Transform (“FFT”) operations, finite impulse response (“FIR”) filtering, and other compute intensive operations typically used in larger applications such as synthetic aperture radar (e.g., requiring frequent access to tables stored in memory), 5G networking and 5G base station operations, machine learning, AI, stencil code operations, and graph analytic operations such as graph clustering using spectral techniques, for example and without limitation.
- Such a memory controller should also be optimized for high throughput and low latency, including high throughput and low latency for atomic operations.
- Such a memory controller should also provide for a wide range of atomic operations, including both predetermined atomic operations and also programmable or user-defined atomic operations.
- The representative apparatus, system and method provide for a memory controller which has high performance and is energy efficient.
- Representative embodiments of the memory controller provide support for compute intensive kernels or operations which require considerable and highly frequent memory accesses, such as for performing Fast Fourier Transform (“FFT”) operations, finite impulse response (“FIR”) filtering, and other compute intensive operations typically used in larger applications such as synthetic aperture radar, 5G networking and 5G base station operations, machine learning, AI, stencil code operations, and graph analytic operations such as graph clustering using spectral techniques, for example and without limitation.
- Representative embodiments of the memory controller are optimized for high throughput and low latency, including high throughput and low latency for atomic operations.
- Representative embodiments of the memory controller also provide for a wide range of atomic operations, including both predetermined atomic operations and also programmable or user-defined atomic operations.
- When evaluated using an architectural simulator, representative embodiments of the memory controller produced dramatic results. For example, representative embodiments of the memory controller provided over a three-fold (3.48x) better atomic update performance using a standard GDDR6 DRAM memory compared to a state-of-the-art X86 server platform. Also for example, representative embodiments of the memory controller provided a seventeen-fold (17.6x) better atomic update performance using a modified GDDR6 DRAM memory (having more memory banks), also compared to a state-of-the-art X86 server platform.
- The representative embodiments of the memory controller also provided for very low latency and high throughput memory read and write operations, generally only limited by the memory bank availability, error correction overhead, and the bandwidth (Gb/s) available over communication networks and the memory and cache devices themselves, resulting in a flat latency until maximum bandwidth is achieved.
- Representative embodiments of the memory controller also provide very high performance (high throughput and low latency) for programmable or user-defined atomic operations, comparable to the performance of predetermined atomic operations.
- Circuitry in the memory controller transfers the atomic operation request to programmable atomic operations circuitry and sets a hazard bit stored in a memory hazard register corresponding to the memory address of the memory line used in the atomic operation, to ensure that no other operation (read, write, or atomic) is performed on that memory line; the hazard bit is then cleared upon completion of the atomic operation.
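The hazard-bit behavior described above can be illustrated with a minimal Python sketch. It is a software model only, not the patented circuit: the `HazardRegister` class, the `try_issue` helper, and the 64-byte line size are all hypothetical names and values chosen for illustration.

```python
class HazardRegister:
    """Tracks which memory lines are locked by an in-flight operation."""

    def __init__(self, line_bytes=64):
        self.line_bytes = line_bytes   # assumed line size, not from the source
        self._locked = set()           # line indices whose hazard bit is set

    def _line(self, address):
        return address // self.line_bytes

    def is_set(self, address):
        return self._line(address) in self._locked

    def set_bit(self, address):
        self._locked.add(self._line(address))

    def clear_bit(self, address):
        self._locked.discard(self._line(address))


def try_issue(hazards, address):
    """Lock the line and return True if no hazard is pending, else False."""
    if hazards.is_set(address):
        return False          # another operation owns this line
    hazards.set_bit(address)
    return True


hazards = HazardRegister()
first = try_issue(hazards, 0x1000)    # atomic operation takes the line
second = try_issue(hazards, 0x1008)   # same 64-byte line: blocked
other = try_issue(hazards, 0x2000)    # different line: allowed
hazards.clear_bit(0x1000)             # atomic operation completes
retry = try_issue(hazards, 0x1008)    # the blocked request can now proceed
print(first, second, other, retry)
```

In this model a blocked request simply reports failure; how real hardware holds or replays such a request is not specified in the text above.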
- Additional, direct data paths provided for the programmable atomic operations circuitry 135 executing the programmable or user-defined atomic operations allow for additional write operations without any limitations imposed by the bandwidth of the communication networks and without increasing any congestion of the communication networks.
- A memory controller circuit is coupleable to a first memory circuit, with the memory controller comprising: a first memory control circuit coupleable to the first memory circuit, the first memory control circuit adapted to read or load requested data from the first memory circuit in response to a read request and to write or store requested data to the first memory circuit in response to a write request; a second memory circuit; a second memory control circuit coupled to the second memory circuit, the second memory control circuit adapted to read or load the requested data from the second memory circuit in response to a read request when the requested data is stored in the second memory circuit, and to transfer the read request to the first memory control circuit when the requested data is not stored in the second memory circuit; predetermined atomic operations circuitry adapted to perform at least one predetermined atomic operation of a plurality of predetermined atomic operations in response to an atomic operation request designating the at least one predetermined atomic operation; and programmable atomic operations circuitry adapted to perform at least one programmable atomic operation of a plurality of programmable
- a memory controller circuit is coupleable to a first memory circuit, with the memory controller comprising: a first memory control circuit coupleable to the first memory circuit, the first memory control circuit adapted to read or load requested data from the first memory circuit in response to a read request and to write or store requested data to the first memory circuit in response to a write request; programmable atomic operations circuitry coupled to the first memory control circuit, the programmable atomic operations circuitry adapted to perform at least one programmable atomic operation of a plurality of programmable atomic operations in response to an atomic operation request designating the at least one programmable atomic operation; a second memory circuit; and a second memory control circuit coupled to the second memory circuit and to the first memory control circuit, the second memory control circuit adapted, in response to an atomic operation request designating the at least one programmable atomic operation and a memory address, to transfer the atomic operation request to the programmable atomic operations circuitry and to set a hazard bit stored in a memory hazard
- The plurality of predetermined atomic operations may comprise at least two predetermined atomic operations selected from the group consisting of: Fetch-and-AND, Fetch-and-OR, Fetch-and-XOR, Fetch-and-Add, Fetch-and-Subtract, Fetch-and-Increment, Fetch-and-Decrement, Fetch-and-Minimum, Fetch-and-Maximum, Fetch-and-Swap, Compare-and-Swap, and combinations thereof.
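As a rough illustration of what the listed predetermined operations compute, the following Python sketch models each one as a function of the current memory value and an operand, returning the old value as a fetch-and-op does. The names `PREDETERMINED_OPS`, `fetch_and_op`, and `compare_and_swap` are hypothetical, memory is modeled as a plain dictionary, and integer width and overflow are ignored. The sketch is single-threaded, so it shows the semantics only and not how atomicity is enforced.

```python
import operator

# Each predetermined operation maps (current value, operand) -> new value.
PREDETERMINED_OPS = {
    "fetch_and_and": operator.and_,
    "fetch_and_or": operator.or_,
    "fetch_and_xor": operator.xor,
    "fetch_and_add": operator.add,
    "fetch_and_subtract": operator.sub,
    "fetch_and_increment": lambda current, _: current + 1,
    "fetch_and_decrement": lambda current, _: current - 1,
    "fetch_and_minimum": min,
    "fetch_and_maximum": max,
    "fetch_and_swap": lambda _, new: new,
}


def fetch_and_op(memory, address, op_name, operand=0):
    """Apply the named operation in place and return the previous value."""
    old = memory[address]
    memory[address] = PREDETERMINED_OPS[op_name](old, operand)
    return old


def compare_and_swap(memory, address, expected, new):
    """Store `new` only if the current value equals `expected`."""
    old = memory[address]
    if old == expected:
        memory[address] = new
    return old


memory = {0x40: 5}
r1 = fetch_and_op(memory, 0x40, "fetch_and_add", 3)      # 5 -> 8
r2 = fetch_and_op(memory, 0x40, "fetch_and_minimum", 6)  # 8 -> 6
r3 = compare_and_swap(memory, 0x40, 6, 10)               # matches: 6 -> 10
r4 = compare_and_swap(memory, 0x40, 6, 99)               # no match: stays 10
print(r1, r2, r3, r4, memory[0x40])
```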
- the programmable atomic operations circuitry may comprise: an instruction cache storing a plurality of processor instructions corresponding to the at least one programmable atomic operation; an execution queue storing a thread identifier corresponding to the programmable atomic operation; a core control circuit coupled to the instruction cache and to the execution queue, the core control circuit adapted, in response to the thread identifier corresponding to the programmable atomic operation, to select a starting or next instruction or instruction address in the instruction cache for execution of the programmable atomic operation; and a processor core adapted to execute at least one instruction for the programmable atomic operation and to generate resulting data.
- the programmable atomic operations circuitry may further comprise: a memory controller interface circuit coupled to the processor core to receive the resulting data and to transfer the resulting data to the second memory control circuit to write the resulting data to the second memory circuit.
- the memory controller circuit may further comprise: a network communication interface coupleable to a communication network and coupled to the memory controller interface circuit, the network communication interface adapted to prepare and transmit a response data packet having the resulting data on the communication network.
- the programmable atomic operations circuitry may further comprise: at least one data buffer to store operand data and interim results generated from executing the at least one instruction for the programmable atomic operation. Also in a representative embodiment, the programmable atomic operations circuitry may further comprise: a network command queue coupled to the processor core, the network command queue storing resulting data; and a network communication interface coupled to the network command queue and coupleable to a communication network, the network communication interface adapted to prepare and transmit a response data packet having the resulting data on the communication network.
- the processor core may be coupled to a data buffer, and the processor core may be further adapted to execute a load non-buffered instruction to determine if an operand is stored in the data buffer and, when the data is not stored in the data buffer, to generate a read request to the second memory control circuit.
- the processor core may be further adapted to execute a store and clear lock instruction to generate an atomic write request to the second memory control circuit, the atomic write request having the resulting data and a designation to reset or clear a memory hazard bit following writing of the resulting data to the second memory circuit.
- the processor core may be further adapted to execute an atomic return instruction to reset or clear a memory hazard bit following writing of the resulting data to the second memory circuit.
- the processor core may be further adapted to execute an atomic return instruction to generate a response data packet having the resulting data.
- the processor core may be further adapted to execute an atomic return instruction to complete an atomic operation.
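The three instructions described above (load non-buffered, store and clear lock, atomic return) can be modeled loosely in Python. This is a sketch under stated assumptions: the `AtomicContext` class, its method names, and the `saturating_add` user-defined operation are hypothetical and do not reflect the actual instruction encoding or the processor core's design.

```python
class AtomicContext:
    """Software stand-in for the state seen by one programmable atomic operation."""

    def __init__(self, second_memory, hazard_bits, address):
        self.memory = second_memory      # models the second memory circuit
        self.hazard_bits = hazard_bits   # set of addresses with a hazard bit set
        self.address = address           # address named in the atomic request
        self.data_buffer = {}            # operands already held locally
        self.read_requests = 0           # read requests issued to the controller
        self.response = None             # payload of the response data packet

    def load_non_buffered(self, address):
        # Use the data buffer when the operand is present, else issue a read.
        if address not in self.data_buffer:
            self.read_requests += 1
            self.data_buffer[address] = self.memory[address]
        return self.data_buffer[address]

    def store_and_clear_lock(self, value):
        # Write the result, then clear the hazard bit for the line.
        self.memory[self.address] = value
        self.hazard_bits.discard(self.address)

    def atomic_return(self, value):
        # Complete the operation: ensure the lock is clear and record the response.
        self.hazard_bits.discard(self.address)
        self.response = value


def saturating_add(ctx, operand, limit):
    """Hypothetical user-defined atomic: add with an upper bound, return old value."""
    old = ctx.load_non_buffered(ctx.address)
    ctx.store_and_clear_lock(min(old + operand, limit))
    ctx.atomic_return(old)


memory = {0x80: 250}
hazards = {0x80}                      # hazard bit set when the request arrived
ctx = AtomicContext(memory, hazards, 0x80)
saturating_add(ctx, 10, 255)
print(memory[0x80], ctx.response, hazards, ctx.read_requests)
```

A saturating add is used here only because it is not among the predetermined operations listed earlier, which is the kind of gap a user-defined operation would fill.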
- the atomic operation request designating the at least one programmable atomic operation comprises a physical memory address, a
- the programmable atomic operations circuitry may further comprise at least one register storing thread state information.
- the programmable atomic operations circuitry may be further adapted, in response to receiving the atomic operation request designating the at least one programmable atomic operation, to initialize the at least one register with the physical memory address, any data corresponding to the memory address, and the at least one thread state register value.
- the memory controller circuit may further comprise a network communication interface coupleable to a communication network and coupled to the first memory control circuit and to the second memory control circuit, the network communication interface adapted to decode a plurality of request packets received from the communication network, and to prepare and transmit a plurality of response data packets on the communication network.
- The programmable atomic operations circuitry is adapted to perform user-defined atomic operations, multi-cycle operations, floating point operations, and multi-instruction operations.
- the memory controller circuit may further comprise a write merge circuit adapted to write or store data read from the first memory circuit to the second memory circuit.
- the second memory control circuit is further adapted to read or load the requested data from the second memory circuit in response to an atomic operation request when the requested data is stored in the second memory circuit, and to transfer the atomic operation request to the first memory control circuit when the requested data is not stored in the second memory circuit.
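The hit and miss paths described above can be sketched in Python as a two-level lookup. The `TwoLevelMemory` class is a hypothetical model: both memory circuits are plain dictionaries, capacity and eviction are ignored, and the fill on a miss stands in for the role the text assigns to the write merge circuit.

```python
class TwoLevelMemory:
    """Second memory serves hits; misses are transferred to the first memory."""

    def __init__(self, backing):
        self.first = dict(backing)   # stands in for the first memory circuit
        self.second = {}             # stands in for the second memory circuit
        self.forwarded = 0           # requests transferred to the first level

    def read(self, address):
        if address in self.second:   # requested data is in the second memory
            return self.second[address]
        self.forwarded += 1          # miss: hand the request to the first level
        data = self.first[address]
        self.second[address] = data  # keep a copy for later accesses
        return data


mem = TwoLevelMemory({0x10: 7})
a = mem.read(0x10)   # miss, forwarded to the first memory
b = mem.read(0x10)   # hit, served from the second memory
print(a, b, mem.forwarded)
```

Repeated accesses to the same address reach the first memory only once in this model, which is the source of the reduced latency for repetitively accessed locations.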
- the second memory control circuit is further adapted to write or store data to the second memory circuit in response to a write request or in response to an atomic operation request.
- the second memory control circuit is further adapted, in response to a write request designating a memory address in the second memory circuit, to set a hazard bit stored in a memory hazard register corresponding to the memory address and, following writing or storing data to the second memory circuit at the memory address, to reset or clear the set hazard bit.
- the second memory control circuit is further adapted, in response to a write request having write data and designating a memory address in the second memory circuit, to transfer current data stored at the memory address to the first memory control circuit to write the current data to the first memory circuit, and to overwrite the current data in the second memory circuit with the write data.
- the second memory control circuit is further adapted, in response to a write request having write data and designating a memory address in the second memory circuit, to set a hazard bit stored in a memory hazard register corresponding to the memory address, to transfer current data stored at the memory address to the first memory control circuit to write the current data to the first memory circuit, to overwrite the current data in the second memory circuit with the write data and, following writing or storing the write data to the second memory circuit at the memory address, to reset or clear the set hazard bit.
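Read literally, the write sequence above has four steps: set the hazard bit, transfer the current data to the first memory, overwrite the second memory, and clear the hazard bit. The Python sketch below follows that order. The `cached_write` function and its step labels are hypothetical, and both memories and the hazard register are modeled as simple containers keyed by address.

```python
def cached_write(first, second, hazard_bits, address, write_data):
    """Perform the write sequence and return the ordered list of steps taken."""
    steps = []
    hazard_bits.add(address)                  # lock the address during the write
    steps.append("set hazard")
    if address in second:
        first[address] = second[address]      # preserve current data in first memory
        steps.append("write back current data")
    second[address] = write_data              # overwrite with the new write data
    steps.append("overwrite")
    hazard_bits.discard(address)              # release the address
    steps.append("clear hazard")
    return steps


first = {0x20: 1}
second = {0x20: 2}
hazards = set()
steps = cached_write(first, second, hazards, 0x20, 3)
print(steps, first[0x20], second[0x20], hazards)
```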
- the second memory control circuit is further adapted, in response to an atomic operation request designating the at least one programmable atomic operation and a memory address, to transfer the atomic operation request to the programmable atomic operations circuitry and to set a hazard bit stored in a memory hazard register corresponding to the memory address.
- the second memory control circuit is further adapted, in response to an atomic operation request designating the at least one predetermined atomic operation and a memory address, to transfer the atomic operation request to the predetermined atomic operations circuitry, to set a hazard bit stored in a memory hazard register corresponding to the memory address, to write resulting data from the predetermined atomic operation in the second memory circuit and, following writing of the resulting data, to reset or clear the set hazard bit.
- the first memory control circuit may comprise: a plurality of memory bank request queues storing a plurality of read or write requests to the first memory circuit; a scheduler circuit coupled to the plurality of memory bank request queues, the scheduler adapted to select a read or write request of the plurality of read or write requests from the plurality of memory bank request queues and to schedule the read or write request for access to the first memory circuit; and a first memory access control circuit coupled to the scheduler, the first memory access control circuit adapted to read or load data from the first memory circuit and to write or store data to the first memory circuit.
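A minimal Python sketch of per-bank request queues feeding a scheduler follows. The text above does not state the selection policy or the address-to-bank mapping, so round-robin selection and `address % num_banks` are assumptions made purely for illustration, as are the `BankScheduler` and `drain` names.

```python
from collections import deque


class BankScheduler:
    """Per-bank request queues with an assumed round-robin selection policy."""

    def __init__(self, num_banks=4):
        self.queues = [deque() for _ in range(num_banks)]
        self._next = 0   # bank to consider first on the next selection

    def enqueue(self, request):
        kind, address, data = request
        # Assumed mapping: low-order address bits pick the bank.
        self.queues[address % len(self.queues)].append(request)

    def select(self):
        """Return the next request, or None when every queue is empty."""
        n = len(self.queues)
        for offset in range(n):
            bank = (self._next + offset) % n
            if self.queues[bank]:
                self._next = (bank + 1) % n
                return self.queues[bank].popleft()
        return None


def drain(scheduler, memory):
    """Apply all queued requests to `memory`; return the read results in order."""
    results = []
    request = scheduler.select()
    while request is not None:
        kind, address, data = request
        if kind == "write":
            memory[address] = data
        else:
            results.append(memory[address])
        request = scheduler.select()
    return results


memory = {1: 5}
sched = BankScheduler(num_banks=2)
for req in [("write", 0, 11), ("write", 2, 22), ("read", 1, None), ("read", 0, None)]:
    sched.enqueue(req)
results = drain(sched, memory)
print(results, memory)
```

Because requests to one bank stay in arrival order, the read of address 0 observes the earlier write to address 0 even though requests to the other bank are interleaved.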
- the first memory control circuit may further comprise: a plurality of memory request queues storing a plurality of memory requests; a request selection multiplexer to select a memory request from the plurality of memory request queues; a plurality of memory data queues storing data corresponding to the plurality of memory requests; and a data selection multiplexer to select data from the plurality of memory data queues, the selected data corresponding to the selected memory request.
- The second memory control circuit may comprise: a network request queue storing a read request or a write request; an atomic operation request queue storing an atomic operation request; an inbound request multiplexer coupled to the network request queue and to the atomic operation request queue to select a request from the network request queue or the atomic operation request queue; a memory hazard control circuit having one or more memory hazard registers; and a second memory access control circuit coupled to the memory hazard control circuit and to the inbound request multiplexer, the second memory access control circuit adapted to read or load data from the second memory circuit or to write or store data to the second memory circuit in response to the selected request, and to signal the memory hazard control circuit to set or clear a hazard bit stored in the one or more memory hazard registers.
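The selection between the network request queue and the atomic operation request queue can be sketched in Python as below. The arbitration policy is not given in the text, so the `prefer_atomic` flag is an assumption, as is the choice to skip a queue whose head request targets an address with its hazard bit set. The `select_request` name and the dictionary request format are hypothetical.

```python
from collections import deque


def select_request(network_queue, atomic_queue, hazard_bits, prefer_atomic=True):
    """Return the next request whose address has no hazard bit set, or None."""
    if prefer_atomic:
        order = (atomic_queue, network_queue)
    else:
        order = (network_queue, atomic_queue)
    for queue in order:
        # Only the head of each queue is eligible, preserving per-queue order.
        if queue and queue[0]["address"] not in hazard_bits:
            return queue.popleft()
    return None


network = deque([{"kind": "read", "address": 0x10}])
atomic = deque([{"kind": "atomic", "address": 0x20}])
hazards = {0x20}                                   # line 0x20 is currently locked

r1 = select_request(network, atomic, hazards)      # atomic head blocked: read wins
hazards.clear()                                    # the earlier operation finished
r2 = select_request(network, atomic, hazards)      # atomic request now proceeds
r3 = select_request(network, atomic, hazards)      # both queues are empty
print(r1, r2, r3)
```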
- the second memory control circuit may further comprise: a delay circuit coupled to the second memory access control circuit; and an inbound control multiplexer to select an inbound network request which requires accessing the first memory circuit or to select a cache eviction request from the second memory circuit when a cache line of the second memory circuit contains data which is to be written to the first memory circuit prior to being overwritten by data from a read request or a write request.
- The memory controller circuit may be coupled to a communication network for routing of a plurality of read data request packets, a plurality of write data request packets, a plurality of predetermined atomic operations request packets, and a plurality of programmable atomic operation request packets to the memory controller circuit, and routing of a plurality of response data packets from the memory controller circuit to a request source address.
- the programmable atomic operations circuitry may comprise: a processor circuit coupled to the first memory control circuit through an unswitched, direct communication bus.
- the first memory control circuit, the second memory circuit, the second memory control circuit, the predetermined atomic operations circuitry, and the programmable atomic operations circuitry may be embodied as a single integrated circuit or as a single system-on-a-chip (SOC).
- The first memory control circuit, the second memory circuit, the second memory control circuit, and the predetermined atomic operations circuitry may be embodied as a first integrated circuit, and the programmable atomic operations circuitry may be embodied as a second integrated circuit coupled through an unswitched, direct communication bus to the first integrated circuit.
- the programmable atomic operations circuitry is adapted to generate a read request and generate a write request to the second memory circuit.
- the programmable atomic operations circuitry is adapted to perform arithmetic operations, logic operations, and control flow decisions.
- the first memory circuit comprises dynamic random access memory (DRAM) circuitry and the second memory circuit comprises static random access memory (SRAM) circuitry.
- a representative method of using a memory controller circuit to perform a programmable atomic operation is also disclosed, with the memory controller circuit coupleable to a first memory circuit, with the method comprising: using a first memory control circuit coupleable to the first memory circuit, reading or loading requested data from the first memory circuit in response to a read request and writing or storing requested data to the first memory circuit in response to a write request; using a second memory control circuit coupled to a second memory circuit, reading or loading the requested data from the second memory circuit in response to a read request when the requested data is stored in the second memory circuit, and transferring the read request to the first memory control circuit when the requested data is not stored in the second memory circuit; using predetermined atomic operations circuitry, performing at least one predetermined atomic operation of a plurality of predetermined atomic operations in response to an atomic operation request designating the at least one predetermined atomic operation; and using programmable atomic operations circuitry, performing at least one programmable atomic operation of a plurality of programmable atomic operations in response to an atomic operation request designating the at least one programmable atomic operation.
- Another representative method of using a memory controller circuit to perform a programmable atomic operation is also disclosed, with the memory controller circuit coupleable to a first memory circuit, with the method comprising: using a first memory control circuit coupleable to the first memory circuit, reading or loading requested data from the first memory circuit in response to a read request and writing or storing requested data to the first memory circuit in response to a write request; using a second memory control circuit coupled to a second memory circuit, reading or loading the requested data from the second memory circuit in response to a read request when the requested data is stored in the second memory circuit, and transferring the read request to the first memory control circuit when the requested data is not stored in the second memory circuit, and in response to an atomic operation request designating the at least one programmable atomic operation and a memory address, transferring the atomic operation request to programmable atomic operations circuitry and setting a hazard bit stored in a memory hazard register corresponding to the memory address; using predetermined atomic operations circuitry, performing at least one predetermined atomic operation of a plurality of predetermined atomic operations in response to an atomic operation request designating the at least one predetermined atomic operation; and using the programmable atomic operations circuitry, performing at least one programmable atomic operation of a plurality of programmable atomic operations in response to an atomic operation request designating the at least one programmable atomic operation.
- the programmable atomic operations circuitry comprises a processor core coupled to a data buffer, wherein the method may further comprise: using the processor core, executing a load non-buffered instruction to determine if an operand is stored in the data buffer and, when the data is not stored in the data buffer, generating a read request to the second memory control circuit.
- the programmable atomic operations circuitry comprises a processor core, and wherein the method may further comprise: using the processor core, executing a store and clear lock instruction to generate an atomic write request to the second memory control circuit, the atomic write request having the resulting data and a designation to reset or clear a memory hazard bit following writing of the resulting data to the second memory circuit.
- the programmable atomic operations circuitry comprises a processor core, wherein the method may further comprise: using the processor core, executing an atomic return instruction to reset or clear a memory hazard bit following writing of the resulting data to the second memory circuit.
- the programmable atomic operations circuitry comprises a processor core, and wherein the method may further comprise: using the processor core, executing an atomic return instruction to generate a response data packet having the resulting data.
- the programmable atomic operations circuitry comprises a processor core, and wherein the method may further comprise: using the processor core, executing an atomic return instruction to complete an atomic operation.
- the atomic operation request designating the at least one programmable atomic operation comprises a physical memory address, a
- the programmable atomic operations circuitry further comprises at least one register storing thread state information
- the method may further comprise: using the programmable atomic operations circuitry, in response to receiving the atomic operation request designating the at least one programmable atomic operation, initializing the at least one register with the physical memory address, any data corresponding to the memory address, and the at least one thread state register value.
- the method may further comprise: using the second memory control circuit, reading or loading the requested data from the second memory circuit in response to an atomic operation request when the requested data is stored in the second memory circuit, and transferring the atomic operation request to the first memory control circuit when the requested data is not stored in the second memory circuit.
- the method may further comprise: using the second memory control circuit, in response to a write request designating a memory address in the second memory circuit, setting a hazard bit stored in a memory hazard register corresponding to the memory address and, following writing or storing data to the second memory circuit at the memory address, resetting or clearing the set hazard bit.
- the method may further comprise: using the second memory control circuit, in response to a write request having write data and designating a memory address in the second memory circuit, transferring current data stored at the memory address to the first memory control circuit to write the current data to the first memory circuit, and overwriting the current data in the second memory circuit with the write data.
- the method may further comprise: using the second memory control circuit, in response to a write request having write data and designating a memory address in the second memory circuit, setting a hazard bit stored in a memory hazard register corresponding to the memory address, transferring current data stored at the memory address to the first memory control circuit to write the current data to the first memory circuit, overwriting the current data in the second memory circuit with the write data and, following writing or storing the write data to the second memory circuit at the memory address, resetting or clearing the set hazard bit.
- the method may further comprise: using the second memory control circuit, in response to an atomic operation request designating the at least one programmable atomic operation and a memory address, transferring the atomic operation request to the programmable atomic operations circuitry and setting a hazard bit stored in a memory hazard register corresponding to the memory address.
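The write sequence described in the preceding items (set a hazard bit, transfer the current data toward the first memory circuit, overwrite the line, then clear the hazard bit) can be illustrated with a minimal Python sketch. The class names `HazardRegisters` and `SecondMemoryControl` are hypothetical, one hazard bit per address is an assumption, and the dictionaries merely stand in for the first and second memory circuits; this is an illustration of the ordering only, not the disclosed circuitry.

```python
class HazardRegisters:
    """Hypothetical model: one hazard bit per memory address (assumption)."""

    def __init__(self):
        self._bits = set()

    def set(self, addr):
        self._bits.add(addr)

    def clear(self, addr):
        self._bits.discard(addr)

    def is_set(self, addr):
        return addr in self._bits


class SecondMemoryControl:
    """Toy model of a write request guarded by a hazard bit."""

    def __init__(self):
        self.hazards = HazardRegisters()
        self.second_memory = {}  # stands in for the second memory circuit
        self.first_memory = {}   # stands in for the first memory circuit
        self.trace = []          # order of steps, kept for inspection

    def write(self, addr, data):
        # 1. Set the hazard bit before touching the addressed line.
        self.hazards.set(addr)
        self.trace.append("set")
        # 2. Transfer any current data so it is written to the first memory.
        if addr in self.second_memory:
            self.first_memory[addr] = self.second_memory[addr]
            self.trace.append("evict")
        # 3. Overwrite the current data with the write data.
        self.second_memory[addr] = data
        self.trace.append("write")
        # 4. Clear the hazard bit once the write has completed.
        self.hazards.clear(addr)
        self.trace.append("clear")
```

In this sketch a second write to the same address pushes the earlier value out to `first_memory` before the new value replaces it, and the hazard bit is never left set after `write` returns.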
- Another memory controller is disclosed, the memory controller coupleable to a first memory circuit, with the memory controller comprising: a first memory control circuit coupleable to the first memory circuit, the first memory control circuit comprising: a plurality of memory bank request queues storing a plurality of read or write requests to the first memory circuit; a scheduler circuit coupled to the plurality of memory bank request queues, the scheduler adapted to select a read or write request of the plurality of read or write requests from the plurality of memory bank request queues and to schedule the read or write request for access to the first memory circuit; and a first memory access control circuit coupled to the scheduler, the first memory access control circuit adapted to read or load data from the first memory circuit and to write or store data to the first memory circuit; a second memory circuit; predetermined atomic operations circuitry adapted to perform at least one predetermined atomic operation of a plurality of predetermined atomic operations; and programmable atomic operations circuitry adapted to perform at least one programmable atomic operation of a plurality of programmable
- Another memory controller is disclosed, the memory controller coupleable to a first memory circuit, with the memory controller comprising: a first memory control circuit coupleable to the first memory circuit, the first memory control circuit comprising: a plurality of memory bank request queues storing a plurality of read or write requests to the first memory circuit; a scheduler circuit coupled to the plurality of memory bank request queues, the scheduler adapted to select a read or write request of the plurality of read or write requests from the plurality of memory bank request queues and to schedule the read or write request for access to the first memory circuit; and a first memory access control circuit coupled to the scheduler, the first memory access control circuit adapted to read or load data from the first memory circuit and to write or store data to the first memory circuit; a second memory circuit; predetermined atomic operations circuitry adapted to perform at least one predetermined atomic operation of a plurality of predetermined atomic operations and programmable atomic operations circuitry adapted to perform at least one programmable atomic operation of a plurality of programmable atomic operations; and a second memory control circuit coupled to the second memory circuit, the second memory control circuit comprising: at least one input request queue storing a read or write request; a memory hazard control circuit having memory hazard registers; and a second memory access control circuit adapted to read or load data from the second memory circuit and to write or store data to the second memory circuit, the second memory access control circuit further adapted, in response to an atomic operation request designating the at least one predetermined atomic operation and a memory address, to transfer the atomic operation request to the predetermined atomic operations circuitry, to set a hazard bit stored in a memory hazard register corresponding to the memory address, to write resulting data from the predetermined atomic operation in the second memory circuit and, following writing of the resulting data, to reset or clear the set hazard bit.
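As a rough illustration of a scheduler selecting requests from a plurality of memory bank request queues, the following Python sketch may help. The text does not state a selection policy or a bank mapping, so the round-robin policy, the `address % num_banks` mapping, and the names `BankScheduler`, `enqueue`, and `select` are all assumptions made for the example.

```python
from collections import deque


class BankScheduler:
    """Hypothetical round-robin scheduler over per-bank request queues."""

    def __init__(self, num_banks):
        self.queues = [deque() for _ in range(num_banks)]
        self._next = 0  # bank to examine first on the next selection

    def enqueue(self, address, op):
        # Assumed mapping: low-order address interleaving across banks.
        bank = address % len(self.queues)
        self.queues[bank].append((op, address))

    def select(self):
        """Return the next (op, address) request, or None if all queues are empty."""
        n = len(self.queues)
        for i in range(n):
            bank = (self._next + i) % n
            if self.queues[bank]:
                self._next = (bank + 1) % n
                return self.queues[bank].popleft()
        return None
```

With two banks, requests to addresses 0, 2, and 1 would be selected in the order 0, 1, 2 under this assumed policy, because the scheduler alternates between banks rather than draining one queue first.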
- Figure (or “FIG.”) 1 is a block diagram of a representative first computing system embodiment.
- Figure (or “FIG.”) 2 is a block diagram of a representative second computing system embodiment.
- Figure (or “FIG.”) 3 is a high-level block diagram of representative first and second memory controller circuits.
- Figure (or “FIG.”) 4 is a block diagram of a representative first memory controller circuit embodiment.
- Figure (or “FIG.”) 5 is a block diagram of a representative second memory controller circuit embodiment.
- Figures (or “FIGs.”) 6A, 6B, and 6C are block diagrams of, respectively, a representative second memory control circuit embodiment, a representative first memory control circuit embodiment, and a representative atomic and merge operations circuit.
- Figures (or “FIGs.”) 7A, 7B, and 7C are flow charts of a representative method of receiving and decoding a request and performing a read or load request, with FIGs. 7A and 7B showing a representative method of receiving and decoding a request and of performing a read or load request from a first memory circuit, and FIG. 7C showing a representative method of performing a read or load request from a second memory circuit.
- Figures (or “FIGs.”) 8A, 8B, 8C, and 8D are flow charts showing a representative method of performing an atomic operation as part of an atomic operation request.
- Figure (or “FIG.”) 9 is a flow chart showing a representative method of performing a data eviction from the second memory circuit as part of a read (or load) request or as part of a write (or store) request.
- Figure (or “FIG.”) 10 is a flow chart of a representative method of performing a write or store request.
- Figure (or “FIG.”) 11 is a block diagram of a representative programmable atomic operations circuitry embodiment.
DETAILED DESCRIPTION OF REPRESENTATIVE EMBODIMENTS
- FIG. 1 is a block diagram of a representative first computing system 50 embodiment.
- FIG. 2 is a block diagram of a representative second computing system 50A embodiment.
- FIG. 3 is a high-level block diagram of representative first and second memory controller circuits.
- FIG. 4 is a block diagram of a representative first memory controller circuit 100 embodiment.
- FIG. 5 is a block diagram of a representative second memory controller circuit 100A embodiment.
- FIG. 6, illustrated as FIGs. 6A, 6B, and 6C, are block diagrams of, respectively, a representative second memory control circuit embodiment, a representative first memory control circuit embodiment, and a representative atomic and merge operations circuit.
- FIGs. 1 and 2 show different first and second computing system 50, 50A embodiments which include additional components forming comparatively larger and smaller systems 50, 50A, any and all of which are within the scope of the disclosure.
- a computing system 50, 50A, in various combinations as illustrated, may include one or more processors 110, a communication network 150, optionally one or more hybrid threading processors (“HTPs”) 115, optionally one or more configurable processing circuits 105, optionally one or more communication interfaces 130, a first memory controller circuit 100 in the first computing system 50 or a second memory controller circuit 100A in the second computing system 50A, and in both first and second computing systems 50, 50A, a first memory circuit 125 which is coupled, respectively, to either the first memory controller circuit 100 or the second memory controller circuit 100A.
- the first memory controller circuit 100 differs from the second memory controller circuit 100A insofar as the first memory controller circuit 100 further includes programmable atomic operations circuitry 135 as an integrated device, i.e., the first memory controller circuit 100 comprises all of the functionality and circuitry of a second memory controller circuit 100A, and further comprises programmable atomic operations circuitry 135.
- a processor 110, 110A comprises programmable atomic operations circuitry 135 and other, additional circuitry, such as network communication interface circuitry 170 or other or additional communication and processing circuitry, for example and without limitation.
- the programmable atomic operations circuitry 135 is utilized for performance of programmable atomic operations.
- those programmable atomic operations are performed within the programmable atomic operations circuitry 135 of the first memory controller circuit 100.
- those programmable atomic operations are performed in conjunction with the programmable atomic operations circuitry 135 of the separate processor 110A.
- in the second computing system 50A, the second memory controller circuit 100A is directly coupled, such as through a separate bus structure 60, to a processor 110A, either as separate integrated circuits or as separate chiplets, for example and without limitation.
- a processor 110A may be implemented to be identical to a processor 110, or may be implemented as a different or simpler processor designed to mostly or only implement programmable atomic operations.
- the processor 110A is illustrated separately solely to illustrate that the second memory controller circuit 100A has a direct, rather than switched or routed, communication path to and from the processor 110A.
- a processor 110 may be utilized to implement a processor 110A, with the processor 110A additionally provided with the direct communication path (e.g., bus 60) to the second memory controller circuit 100A.
- the first memory controller circuit 100 differs from the second memory controller circuit 100A only insofar as the first memory controller circuit 100 includes the additional circuitry and functionality of programmable atomic operations circuitry 135 as an integrated device, such as within a single integrated circuit or as part of an SOC, whereas a second memory controller circuit 100A communicates directly with programmable atomic operations circuitry 135 which is part of a separate processor 110A, as illustrated in FIG. 3.
- the first memory controller circuit 100 comprises all of the identical circuitry and functionality of a second memory controller circuit 100A and further comprises the additional programmable atomic operations circuitry 135.
- a processor 110, 110A is typically a multi-core processor, which may be embedded within the first or second computing system 50, 50A, or which may be an external processor coupled into the first or second computing system 50, 50A via a communication interface 130, such as a PCIe-based interface.
- Such a processor may be implemented as is known or becomes known in the electronic arts, and as described in greater detail below.
- the communication interface 130, such as a PCIe-based interface, may be implemented as is known or becomes known in the electronic arts, and provides communication between the system 50, 50A and another, external device.
- the programmable atomic operations circuitry 135 of a first memory controller circuit 100 or of a processor 110, 110A may be a RISC-V ISA-based multi-threaded processor having one or more processor cores 605, for example, and further having an extended instruction set for executing programmable atomic operations, as discussed in greater detail below with reference to FIG. 11.
- representative programmable atomic operations circuitry 135 and/or processors 110, 110A may be embodied as one or more hybrid threading processor(s) 115 described in U.S. Patent Application No.
- the programmable atomic operations circuitry 135 of a first memory controller circuit 100 or of a processor 110, 110A provides barrel-style, round-robin instantaneous thread switching to maintain a high instruction-per-clock rate.
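Barrel-style, round-robin thread switching can be pictured as issuing one instruction per cycle from each ready thread in turn. The Python sketch below is only a software analogy under that assumption; the function name `barrel_schedule` and the representation of a thread as a list of instruction labels are hypothetical and are not taken from the disclosure.

```python
from collections import deque


def barrel_schedule(threads, cycles):
    """Issue at most one instruction per cycle, rotating across ready threads.

    threads: mapping of thread name to a list of instruction labels.
    cycles:  maximum number of issue cycles to simulate.
    Returns the list of (thread name, instruction) pairs in issue order.
    """
    ready = deque(
        (name, deque(instrs)) for name, instrs in threads.items() if instrs
    )
    issued = []
    for _ in range(cycles):
        if not ready:
            break
        name, instrs = ready.popleft()
        issued.append((name, instrs.popleft()))
        if instrs:
            # Thread still has work: it goes to the back of the rotation.
            ready.append((name, instrs))
    return issued
```

Because a different thread issues on each cycle, a thread waiting on a long-latency operation need not stall the pipeline as long as other threads remain ready, which is the usual motivation for this style of scheduling.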
- the communication network 150 also may be implemented as is known or becomes known in the electronic arts.
- the communication network 150 is a packet-based communication network providing data packet routing between and among the processor(s) 110, 110A, the first or second memory controller circuits 100, 100A, optionally one or more hybrid threading processors 115, optionally one or more configurable processing circuits 105, and optionally one or more communication interfaces 130.
- each packet typically includes destination and source addressing, along with any data payload and/or instruction.
- first or second memory controller circuits 100, 100A may receive a packet having a source address, a read (or load) request, and a physical address in the first memory circuit 125.
- the first or second memory controller circuits 100, 100A will read the data from the specified address (which may be in the first memory circuit 125 or in second memory circuit 175, as discussed below), and assemble a response packet to the source address containing the requested data.
- first or second memory controller circuits 100, 100A may receive a packet having a source address, a write (or store) request, and a physical address in the first memory circuit 125.
- the first or second memory controller circuits 100, 100A will write the data to the specified address (which may be in the first memory circuit 125 or in second memory circuit 175, as discussed below), and assemble a response packet to the source address containing an acknowledgement that the data was stored to a memory (which may be in the first memory circuit 125 or in second memory circuit 175, as discussed below).
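The request and response exchange described above can be sketched in Python as follows. The packet field names, the `"read"`/`"write"` request labels, the response labels, and the `handle` function are hypothetical choices for illustration; the text only states that packets carry source and destination addressing, a request, a physical address, and any data payload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestPacket:
    source: int                     # address of the requestor
    kind: str                       # "read" or "write" (assumed labels)
    address: int                    # physical memory address
    payload: Optional[bytes] = None


@dataclass
class ResponsePacket:
    destination: int                # set to the request's source address
    kind: str                       # "read_data" or "write_ack" (assumed labels)
    payload: Optional[bytes] = None


def handle(packet, memory):
    """Serve one request against a dict standing in for the memory."""
    if packet.kind == "read":
        # Read the data and return it to the request's source address.
        return ResponsePacket(packet.source, "read_data", memory.get(packet.address))
    if packet.kind == "write":
        # Store the data and acknowledge the write to the source address.
        memory[packet.address] = packet.payload
        return ResponsePacket(packet.source, "write_ack")
    raise ValueError(f"unsupported request kind: {packet.kind!r}")
```

Here the single `memory` dictionary abstracts over whether the data resides in the first memory circuit 125 or the second memory circuit 175, since the requestor sees the same response in either case.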
- the communication network 150 may be embodied as a plurality of crossbar switches having a folded Clos configuration, and/or a mesh network providing for additional connections, depending upon the system 50, 50A
- the communication network 150 may be part of an asynchronous switching fabric, meaning that a data packet may be routed along any of various paths, such that the arrival of any selected data packet at an addressed destination may occur at any of a plurality of different times, depending upon the routing.
- the communication network 150 may be implemented as a synchronous communication network, such as a synchronous mesh communication network. Any and all such communication networks 150 are considered equivalent and within the scope of the disclosure.
- a representative embodiment of a communication network 150 is also described in U.S. Patent Application No. 16/176,434.
- the optional one or more hybrid threading processors 115 and one or more configurable processing circuits 105 are discussed in greater detail in various related applications, such as U.S. Patent Application No. 16/176,434, and are illustrated to provide examples of the various components which may be included within a computing system 50, 50A.
- a first memory controller circuit 100 is coupled to a first memory circuit 125, such as for write (store) operations and read (load) operations to and from the first memory circuit 125.
- the first memory controller circuit 100 comprises a first memory control circuit 155, a second memory control circuit 160, atomic and merge operation circuits 165, a second memory circuit 175, and a network communication interface 170.
- the network communication interface 170 is coupled to the communication network 150, such as via bus or other communication structures 163, which typically include address (routing) lines and data payload lines (not separately illustrated).
- the first memory control circuit 155 is directly coupled to the first memory 125, such as via a bus or other communication structure 157, to provide write (store) operations and read (load) operations to and from the first memory circuit 125.
- the first memory control circuit 155 is also coupled for output to the atomic and merge operation circuits 165 and, for input, to the second memory control circuit 160.
- the second memory control circuit 160 is directly coupled to the second memory circuit 175, such as via a bus or other communication structure 159, coupled to the network communication interface 170 for input (such as incoming read or write requests), such as via a bus or other communication structure 161, and coupled for output to the first memory control circuit 155.
- the second memory circuit 175 is typically part of the same integrated circuit having the first or second memory controller circuit 100, 100A.
- the atomic and merge operation circuits 165 are coupled to receive (as input) the output of the first memory control circuit 155, and to provide output to the second memory circuit 175, the network communication interface 170 and/or directly to the communication network 150.
- a second memory controller circuit 100A is coupled to a first memory circuit 125, such as for write (store) operations and read (load) operations to and from the first memory circuit 125, and to a processor 110A.
- the second memory controller circuit 100A comprises a first memory control circuit 155, a second memory control circuit 160, atomic and merge operation circuits 165A, a second memory circuit 175, and a network communication interface 170.
- the network communication interface 170 is coupled to the communication network 150, such as via bus or other communication structures 163, which typically include address (routing) lines and data payload lines (not separately illustrated).
- the first memory control circuit 155 is directly coupled to the first memory 125, such as via a bus or other communication structure 157, to provide write (store) operations and read (load) operations to and from the first memory circuit 125.
- the first memory control circuit 155 is also coupled for output to the atomic and merge operation circuits 165A and, for input, to the second memory control circuit 160.
- the second memory control circuit 160 is directly coupled to the second memory circuit 175, such as via a bus or other communication structure 159, coupled to the network communication interface 170 for input (such as incoming read or write requests), such as via a bus or other communication structure 161, and coupled for output to the first memory control circuit 155.
- the atomic and merge operation circuits 165A are coupled to receive (as input) the output of the first memory control circuit 155, and to provide output to the second memory circuit 175, the network communication interface 170 and/or directly to the communication network 150.
- the first and second memory controller circuits 100, 100A differ insofar as the first memory controller circuit 100 includes programmable atomic operations circuitry 135 (in atomic and merge operation circuits 165), which is coupled to the first memory control circuit 155 through bus or communication lines 60A, and the second memory controller circuit 100A is coupled to programmable atomic operations circuitry 135 in a separate processor 110A, coupled to the first memory control circuit 155 through bus or communication lines 60.
- the atomic and merge operation circuits 165 comprise a memory hazard clear (reset) circuit 190, a write merge circuit 180, predetermined atomic operations circuitry 185, and programmable atomic operations circuitry 135, and in the second memory controller circuit 100A, the atomic and merge operation circuits 165A comprise a memory hazard clear (reset) circuit 190, a write merge circuit 180 and predetermined atomic operations circuitry 185.
- the memory hazard clear (reset) circuit 190, write merge circuit 180 and the predetermined atomic operations circuitry 185 may each be implemented as state machines with other combinational logic circuitry (such as adders (and subtracters), shifters, comparators, AND gates, OR gates, XOR gates, etc.) or other logic circuitry, and may also include one or more registers or buffers to store operand or other data, for example.
- the programmable atomic operations circuitry 135 may be implemented as one or more processor cores and control circuitry, and various state machines with other combinational logic circuitry (such as adders, shifters, etc.) or other logic circuitry, and may also include one or more registers, buffers, and/or memories to store addresses, executable instructions, operand and other data, for example, or may be implemented as a processor 110, or a processor more generally (as described below).
- the memory hazard clear (reset) circuit 190 is not required to be a separate circuit in the atomic and merge operation circuits 165, 165A and instead may be part of the memory hazard control circuit 230.
- the network communication interface 170 includes network input queues 205 to receive data packets (including read and write request packets) from the communication network 150; network output queues 210 to transfer data packets (including read and write response packets) to the communication network 150; a data packet decoder circuit 215 to decode incoming data packets from the communication network 150, to take data (in designated fields, such as request type, source address, and payload data) and transfer the data provided in the packet to the second memory control circuit 160; and data packet encoder circuit 220 to encode outgoing data packets (such as responses to requests to the first memory circuit 125), for transmission on the communication network 150.
- the data packet decoder circuit 215 and the data packet encoder circuit 220 may each be implemented as state machines or other logic circuitry.
- the first memory circuit 125 and the second memory circuit 175 may be any type or kind of memory circuit, as discussed in greater detail below, such as, for example and without limitation, RAM, FLASH, DRAM, SDRAM, SRAM, MRAM, FeRAM, ROM, EPROM or E²PROM, or any other form of memory device.
- the first memory circuit 125 is DRAM, typically an external DRAM memory device
- the second memory circuit 175 is an SRAM data cache.
- the first memory circuit 125 may be a separate integrated circuit in its own packaging, or a separate integrated circuit which may be included in packaging with the first and second memory controller circuits 100, 100A, such as by sharing a common interposer.
- first memory circuits 125 may be optionally included.
- the first memory circuit 125 may be a Micron GDDR6 memory IC or a Micron NGM memory IC (Micron’s next generation DRAM device), currently available from Micron Technology, Inc., 8000 S. Federal Way,
- Such a GDDR6 device is a JEDEC standard with 16 Gb density, and a peak 64 GB/s per device.
- the second memory circuit 175 (e.g., an SRAM cache) is a memory side cache and is accessed by physical addresses.
- the second memory circuit 175 may be 1 MB in size with 256B line size.
- the 256B line size is chosen to minimize the reduction in achievable bandwidth due to ECC support. Larger line size is possible based on application simulations. Having a memory line size of 256B has the benefit of reducing energy as compared to smaller line sizes, assuming the majority of the accessed second memory circuit 175 is eventually used.
- the requests from the communication network 150 will access the second memory circuit 175 with accesses sized from a single byte up to 64 bytes.
- the tags of the second memory circuit 175 (e.g., an SRAM cache) should be able to handle partial line reads and writes.
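The cache geometry stated above (1 MB with 256 B lines, accessed in sizes from 1 to 64 bytes) implies some simple arithmetic, sketched below in Python. Several points are assumptions rather than statements from the text: 1 MB is taken as 2^20 bytes, the cache is treated as direct-mapped for the purpose of splitting an address, and a per-byte valid mask is shown as one possible way for tags to track partial line reads and writes. The names `split_address` and `byte_mask` are hypothetical.

```python
LINE_SIZE = 256                           # bytes per cache line (from the text)
CACHE_SIZE = 1 << 20                      # 1 MB, assumed to mean 2**20 bytes
NUM_LINES = CACHE_SIZE // LINE_SIZE       # number of lines in the cache
OFFSET_BITS = LINE_SIZE.bit_length() - 1  # log2(256)
INDEX_BITS = NUM_LINES.bit_length() - 1   # log2(NUM_LINES)


def split_address(addr):
    """Split a physical address into (tag, index, offset), assuming direct mapping."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def byte_mask(offset, size):
    """Return a per-byte valid mask for an access of `size` bytes at `offset`.

    Bit i of the result corresponds to byte i of the 256-byte line.
    """
    if not 1 <= size <= 64:
        raise ValueError("access size must be between 1 and 64 bytes")
    if offset < 0 or offset + size > LINE_SIZE:
        raise ValueError("access must fit within a single line")
    return ((1 << size) - 1) << offset
```

Under these assumptions the cache holds 4096 lines, an address carries an 8-bit line offset and a 12-bit index, and a 64-byte request touches exactly one quarter of a line.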
- the second memory circuit 175 (as a cache) is beneficial for repetitive atomic operations to the same memory line.
- An application will use a barrier synchronization operation to determine when all threads of a process have finished processing a section of an application.
- An in-memory atomic counting operator is used to determine when all threads have entered the barrier. There are as many atomic counting operations as there are threads in the section of the application. Performing atomic operations on the data within the cache can allow these barrier counting operations to complete with just a few clocks per operation.
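The barrier counting pattern can be illustrated in software with a Python sketch. Here a lock-protected counter stands in for the in-memory atomic counting operation, and the names `AtomicCounter`, `fetch_and_add`, and `run_barrier` are hypothetical; the sketch shows only that one atomic increment is performed per thread and that the last arrival releases the others.

```python
import threading


class AtomicCounter:
    """Lock-based stand-in for an in-memory atomic counting operation."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, amount=1):
        # Return the previous value and add `amount` as one indivisible step.
        with self._lock:
            old = self._value
            self._value += amount
            return old

    @property
    def value(self):
        with self._lock:
            return self._value


def run_barrier(num_threads):
    """Run `num_threads` workers through one barrier; return the final count."""
    counter = AtomicCounter()
    all_arrived = threading.Event()

    def worker():
        # Each thread performs exactly one atomic increment on entry.
        if counter.fetch_and_add(1) == num_threads - 1:
            all_arrived.set()   # the last thread to arrive releases everyone
        all_arrived.wait()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Every increment targets the same counter, which is why keeping that line in the second memory circuit 175 is attractive: repeated atomic operations to one address avoid a round trip to the first memory circuit 125.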
- a second, high benefit use of the second memory circuit 175 is caching accesses from the configurable processing circuits 105.
- the configurable processing circuits 105 do not have a cache, but rather data is streamed into and out of internal memory.
- the second memory circuit 175 allows accesses to the same cache line to be efficiently handled.
- the first and second memory controller circuits 100, 100A may receive a data read (data load) request from within the computing system 50, 50A, which has a physical memory address, and which is decoded in the data packet decoder circuit 215 of the network communication interface 170, and transferred to the second memory control circuit 160.
- the second memory control circuit 160 will determine if the requested data corresponding to the physical memory address is within the second memory circuit 175, and if so, will provide the requested data (along with the corresponding request having the address of the requestor (source)) to the first memory control circuit 155 and eventually on to the data packet encoder circuit 220 to encode outgoing data packets for transmission on the communication network 150. When the requested data corresponding to the physical memory address is not within the second memory circuit 175, the second memory control circuit 160 will provide the request (and/or the physical memory address) to the first memory control circuit 155, which will access and obtain the requested data from the first memory circuit 125. In addition to providing the requested data to the data packet encoder circuit 220 to encode outgoing data packets for transmission on the
- the first memory control circuit 155 provides the data to the write merge circuit 180, which will also write the data to the second memory circuit 175.
- This additional writing of the requested data to a local cache, such as to the second memory circuit 175, provides a significant reduction in latency, and is a significant and novel feature of the representative embodiments.
- this requested data may be required more frequently than other stored data, so having it stored locally reduces the latency (i.e., the period of time involved) which would otherwise be required to fetch the data from the first memory circuit 125.
- the use of the second memory circuit 175 as a local cache provides reduced latency for repetitively accessed memory locations (in the first memory circuit 125).
- the second memory circuit 175 provides a read buffer for sub-memory line accesses, i.e., accesses to the first memory circuit 125 which do not require the entire memory line of the first memory circuit 125. This use of the second memory circuit 175 is also particularly beneficial for compute elements in the system 50, 50A which have small or no data caches.
- the first and/or second memory controller circuit 100, 100A is responsible for optimally controlling the first memory circuit 125 (e.g., GDDR6 RAM) to load the second memory circuit 175 (as a cache) with requested data upon a cache miss, and store data from the second memory circuit 175 when a cache line is transferred out of the second memory circuit 175, i.e., evicted to make room for other incoming data.
- the GDDR6 device as a representative embodiment of a first memory circuit 125, for example and without limitation, has two independent channels, each 16 bits wide and running at 16 GT/s. A single GDDR6 device can support a peak bandwidth of 64 GB/s.
- the GDDR6 device has a channel burst length of 16, resulting in a 32B burst of data.
- Four bursts from each open row (i.e., 128 bytes) are required to achieve full memory bandwidth.
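The figures quoted above for the representative GDDR6 device follow from simple arithmetic. The short Python sketch below reproduces them; the constant and function names are chosen here for illustration and do not come from the source.

```python
CHANNELS = 2              # independent channels per device
CHANNEL_WIDTH_BITS = 16   # width of each channel
TRANSFER_RATE_GT_S = 16   # giga-transfers per second per channel
BURST_LENGTH = 16         # transfers per channel burst


def peak_bandwidth_gb_s():
    """Peak device bandwidth in GB/s: channels x width x rate, bits to bytes."""
    return CHANNELS * CHANNEL_WIDTH_BITS * TRANSFER_RATE_GT_S / 8


def burst_bytes():
    """Bytes delivered by one channel burst."""
    return CHANNEL_WIDTH_BITS * BURST_LENGTH // 8


def bytes_per_open_row(bursts=4):
    """Bytes transferred from one open row when `bursts` bursts are issued."""
    return bursts * burst_bytes()
```

With these values, `peak_bandwidth_gb_s()` evaluates to 64.0, `burst_bytes()` to 32, and `bytes_per_open_row()` to 128, matching the stated 64 GB/s, 32B burst, and 128 bytes per open row. Any ECC overhead would reduce the usable share of that bandwidth.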
- the bandwidth may be reduced when some of the bits are utilized for error correction coding (“ECC”).
- the second memory control circuit 160 will reserve a cache line in the second memory circuit 175, by setting a hazard bit (in hardware), so that cache line cannot be read, overwritten or modified by another process. As discussed in greater detail below, this process may also remove or “evict” the data currently occupying the reserved cache line, which will then be provided to the first memory control circuit 155 to write (store) this data to be replaced or “evicted” from the second memory circuit 175 and stored in or to the first memory circuit 125. Following the additional writing of the requested data to the second memory circuit 175, any corresponding hazard bit which was set will be cleared (reset) by the memory hazard clear (reset) circuit 190.
- the first and second memory controller circuits 100, 100A may receive a data write (data store) request from within the computing system 50, 50A, which has a physical memory address, and which is decoded in the data packet decoder circuit 215 of the network communication interface 170, and transferred to the second memory control circuit 160.
- the second memory control circuit 160 will write (store) the data locally, in the second memory circuit 175.
- the second memory control circuit 160 may reserve a cache line in the second memory circuit 175, by setting a hazard bit (in hardware), so that cache line cannot be read by another process while it is in transition.
- this process may also remove or “evict” the data currently occupying the reserved cache line, which will also be written (stored) to the first memory circuit 125.
- any corresponding hazard bit which was set will be cleared (reset) by the memory hazard clear (reset) circuit 190.
- Predetermined types of atomic operations may also be performed by the predetermined atomic operations circuitry 185 of the atomic and merge operation circuits 165, involving requests for a predetermined or “standard” atomic operation on the requested data, such as comparatively simple, single-cycle integer atomics (e.g., fetch-and-increment, compare-and-swap, or an increment-by-one atomic operation), which will occur with the same throughput as a regular memory read or write operation not involving an atomic operation.
- the second memory control circuit 160 will reserve a cache line in the second memory circuit 175, by setting a hazard bit (in hardware), so that cache line cannot be read by another process while it is in transition.
- the data is obtained from either the first memory circuit 125 or the second memory circuit 175, and is provided to the predetermined atomic operations circuitry 185 to perform the requested atomic operation.
- the predetermined atomic operations circuitry 185 provides the resulting data to the write merge circuit 180, which will also write the resulting data to the second memory circuit 175.
- any corresponding hazard bit which was set will be cleared (reset) by the memory hazard clear (reset) circuit 190.
- Customized or programmable atomic operations may be performed by the programmable atomic operations circuitry 135 (which may be part of the first memory controller circuit 100 or a processor 110A), involving requests for a programmable atomic operation on the requested data. Any user may prepare any such programming code to provide such customized or programmable atomic operations, subject to various constraints described below.
- the programmable atomic operations may be comparatively simple, multi-cycle operations such as floating point addition, or comparatively complex, multi-instruction operations such as a bloom filter insert.
- the programmable atomic operations can be the same as or different than the predetermined atomic operations, insofar as they are defined by the user rather than a system vendor.
- the second memory control circuit 160 will reserve a cache line in the second memory circuit 175, by setting a hazard bit (in hardware), so that cache line cannot be read by another process while it is in transition.
- the data is obtained from either the first memory circuit 125 or the second memory circuit 175, and is provided to the programmable atomic operations circuitry 135 (e.g., within the first memory controller circuit 100 or on dedicated communication link 60 to a processor 110A) to perform the requested programmable atomic operation.
- the programmable atomic operations circuitry 135 will provide the resulting data to the network communication interface 170 (within the first memory controller circuit 100 or within a processor 110A) to directly encode outgoing data packets having the resulting data for transmission on the communication network 150.
- the programmable atomic operations circuitry 135 will provide the resulting data to the second memory control circuit 160, which will also write the resulting data to the second memory circuit 175.
- any corresponding hazard bit which was set will be cleared (reset) by the second memory control circuit 160.
- the approach taken for programmable (i.e., “custom”) atomic operations is to provide multiple, generic, custom atomic request types that can be sent through the communication network 150 to the first and/or second memory controller circuits 100, 100A, from an originating source such as a processor 110 or other system 50, 50A component.
- the first and second memory controller circuits 100, 100A identify the request as a custom atomic and forward the request to the programmable atomic operations circuitry 135, either within the first memory controller circuit 100 or within a processor 110A.
- the programmable atomic operations circuitry 135 (1) is a programmable processing element capable of efficiently performing a user defined atomic operation; (2) can perform load and stores to memory, arithmetic and logical operations and control flow decisions; and (3) leverages the RISC-V ISA with a set of new, specialized instructions to facilitate interacting with the first and/or second memory controller circuits 100, 100A or their components to atomically perform the user-defined operation.
- the RISC-V ISA contains a full set of instructions that support high level language operators and data types.
- the programmable atomic operations circuitry 135 may leverage the RISC-V ISA, but generally support a more limited set of instructions and limited register file size to reduce the die size of the unit when included within a first memory controller circuit 100.
- a second memory control circuit 160 comprises a second memory access control circuit 225; a memory hazard control circuit 230 having memory hazard registers 260; a network request queue 250; an atomic operation return queue 255; an inbound request multiplexer 245; an optional delay circuit 235, and an inbound control multiplexer 240.
- the second memory access control circuit 225 is coupled to the second memory circuit 175 (e.g., SRAM) and comprises state machine and logic circuits to read and write to the second memory circuit 175 with corresponding addressing, to provide signaling to the memory hazard control circuit 230 to set or clear the various memory hazard bits, and to generate cache “eviction” requests when a cache line of the second memory circuit 175 contains data which is to be overwritten by other data and which is to be written to the first memory circuit 125.
- the memory hazard control circuit 230 comprises memory hazard registers 260.
- the memory hazard control circuit 230 maintains a table of hazard bits in the memory hazard registers 260 indicating which cache lines of the second memory circuit 175 are unavailable for access.
- An inbound request that tries to access such a cache line with a hazard bit set is held by the memory hazard control circuit 230 (or, equivalently, the memory hazard clear (reset) circuit 190) until the hazard is cleared. Once the hazard is cleared then the request is resent through the inbound request multiplexer 245 for processing.
- the tag address of the cache line of the second memory circuit 175 is hashed to a hazard bit index.
- the number of hazard bits is generally chosen to set the hazard collision probability to a sufficiently low level.
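To make the hashing of tag addresses onto hazard bits more concrete, here is a minimal Python sketch. The source does not specify the hash function or the table size, so the XOR-fold hash, the default of 64 bits, and the names `HazardTable` and `collision_probability` are all assumptions made for illustration. The probability estimate further assumes uniform, independent hashing.

```python
class HazardTable:
    """Table of hazard bits indexed by a hash of the cache line tag address."""

    def __init__(self, num_bits=64):
        self.num_bits = num_bits
        self.bits = [False] * num_bits

    def index(self, tag_address):
        # Hypothetical hash: fold upper bits onto lower bits, then reduce.
        return (tag_address ^ (tag_address >> 7)) % self.num_bits

    def set(self, tag_address):
        self.bits[self.index(tag_address)] = True

    def clear(self, tag_address):
        self.bits[self.index(tag_address)] = False

    def is_set(self, tag_address):
        # True also when a different tag hashing to the same bit set it
        # (a hazard collision, which only costs an unnecessary wait).
        return self.bits[self.index(tag_address)]


def collision_probability(num_bits, hazards_in_flight):
    """Chance that an unrelated tag lands on a set bit, assuming uniform hashing."""
    return 1.0 - (1.0 - 1.0 / num_bits) ** hazards_in_flight
```

Under these assumptions, increasing `num_bits` for a fixed number of in-flight hazards lowers the collision estimate, which is the trade-off the sizing statement above refers to.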
- the network request queue 250 provides a queue for inbound requests (e.g., load, store) from the communication network 150.
- the atomic operation return queue 255 provides a queue for resulting data from programmable atomic operations.
- the inbound request multiplexer 245 selects and prioritizes between inbound memory request sources, which are, in order of priority, requests from the memory hazard clear (reset) circuit 190, requests from the atomic operation return queue 255, and requests from the network request queue 250, and provides these requests to the second memory access control circuit 225.
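The fixed priority order described above can be sketched as a simple selection over three queues. This Python model is illustrative only; the function name and the string labels for the sources are hypothetical.

```python
from collections import deque


def select_inbound_request(hazard_released, atomic_return, network):
    """Pop one request from the highest-priority non-empty source.

    Priority, highest first: requests released by the hazard clear circuit,
    atomic operation return queue, network request queue.
    Returns (source_label, request), or None when all sources are empty.
    """
    sources = (
        ("hazard_clear", hazard_released),
        ("atomic_return", atomic_return),
        ("network", network),
    )
    for label, queue in sources:
        if queue:
            return label, queue.popleft()
    return None
```

For example, with one pending request in each queue, three successive calls would return the hazard-released request, then the atomic return, then the network request.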
- the optional delay circuit 235 is a pipeline stage to mimic the delay for a read operation from the second memory circuit 175.
- the inbound control multiplexer 240 selects from an inbound network request which requires accessing the first memory circuit 125 (i.e., a cache “miss”, when the requested data was not found in the second memory circuit 175), and a cache “eviction” request from the second memory circuit 175 when a cache line of the second memory circuit 175 contains data which is to be written to the first memory circuit 125 prior to being overwritten by other incoming data (from either a read or write request).
- a first memory control circuit 155 comprises a scheduler circuit 270; one or more first memory bank queues 265; a first memory access control circuit 275; one or more queues for output data and request data, namely, a second memory “hit” request queue 280, a second memory “miss” request queue 285, a second memory “miss” data queue 290, and a second memory “hit” data queue 295; a request selection multiplexer 305, and a data selection multiplexer 310.
- the first memory bank (request) queues 265 are provided so that each separately managed bank of the first memory circuit 125 has a dedicated bank request queue 265 to hold requests until they can be scheduled on the associated bank of the first memory circuit 125.
- the scheduler circuit 270 selects across the bank queues 265 to choose a request for an available bank of the first memory circuit 125, and provides that request to the first memory access control circuit 275.
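A minimal sketch of per-bank request queues with a scheduler is given below. The source does not state the selection policy or the address-to-bank mapping, so the round-robin policy, the `bank_of` callback, and the class name `BankScheduler` are assumptions.

```python
from collections import deque


class BankScheduler:
    """One request queue per bank, with round-robin selection (assumed policy)."""

    def __init__(self, num_banks, bank_of):
        self.queues = [deque() for _ in range(num_banks)]
        self.bank_of = bank_of  # maps a physical address to a bank (assumption)
        self.next_bank = 0

    def enqueue(self, request):
        # Hold the request until its bank can be scheduled.
        self.queues[self.bank_of(request["addr"])].append(request)

    def schedule(self, busy_banks=()):
        """Return (bank, request) for an available bank, or None."""
        n = len(self.queues)
        for offset in range(n):
            bank = (self.next_bank + offset) % n
            if bank not in busy_banks and self.queues[bank]:
                self.next_bank = (bank + 1) % n
                return bank, self.queues[bank].popleft()
        return None
```

Requests to a busy bank simply remain queued, while requests for other banks can still be issued, which is the purpose of keeping a dedicated queue per bank.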
- the first memory access control circuit 275 is coupled to the first memory circuit 125 (e.g., DRAM) and comprises state machine and logic circuits to read (load) and write (store) to the first memory circuit 125 with corresponding addressing, such as row and column addressing, using the physical addresses of the first memory circuit 125.
- the second memory “hit” data queue 295 holds read data provided directly from the second memory circuit 175 (on communication line(s) 234), i.e., data which was held in and read from the second memory circuit 175, until the requested data is selected for provision in a response message.
- the second memory “miss” data queue 290 holds read data provided from the first memory circuit 125, i.e., data which was held in and read from the first memory circuit 125 which was not in the second memory circuit 175, also until the requested data is selected for provision in a response message.
- the second memory “hit” request queue 280 holds request packet information (e.g., the source requestor’s identifier or address used to provide addressing for a response packet) when the requested data was available in the second memory circuit 175, until the request is selected for preparation of a response message.
- the second memory “miss” request queue 285 holds request packet information (e.g., the source requestor’s identifier or address used to provide addressing for a response packet) when the requested data was available in the first memory circuit 125 (and not in the second memory circuit 175), until the request is selected for preparation of a response message.
- the data selection multiplexer 310 selects between first memory circuit 125 read data (held in the second memory “miss” data queue 290) and second memory circuit 175 read data (held in the second memory “hit” data queue 295). The selected data is also written to the second memory circuit 175, as mentioned above. Corresponding request data is then selected, using request selection multiplexer 305, which correspondingly selects between response data held in the second memory “miss” request queue 285 and response data held in the second memory “hit” request queue 280. That read data is then matched with
- the outbound response multiplexer 315 selects between (1) read data and request data provided either by the data selection multiplexer 310 and the request selection multiplexer 305; and (2) data generated by the programmable atomic operations circuitry 135 (when included in an atomic and merge operation circuits 165 of a first memory controller circuit 100) and the request data provided by the request selection multiplexer 305.
- the read or generated data and the request data is provided by the outbound response multiplexer 315 to the network communication interface 170, to encode and prepare a response or return data packet for transmission on the communication network 150.
- the processor 110A performing the programmable atomic operation may itself directly encode and prepare a response or return data packet for transmission on the communication network 150.
- the atomic and merge operation circuits 165, 165A comprise a write merge circuit 180, predetermined atomic operations circuitry 185, and a memory hazard clear (reset) circuit 190, with the atomic and merge operation circuits 165 further comprising programmable atomic operations circuitry 135.
- the write merge circuit 180 receives the read data from the data selection multiplexer 310 and the request data from the request selection multiplexer 305, and merges the request data and read data (to create a single unit having the read data and the source address to be used in the response or return data packet), which it then provides: (1) to the write port of the second memory circuit 175 (on line 236) (or, equivalently, to the second memory access control circuit 225 to write to the second memory circuit 175); (2) optionally, to an outbound response multiplexer 315, for selection and provision to the network communication interface 170 to encode and prepare a response or return data packet for transmission on the communication network 150; or (3) optionally, to the network
- the outbound response multiplexer 315 may receive and select the read data directly from the data selection multiplexer 310 and the request data directly from the request selection multiplexer 305, for provision to the network communication interface 170, to encode and prepare a response or return data packet for transmission on the communication network 150.
- predetermined atomic operations circuitry 185 receives the request and read data, either from the write merge circuit 180 or directly from the data selection multiplexer 310 and the request selection multiplexer 305.
- the atomic operation is performed, and using the write merge circuit 180, the resulting data is written to (stored in) the second memory circuit 175, and also provided to the outbound response multiplexer 315 or directly to the network communication interface 170, to encode and prepare a response or return data packet for transmission on the communication network 150.
- the predetermined atomic operations circuitry 185 handles predefined atomic operations such as fetch-and-increment or compare-and-swap (e.g., atomic operations listed in Table 1). These operations perform a simple read-modify-write operation to a single memory location of 32 bytes or less in size.
- Atomic memory operations are initiated from a request packet transmitted over the communication network 150.
- the request packet has a physical address, atomic operator type, operand size, and optionally up to 32 bytes of data.
- the atomic operation performs the read-modify-write to a second memory circuit 175 cache memory line, filling the cache memory if necessary.
- the atomic operator response may be a simple completion response, or a response with up to 32 bytes of data.
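The read-modify-write behavior of a predetermined atomic operation on a cache memory line can be sketched as follows. This Python model is illustrative only: the function name `atomic_rmw`, the operator labels, and the little-endian byte order are assumptions, and only operators named in the text (fetch-and-increment, compare-and-swap, fetch-and-XOR, fetch-and-AND) are modeled rather than the full contents of Table 1.

```python
MAX_OPERAND_BYTES = 32  # operand limit stated in the description


def atomic_rmw(line, offset, size, op, operand=0, compare=0):
    """Read-modify-write `size` bytes at `offset` of a cache line (bytearray).

    Returns the value read before modification, as a fetch-style atomic would.
    """
    if not 1 <= size <= MAX_OPERAND_BYTES:
        raise ValueError("operand size must be between 1 and 32 bytes")
    mask = (1 << (8 * size)) - 1
    # Read: byte order is an assumption (little-endian).
    old = int.from_bytes(line[offset:offset + size], "little")
    # Modify.
    if op == "fetch_and_increment":
        new = (old + 1) & mask            # wraps at the operand width
    elif op == "fetch_and_xor":
        new = old ^ (operand & mask)
    elif op == "fetch_and_and":
        new = old & operand & mask
    elif op == "compare_and_swap":
        new = (operand & mask) if old == compare else old
    else:
        raise ValueError("unsupported atomic operator: " + op)
    # Write back into the same location of the cache line.
    line[offset:offset + size] = new.to_bytes(size, "little")
    return old
```

In hardware, the hazard bit on the reserved cache line is what prevents another request from observing the line between the read and the write. This single-threaded model has no concurrent accessor, so that protection is implicit.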
- Table 1 shows a list of example atomic memory operators in a representative embodiment.
- the request packet size field will specify the operand width for the atomic operation.
- the various processors (e.g., programmable atomic operations circuitry 135, processor 110, 110A, hybrid threading processor(s) 115, configurable processing circuit(s) 105) are capable of supporting 32 and 64-bit atomic operations, and in some instances, atomic operations with 16 and 32 bytes.
- the set hazard bit for the reserved cache line is to be cleared, by the memory hazard clear (reset) circuit 190. Accordingly, when the request and read data is received by the write merge circuit 180, a reset or clear signal may be transmitted by the memory hazard clear (reset) circuit 190 to the memory hazard control circuit 230 (on communication line 226), to reset or clear the set memory hazard bit for the reserved cache line in the registers 260.
- the write merge circuit 180 may transmit a reset or clear signal to the memory hazard control circuit 230 (on communication line 226), also to reset or clear the set memory hazard bit for the reserved cache line in the registers 260. Also as mentioned above, resetting or clearing of this hazard bit will also release a pending read or write request involving the designated (or reserved) cache line, providing the pending read or write request to the inbound request multiplexer 245 for selection and processing.
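The hold-and-release behavior around a hazard bit can be modeled in a few lines. In this hypothetical Python sketch, requests that target a cache line with a set hazard bit are held, and clearing the bit releases them ahead of other pending requests. The class and method names are invented for illustration, and the `inbound` deque merely stands in for the input of the inbound request multiplexer.

```python
from collections import defaultdict, deque


class HazardControlModel:
    """Holds requests to reserved cache lines until the hazard is cleared."""

    def __init__(self):
        self.hazard_set = set()           # cache lines with a hazard bit set
        self.held = defaultdict(deque)    # line -> requests waiting on it
        self.inbound = deque()            # stand-in for the inbound multiplexer

    def set_hazard(self, line):
        self.hazard_set.add(line)

    def submit(self, line, request):
        """Queue a request; returns False if it was held behind a hazard."""
        if line in self.hazard_set:
            self.held[line].append(request)
            return False
        self.inbound.append(request)
        return True

    def clear_hazard(self, line):
        """Clear the hazard bit and release held requests with top priority."""
        self.hazard_set.discard(line)
        released = self.held.pop(line, deque())
        # Preserve arrival order while placing released requests at the front.
        self.inbound.extendleft(reversed(released))
        return len(released)
```

Placing released requests at the front mirrors the priority order given for the inbound request multiplexer, where requests from the hazard clear circuit are served first.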
- FIGs. 7A, 7B and 7C (collectively referred to as FIG.
- FIGs. 7A and 7B showing a representative method of receiving and decoding of a request and of performing a read or load request from a first memory circuit
- FIG. 7C showing a representative method of performing a read or load request from a second memory circuit
- FIGs. 8A, 8B, 8C, and 8D are flow charts showing a representative method of performing an atomic operation as part of an atomic operation request.
- FIG. 9 is a flow chart showing a representative method of performing a data eviction from the second memory circuit as part of a read (or load) request or as part of a write (or store) request.
- FIG. 10 is a flow chart of a representative method of performing a write or store request.
- the first and/or second memory controller circuits 100 are identical to each other.
- Table 2 shows a list of example read, write and atomic operations, and corresponding requests, in a representative embodiment (with “...” indicating that the requests for other operations may be specified using the immediately preceding request type and pattern, e.g., an AmoXor request for a Fetch-and-XOR atomic operation, an AmoAnd request for a Fetch-and-AND atomic operation, for example and without limitation).
- Table 3 shows a list of example responses from the first and/or second memory controller circuits 100, 100A to read, write and atomic requests, which responses are transmitted as data packets over the communication network 150, in a representative embodiment.
- the source entity or device (i.e., the entity or device issuing the read or write request, such as the various processors (e.g., processor 110), hybrid threading processor(s) 115, or configurable processing circuit(s) 105) generally will not have any information and does not require any information concerning whether the requested read data or requested write data is or will be held in the first memory circuit 125 or the second memory circuit 175, and simply may generate a read or write request to memory and transmit the request over the communication network 150 to the first and/or second memory controller circuits 100, 100A.
- the representative method of receiving and decoding a request and performing a read or load request begins with the reception of a request (e.g., a request from Table 2) by the first and/or second memory controller circuits 100, 100A, start step 400.
- Using the packet decoder circuit 215, the received request is decoded, the type of request is determined (read, write, atomic operation), and the request is placed in a corresponding queue (network request queue 250 or atomic operation request queue 255), step 402.
- If a packet decoder circuit 215 is not included, then the request is placed in a single request queue (a combined network request queue 250 and atomic operation request queue 255), and the steps of decoding the received request and determining the type of request, of step 402, are performed by the second memory access control circuit 225.
- a request is selected from the queue by the inbound request multiplexer 245, step 404, and when the request is a read request, step 406, the second memory access control circuit 225 determines whether the requested data is stored in the second memory circuit 175, step 408.
- When the request is not a read request, the second memory access control circuit 225 determines whether it is a write request, step 410, and if so, proceeds with step 540 illustrated and discussed with reference to FIG. 10.
- When the received request is neither a read request nor a write request from the network request queue 250, it is an atomic operation request from the atomic operation queue 255, and the second memory access control circuit 225 proceeds with step 456 illustrated and discussed with reference to FIG. 8.
- steps 400, 402, 404, and 406 or 410 are generally applicable to all read, write, and/or atomic operations, not just the read operation illustrated in FIG. 7.
- the method will have completed steps 400, 402, 404, and 410, and will have determined that the request selected from the network request queue 250 is a write request.
- steps 406 and 410, determining whether the request is a read request or a write request may occur in any order; as a consequence, completion of step 406 is not required for the commencement of a write operation.
- the determination as to whether the request is an atomic operation request may occur as a separate step (not illustrated), and not merely by a process of elimination that the request is not a read request and is not a write request.
- only two of the steps of determining whether the request is a read request, a write request, or an atomic operation request are required, with any third type of request automatically determined by elimination, in that it is not a first type of request and is not a second type of request. All such variations are considered equivalent and within the scope of the disclosure.
- the second memory access control circuit 225 selects a cache line in the second memory circuit 175, step 411, and, using the memory hazard control circuit 230, determines whether that particular cache line in the second memory circuit 175 has a hazard bit set in the memory hazard registers 260, step 412. If the hazard bit is set for that cache line in the second memory circuit 175, the second memory access control circuit 225 determines whether another cache line is available (which does not have a hazard bit set), step 414, and if so, selects that available cache line in the second memory circuit 175, step 416.
- the second memory access control circuit 225 queues the read request in the memory hazard control circuit 230, step 418, until a hazard bit has been reset or cleared for a cache line in the second memory circuit 175, step 420, and the second memory access control circuit 225 selects that cache line with the reset or cleared hazard bit, returning to step 416.
- the second memory access control circuit 225 determines whether there is data already stored in the selected cache line, step 422, and if there is data in that cache line, performs a data eviction process, step 423 (i.e., performs steps 522 - 534 for a data eviction from the second memory circuit 175, illustrated and discussed with reference to FIG. 9).
- When the selected cache line either had no data already stored (step 422) or the data eviction process has completed (step 423), the second memory access control circuit 225 generates a signal to the memory hazard control circuit 230 to set a hazard bit for the selected cache line in the second memory circuit 175, step 424, to block other requests from accessing the same cache line, as the data in that cache line will be in the process of transitioning and another read or write process should not access it, providing memory coherency.
- the second memory access control circuit 225 transfers the read request to the optional delay circuit 235 (to match the amount of time taken by the second memory access control circuit 225 to access the second memory circuit 175 and determine a cache miss) (or as another option, transfers the request directly to the inbound control multiplexer 240), so that the request will then be selected by the inbound control multiplexer 240 and stored in the first memory bank queues 265, to queue the read request for access to the first memory circuit 125, step 426.
- the scheduler circuit 270 eventually selects the read request from the first memory bank queues 265 and schedules (or initiates) accessing the memory bank of the first memory circuit 125, step 428.
- the requested data from the first memory circuit 125 is read or obtained and is provided to the second memory“miss” data queue 290, and the corresponding request (or request data, such as source address) is provided to the second memory“miss” request queue 285, step 430.
- the read data and the corresponding request are selected and paired together using the write merge circuit 180, step 432, with the write merge circuit 180 then writing the read data to the selected cache line in the second memory circuit 175 (via communication line 236) (or, equivalently, to the second memory access control circuit 225 to write to the second memory circuit 175), step 434.
- “Pairing” together of the read data and the corresponding request simply means selecting them together or matching them together, using the data selection multiplexer 310 and the request selection multiplexer 305, such that both the data and the request can be utilized together or concurrently, such as for an atomic operation or to prepare an outgoing response data packet, for example (i.e., to prevent read data from being paired with the wrong request and sent in error to the source of the wrong request).
- the previously set hazard bit is reset or cleared for the selected cache line, step 436.
- a read response data packet (e.g., a response from Table 3) having the requested read data is prepared and transmitted to the source address, generally over the communication network 150, step 438, and the read operation from the first memory circuit 125 may end, return step 440.
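The read flow walked through above (hit check, eviction, hazard set, fetch from the first memory circuit, cache fill, hazard clear, response) can be condensed into a small behavioral model. The Python sketch below is a simplification under stated assumptions: placement is direct-mapped, the model is single-threaded so no request ever finds a hazard bit already set (steps 412 to 420 are therefore not modeled), and the names `MemoryControllerModel`, `read`, `log`, and the response dictionary layout are hypothetical.

```python
LINE_SIZE = 256  # bytes per cache line, per the description


class MemoryControllerModel:
    """Behavioral model of a read through the second memory (cache)."""

    def __init__(self, num_lines=4):
        self.num_lines = num_lines
        self.tags = [None] * num_lines     # line address held by each cache line
        self.data = [None] * num_lines
        self.hazard = [False] * num_lines
        self.first_memory = {}             # line address -> bytes (DRAM stand-in)
        self.log = []                      # trace of events, for inspection

    def read(self, addr, source, size=8):
        line_addr = addr // LINE_SIZE
        idx = line_addr % self.num_lines   # direct-mapped placement (assumption)
        if self.tags[idx] == line_addr:    # step 408: data is in the cache
            self.log.append("hit")
            payload = self.data[idx]
        else:
            if self.tags[idx] is not None:             # steps 422-423: evict
                self.first_memory[self.tags[idx]] = self.data[idx]
                self.log.append("evict")
            self.hazard[idx] = True                    # step 424: reserve line
            # Steps 426-430: fetch the line from the first memory circuit.
            payload = self.first_memory.get(line_addr, bytes(LINE_SIZE))
            self.tags[idx] = line_addr                 # step 434: fill cache
            self.data[idx] = payload
            self.hazard[idx] = False                   # step 436: clear hazard
            self.log.append("miss")
        offset = addr % LINE_SIZE
        # Step 438: response addressed to the requesting source.
        return {"dest": source, "data": payload[offset:offset + size]}
```

A first read of an address misses and fills the cache line, so a second read within the same 256B line is served as a hit. A later read that maps to the same cache line but a different line address triggers an eviction before the fill.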
- the second memory access control circuit 225 determines whether that particular cache line in the second memory circuit 175 has a hazard bit set in the memory hazard registers 260, step 442. If the hazard bit is set for that cache line in the second memory circuit 175, the second memory access control circuit 225 queues the read request in the memory hazard control circuit 230, step 444, until a hazard bit has been reset or cleared for that cache line in the second memory circuit 175, step 446.
- the second memory access control circuit 225 reads or obtains the requested data from that cache line and transfers it directly to the second memory “hit” data queue 295, step 448.
- the second memory access control circuit 225 transfers the read request to the optional delay circuit 235 (to match the amount of time taken by the second memory access control circuit 225 to access the second memory circuit 175 and obtain the data) and the corresponding request (or request data, such as source address) is provided to the second memory “hit” request queue 280.
- using the data selection multiplexer 310 and the request selection multiplexer 305, the read data and the corresponding request are selected and paired together, step 450.
- a read response data packet (e.g., a response from Table 3) having the requested read data is prepared and transmitted to the source address, generally over the communication network 150, step 452, and the read operation from the second memory circuit 175 may end, return step 454.
- communication interface 170 to encode and prepare a response or return data packet for transmission on the communication network 150; or (2) optionally, using the write merge circuit 180, providing the read data and the corresponding request to the network
- the outbound response multiplexer 315 may receive and select the read data directly from the data selection multiplexer 310 and the request data directly from the request selection multiplexer 305, for provision to the network communication interface 170, to encode and prepare a response or return data packet for transmission on the communication network 150.
- the incoming request to the first and/or second memory controller circuits 100, 100A may be for an atomic operation, which is essentially a read request to obtain the operand data, followed by an atomic operation on the operand data, followed by a write request to save the resulting data to memory.
- the read operation portion of the atomic operation request generally tracks the read operation previously discussed, with the additional step of setting a hazard bit for the selected cache line of the second memory circuit 175, through step 432 for a cache miss or through step 450 for a cache hit, i.e., through obtaining the data from either the first memory circuit 125 or the second memory circuit 175 and providing the data and request into the corresponding queues 280, 285, 290, 295.
- these steps are also discussed below.
- the second memory access control circuit 225 determines whether the requested operand data is stored in the second memory circuit 175, step 456.
- the second memory access control circuit 225 has determined (in step 456) that the requested data is stored in a cache line in the second memory circuit 175, i.e., a cache hit, using the memory hazard control circuit 230, the second memory access control circuit 225 determines whether that particular cache line in the second memory circuit 175 has a hazard bit set in the memory hazard registers 260, step 458.
- the second memory access control circuit 225 queues the atomic operation request in the memory hazard control circuit 230, step 460, until a hazard bit has been reset or cleared for that cache line in the second memory circuit 175, step 462.
- the second memory access control circuit 225 sets the hazard bit for that cache line in the second memory circuit 175 (as that data will be updated following the atomic operation), step 464, obtains the requested data from that cache line and transfers it directly to the second memory “hit” data queue 295, step 466, performing the “fetch” portion of the atomic operation (e.g., of a Fetch-and-AND or of Fetch-and-Swap (Exchange), for example).
- the second memory access control circuit 225 transfers the atomic operation request to the optional delay circuit 235 (to match the amount of time taken by the second memory access control circuit 225 to access the second memory circuit 175 and obtain the data) and the corresponding request (or request data, such as source address) is provided to the second memory “hit” request queue 280.
- using the data selection multiplexer 310 and the request selection multiplexer 305, the read operand data and the corresponding atomic operation request are selected and paired together, step 468.
- the second memory access control circuit 225 selects a cache line in the second memory circuit 175, step 470, and, using the memory hazard control circuit 230, determines whether that particular cache line in the second memory circuit 175 has a hazard bit set in the memory hazard registers 260, step 472. If the hazard bit is set for that cache line in the second memory circuit 175, the second memory access control circuit 225 determines whether another cache line is available (which does not have a hazard bit set), step 474, and if so, selects that available cache line in the second memory circuit 175, step 476.
- the second memory access control circuit 225 queues the atomic operation request in the memory hazard control circuit 230, step 478, until a hazard bit has been reset or cleared for a cache line in the second memory circuit 175, step 480, and the second memory access control circuit 225 selects that cache line with the reset or cleared hazard bit, returning to step 476.
- the second memory access control circuit 225 determines whether there is data already stored in the selected cache line, step 482, and if there is data in that cache line, performs a data eviction process, step 484 (i.e., performs steps 522 - 534 for a data eviction from the second memory circuit 175, illustrated and discussed with reference to FIG. 9).
- when the selected cache line either had no data already stored (step 482) or the data eviction process has completed (step 484), the second memory access control circuit 225 generates a signal to the memory hazard control circuit 230 to set a hazard bit for the selected cache line in the second memory circuit 175, step 486, to block other requests from accessing the same cache line, as the data in that cache line will be in the process of transitioning and another read or write process should not access it, providing memory coherency.
- the second memory access control circuit 225 transfers the atomic operation request to the optional delay circuit 235 (to match the amount of time taken by the second memory access control circuit 225 to access the second memory circuit 175 and determine a cache miss) (or as another option, transfers the request directly to the inbound control multiplexer 240), so that the request will then be selected by the inbound control multiplexer 240 and stored in the first memory bank queues 265, to queue the atomic operation request for access to the first memory circuit 125, step 488.
- the scheduler circuit 270 eventually selects the atomic operation request from the first memory bank queues 265 and schedules (or initiates) accessing the memory bank of the first memory circuit 125, step 490.
- the requested data from the first memory circuit 125 is obtained (read) and is provided to the second memory “miss” data queue 290, and the corresponding atomic operation request (including request data, such as source address) is provided to the second memory “miss” request queue 285, step 492, performing the “fetch” portion of the atomic operation.
- the read data and the corresponding request are selected and paired together, step 494.
- following step 468 or step 494, there is an available cache line in the second memory circuit 175 which has been reserved (e.g., hazard bit set), operand data has been read (obtained) from either the second memory circuit 175 or the first memory circuit 125, and the read data has been paired or matched to its corresponding atomic operation request.
- when the atomic operation request is for a predetermined atomic operation, step 496, the data selection multiplexer 310 and the request selection multiplexer 305 transfer the data and request to the predetermined atomic operations circuitry 185, step 498, and the predetermined atomic operations circuitry 185 performs the requested atomic operation to produce resulting data, step 500, e.g., Fetch-and-AND, Fetch-and-OR, Fetch-and-XOR, Fetch-and-Add, Fetch-and-Subtract, Fetch-and-Increment, Fetch-and-Decrement, Fetch-and-Minimum, Fetch-and-Maximum, Fetch-and-Swap (Exchange), Compare-and-Swap.
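The predetermined atomic operations listed above can be illustrated with a small dispatch table. This Python sketch is only a software analogy for the circuitry; it assumes 64-bit unsigned operands and the conventional fetch-and-op behavior of returning the prior value, neither of which is specified at this point in the source. The names `FETCH_OPS`, `predetermined_atomic`, and `compare_and_swap` are hypothetical.

```python
MASK64 = (1 << 64) - 1  # assumption: operands are 64-bit unsigned values

# Each entry maps (old value, argument) to the new value stored in memory.
FETCH_OPS = {
    "fetch_and_and": lambda old, arg: old & arg,
    "fetch_and_or": lambda old, arg: old | arg,
    "fetch_and_xor": lambda old, arg: old ^ arg,
    "fetch_and_add": lambda old, arg: (old + arg) & MASK64,
    "fetch_and_subtract": lambda old, arg: (old - arg) & MASK64,
    "fetch_and_increment": lambda old, arg: (old + 1) & MASK64,
    "fetch_and_decrement": lambda old, arg: (old - 1) & MASK64,
    "fetch_and_minimum": lambda old, arg: min(old, arg),
    "fetch_and_maximum": lambda old, arg: max(old, arg),
    "fetch_and_swap": lambda old, arg: arg,
}


def predetermined_atomic(memory, addr, op, arg=0):
    """Fetch the operand, apply the operation, store the result, return the old value."""
    old = memory[addr]
    memory[addr] = FETCH_OPS[op](old, arg)
    return old


def compare_and_swap(memory, addr, expected, new):
    """Store `new` only if the current value equals `expected`; return the old value."""
    old = memory[addr]
    if old == expected:
        memory[addr] = new
    return old


mem = {0x40: 5}
r1 = predetermined_atomic(mem, 0x40, "fetch_and_add", 3)  # memory becomes 8
r2 = compare_and_swap(mem, 0x40, 8, 1)                    # memory becomes 1
```

In the memory controller, the hazard bit on the cache line is what makes the read, modify, and write sequence indivisible; the sketch omits that locking.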
- the resulting data is written to the selected cache line in the second memory circuit 175 (via communication line 236) (or, equivalently, to the second memory access control circuit 225 to write to the second memory circuit 175), step 502.
- the previously set hazard bit is reset or cleared for the selected cache line, step 504, using the memory hazard clear (reset) circuit 190 or the memory hazard control circuit 230.
- an atomic operation response data packet (e.g., a response from Table 3) having the requested resulting data is prepared and transmitted by the network communication interface 170 to the source address (provided in the request), generally over the communication network 150, step 506, and the predetermined atomic operation may end, return step 508.
- the atomic operation request is not for a predetermined atomic operation in step 496, i.e., is for a programmable or custom atomic operation
- the atomic operation request and the read data are transferred, as part of a “work descriptor” discussed in greater detail below, to the programmable atomic operations circuitry 135, step 510, either over communication line or bus 60 to the processor 110A or over communication line or bus 60A to the programmable atomic operations circuitry 135 within the atomic and merge operation circuits 165.
- the programmable atomic operations circuitry 135 performs the requested programmable atomic operation to produce resulting data, step 512, as discussed in greater detail below, and transfers the resulting data with the programmable atomic operation request to the atomic operation request queue 255, step 514.
- the resulting data is written to the selected cache line in the second memory circuit 175, essentially as a write operation, step 516.
- a programmable atomic operation response data packet (e.g., a response from Table 3) having the requested resulting data is prepared and transmitted to the source address, generally over the communication network 150, step 520, and the programmable atomic operation may end, return step 505.
- the programmable atomic operations circuitry 135 is included within the atomic and merge operation circuits 165
- the programmable atomic operation response data packet may be prepared and transmitted the same way any other response packet is prepared and transmitted, as described above (e.g., using the packet encoder 220 of the network communication interface 170, and so on).
- the programmable atomic operation response data packet may be prepared and transmitted directly by the processor 110A, which also generally has a similar or identical network communication interface 170.
- the data currently held in a selected cache line in the second memory circuit 175 may have to be transferred to the first memory circuit 125, to preserve and continue to store the currently held data and to allow the selected cache line in the second memory circuit 175 to be utilized for other data, a process referred to herein as a data “eviction” (which is used in step 423 (read operation), step 484 (atomic operation), or step 556 (write (or store) request, illustrated and discussed with reference to FIG. 10)).
- step 422 read operation
- step 482 atomic operation
- step 554 write (or store) request
- the data eviction process begins, start step 522, and the second memory access control circuit 225 generates a signal to the memory hazard control circuit 230 to set a hazard bit (in registers 260) for the selected cache line in the second memory circuit 175, step 524, to block other requests from accessing the same cache line, as the data in that cache line will be in the process of transitioning to the replacement data.
- the second memory access control circuit 225 reads the current data from the selected cache line in the second memory circuit 175, and with its corresponding memory address, step 526, queues the evicted data and memory address (effectively as or equivalently to a write request) for writing (storing) to the first memory circuit 125, step 528, e.g., transfers the current read data and memory address to the inbound control multiplexer 240, so that the data and request (with memory address) will then be selected by the inbound control multiplexer 240 and stored in the first memory bank queues 265.
- the scheduler circuit 270 eventually selects the evicted data and memory address from the first memory bank queues 265 and schedules (or initiates) accessing the memory bank of the first memory circuit 125, step 530.
- the evicted data is stored to the first memory circuit 125 at the specified memory address, step 532, and the memory hazard control circuit 230 (or the memory hazard clear (reset) circuit 190) resets or clears the hazard bit (in registers 260) for the selected cache line in the second memory circuit 175, step 534.
- the data eviction process may end, return step 536, and the selected cache line in the second memory circuit 175 may then be overwritten, without any loss of the previous data, which is now stored in the first memory circuit 125.
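The eviction sequence above (steps 524 to 534) can be sketched in a few lines. This Python model collapses the queueing and scheduling of the write into a single assignment, so it is an illustration under simplifying assumptions rather than the described hardware flow; the function name `evict` and the dictionary-based cache are hypothetical.

```python
def evict(cache, hazard_bits, main_memory, line):
    """Illustrative eviction of one cache line, loosely following steps 524-534."""
    hazard_bits.add(line)            # step 524: block other requests to this line
    addr, data = cache[line]         # step 526: read the current data and its address
    pending_write = (addr, data)     # step 528: queue it, effectively as a write request
    # steps 530-532: the scheduler selects the write and stores it in the first memory
    main_memory[pending_write[0]] = pending_write[1]
    hazard_bits.discard(line)        # step 534: clear the hazard bit
    cache[line] = None               # the line may now be overwritten without data loss


cache = {0: (0x100, 42)}   # cache line 0 holds value 42 for memory address 0x100
hazards = set()
first_memory = {}
evict(cache, hazards, first_memory, 0)
```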
- following step 534 or 536, upon clearing or resetting of the hazard bit, the process which required the data eviction process may resume (e.g., proceeding to step 424 for a reading (or loading) process, proceeding to step 558 for a writing process, or proceeding to step 486 for an atomic operation process).
- steps 400, 402, 404, 406, and 410 discussed above are generally applicable to all read, write, and/or atomic operations, not just the read operation illustrated in FIG. 7.
- the method will have completed steps 400, 402, 404, and 410, and will have determined that the request selected from the network request queue 250 is a write request.
- steps 406 and 410, determining whether the request is a read request or a write request may occur in any order; as a consequence, completion of step 406 is not required for the commencement of a write operation.
- any determination that the request is a write request may also occur through a process of elimination or by default, i.e., a determination that it is not a first type of request (read) and is not a second type of request (atomic operation).
- the write (or store) operation begins, start step 540, and the second memory access control circuit 225 selects a cache line in the second memory circuit 175, step 542, and using the memory hazard control circuit 230, determines whether that particular cache line in the second memory circuit 175 has a hazard bit set in the memory hazard registers 260, step 544.
- the second memory access control circuit 225 determines whether another cache line is available (which does not have a hazard bit set), step 546, and if so, selects that available cache line in the second memory circuit 175, step 548.
- the second memory access control circuit 225 queues the write request in the memory hazard control circuit 230, step 550, until a hazard bit has been reset or cleared for a cache line in the second memory circuit 175, step 552, and the second memory access control circuit 225 proceeds to step 548 to select that cache line with the reset or cleared hazard bit.
- the second memory access control circuit 225 determines whether there is data already stored in the selected cache line, step 554, and if there is data in that cache line, performs a data eviction process, step 556 (i.e., performs steps 522 - 534 for a data eviction from the second memory circuit 175, illustrated and discussed with reference to FIG. 9).
- when the selected cache line either had no data already stored (step 554) or the data eviction process has completed (step 556), the second memory access control circuit 225 generates a signal to the memory hazard control circuit 230 to set a hazard bit for the selected cache line in the second memory circuit 175, step 558, to block other requests from accessing the same cache line, as the data in that cache line will be in the process of transitioning and another read or write process should not access it, providing memory coherency.
- the write data of the write request is then written to (stored in) the second memory circuit 175, using the address specified in the write request, step 560, and the previously set hazard bit is reset or cleared, step 562.
- steps of writing to the second memory circuit 175 and clearing the hazard bit may occur in any of several ways, for example: (1) using the second memory access control circuit 225 to store the write data in the second memory circuit 175 and generate a signal to the memory hazard control circuit 230 to reset or clear the hazard bit for the selected cache line in the second memory circuit 175; or (2) routing the request through the second memory “hit” request queue 280 and routing the write data to the second memory “hit” data queue 295, followed by using the write merge circuit 180 to write the write data to the selected cache line in the second memory circuit 175 (via communication line 236), and to generate a signal to the memory hazard control circuit 230 to reset or clear the hazard bit for the selected cache line in the second memory circuit 175.
- a write operation response data packet having an acknowledgement (or completion) is prepared and transmitted to the source address, generally over the communication network 150, step 564, and the write operation may end, return step 566. Any of the methods and components mentioned previously may be utilized to prepare and transmit the response data packet.
- the second memory circuit 175 will store the most recent data written to the second memory circuit 175, until that data may be subsequently “evicted” and moved or re-stored in the first memory circuit 125. As that most recent data may be utilized again comparatively promptly, storing the data in the second memory circuit 175 also serves to reduce latency.
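The write (store) flow of steps 542 to 564 can be summarized in the following Python sketch. It assumes a policy of preferring an empty cache line among those without a hazard bit, which the source does not specify, and it performs the eviction inline rather than through the first memory bank queues. The function name `write_request` and the return strings are hypothetical.

```python
def write_request(cache, hazard_bits, main_memory, addr, data):
    """Illustrative write path; returns "ack" on completion or "queued" if blocked."""
    # steps 542-548: consider only cache lines whose hazard bit is not set
    free = [ln for ln in sorted(cache) if ln not in hazard_bits]
    if not free:
        return "queued"  # steps 550-552: wait until some hazard bit is cleared
    # Assumed policy: prefer an empty line, otherwise take the first free line.
    line = next((ln for ln in free if cache[ln] is None), free[0])
    if cache[line] is not None:
        # steps 554-556: evict the current contents to the first memory
        old_addr, old_data = cache[line]
        main_memory[old_addr] = old_data
    hazard_bits.add(line)        # step 558: reserve the line
    cache[line] = (addr, data)   # step 560: store the write data
    hazard_bits.discard(line)    # step 562: clear the hazard bit
    return "ack"                 # step 564: acknowledgement to the source


cache = {0: (0x100, 1), 1: None}
hazards = {1}                    # line 1 is reserved by another request
first_memory = {}
result = write_request(cache, hazards, first_memory, 0x200, 9)
```

Here line 1 is unavailable, so line 0 is chosen, its prior contents are evicted to the first memory, and the new data takes its place.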
- the user defined, programmable atomic operations allow a user to define an atomic operation that is of value to a single application.
- the user-defined atomic operation is performed by execution of a single processor instruction.
- a user may create a set of programmable atomic operation instructions that allow an application to issue user defined atomic operations to the memory controller circuit 100, 100A.
- the atomic operations are issued similarly to predetermined atomic operations.
- the programmable atomic operation request includes a physical memory address, a programmable atomic operation identifier and typically some number of thread state register values.
- the memory controller circuit 100, 100A receives the programmable atomic operation request, the second memory control circuit 160 places a hazard (sets a hazard bit) on the target cache line of the second memory circuit 175, and then passes the programmable atomic operation request information to the programmable atomic operations circuitry 135 within or coupled to the memory controller circuit 100, 100A.
- the programmable atomic operations circuitry 135 initializes its register state from the provided programmable atomic operation request information (including memory address, 64-bit memory value located at the memory address, and the thread state registers).
- programmable atomic operations circuitry 135 executes a series of instructions to perform the programmable atomic operation. Results of the programmable atomic operation are stored back to the target memory line of the second memory circuit 175 and possibly returned to the requesting source or processor in a response packet.
- the requesting source or processor generally informs the system 50, 50A through a system call that a programmable atomic operation is required.
- the operating system processes the system call by loading the provided set of instructions in the memory associated with the programmable atomic operations circuitry 135.
- the programmable atomic operations circuitry 135 begins executing a user-defined, programmable atomic operation by executing the loaded instructions starting at a location obtained from the programmable atomic operation atomic identifier in the
- the programmable atomic operations circuitry 135 forces all memory requests to the memory line of the second memory circuit 175 covered by the hazard bit originally set by the second memory control circuit 160 upon receipt of the atomic operation. Additionally, the programmable atomic operations circuitry 135 should limit the number of instructions for execution to a finite number to ensure that the programmable atomic operation completes. The programmable atomic operations circuitry 135 also detects accesses to out of bounds memory lines and execution of too many instructions, and responds back to the requesting source or processor with a failure status. The requesting source or processor would then issue a trap to notify the system of the failure.
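The two safeguards described above, a bound on the number of executed instructions and a check that memory accesses stay within the hazard-protected memory line, can be sketched as follows. The toy instruction format, the limit value, and the names `MAX_INSTRUCTIONS` and `run_programmable_atomic` are all hypothetical; the source gives no numeric limit or encoding.

```python
MAX_INSTRUCTIONS = 32  # hypothetical limit; the source does not give a value


def run_programmable_atomic(instructions, line_base, line_size, memory):
    """Run a toy instruction list of (op, address, value) tuples.

    op is "load", "store", or "nop". Returns a status string.
    """
    executed = 0
    for op, addr, value in instructions:
        executed += 1
        if executed > MAX_INSTRUCTIONS:
            return "failure: too many instructions"
        in_bounds = line_base <= addr < line_base + line_size
        if op in ("load", "store") and not in_bounds:
            return "failure: out-of-bounds access"
        if op == "store":
            memory[addr] = value
    return "success"


mem = {}
ok = run_programmable_atomic([("store", 0x108, 7)], 0x100, 64, mem)
oob = run_programmable_atomic([("load", 0x200, 0)], 0x100, 64, mem)
too_long = run_programmable_atomic([("nop", 0, 0)] * 33, 0x100, 64, mem)
```

A failure status of this kind is what would be returned to the requesting source or processor, which could then issue a trap.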
- example programmable atomic operation instructions have representative formats as shown in Table 4, for example.
- the programmable atomic operations circuitry 135 is based on a RISC-V processor ISA with modifications to efficiently perform programmable atomic memory operations. These modifications allow programmable atomic operations circuitry 135 or other processor 110,
- HTP 115 such as a RISC-V processor
- the modifications include barrel-style instruction processing across multiple threads. Barrel processing allows the programmable atomic operations circuitry 135 or other processor 110, 110A to hide memory access latencies by switching to other ready-to-execute threads. Barrel processing increases the overall time for a single atomic operation; however, this style of processing greatly increases atomic operation throughput. This tradeoff is appropriate for an application that must perform custom atomic operations across a large number of memory locations. An application that performs atomic operations on a single memory location (such as a counting barrier operation) could use a predetermined atomic operation. For this situation, the memory location will be cached in the second memory circuit 175 and the memory hazard bit (lock) will be set for a comparatively minimal amount of time.
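Barrel processing can be illustrated with a round-robin scheduler that issues one instruction per ready thread in turn. This Python sketch is a simplified model, assuming every thread is always ready and every instruction takes one issue slot; the function name `barrel_schedule` is hypothetical.

```python
from collections import deque


def barrel_schedule(threads):
    """threads maps thread ID -> number of instructions left to execute.

    Returns the order in which thread IDs are given an issue slot.
    """
    remaining = dict(threads)
    ready = deque(t for t in sorted(threads) if threads[t] > 0)
    issue_order = []
    while ready:
        tid = ready.popleft()       # switch to the next ready-to-execute thread
        issue_order.append(tid)     # issue one instruction for that thread
        remaining[tid] -= 1
        if remaining[tid] > 0:
            ready.append(tid)       # thread rejoins the back of the rotation
    return issue_order


order = barrel_schedule({0: 2, 1: 2, 2: 1})
```

Thread 0 needs only two instructions but finishes on the fourth issue slot, which shows the longer single-operation time; in exchange, the slots between its instructions do useful work for other threads instead of waiting on memory.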
- the RISC-V architecture supports this style of memory interface with acquire/release instruction functionality.
- the programmable atomic operations circuitry 135 is embodied in an HTP 115, as mentioned above, described in greater detail in U.S. Patent Application No. 16/176,434.
- the HTP 115 is utilized as a processor 110A, and is also provided with a direct communication link (or line) 60 to and from the second memory controller circuit 100A.
- FIG. 11 is a block diagram of a representative programmable atomic operations circuitry 135 (or other processor 110, 110A) embodiment, including when such programmable atomic operations circuitry 135 may be included in an HTP 115.
- the programmable atomic operations circuitry 135 comprises a memory controller interface circuit 720, core control circuitry 610, and one or more processor cores 605.
- the memory controller interface circuit 720 manages communication between the programmable atomic operations circuitry 135 and either the balance of the first memory controller circuit 100 or the separate second memory controller circuit 100A, such as (1) providing the programmable atomic operation request and read data to control logic and thread selection circuitry 630, core control and thread memory 615, and/or data buffers 715 of the programmable atomic operations circuitry 135; and (2) providing the resulting data generated upon completion of the programmable atomic operation by the processor core 605 together with the programmable atomic operation request data to the atomic operation request queue to store the resulting data in the second memory circuit 175.
- the core control circuitry 610 comprises control logic and thread selection circuitry 630 (to manage threads for the corresponding instructions that execute on the processor core 605), an instruction cache 640 storing instructions to perform the programmable atomic operation, and various types of memory and registers, including an execution queue 645, core control and thread memory 615 and data buffers 715.
- the programmable atomic operations circuitry 135 is embodied as a processor 110A or an HTP 115
- the processor 110A or HTP 115 further comprises a network communication interface 170, as previously described, for communication over the communication network 150.
- the control logic and thread selection circuitry 630 comprises circuitry formed using combinations of any of a plurality of various logic gates (e.g., NAND, NOR, AND, OR, EXCLUSIVE OR, etc.) and various state machine circuits (control logic circuit(s) 631, thread selection control circuitry 705), and multiplexers (e.g., input multiplexer 687, thread selection multiplexer 685), for example and without limitation.
- the network communication interface 170 includes input queues 205 to receive data packets from the communication network 150; output queues 210 to transfer data packets (including response packets) to the communication network 150; a data packet decoder circuit 215 to decode incoming data packets from the communication network 150, take data (in designated fields) and transfer the data provided in the packet to the relevant registers of the core control and thread memory 615; and data packet encoder circuit 220 to encode outgoing data packets (such as programmable atomic memory operation response packets, requests to the first memory circuit 125) for transmission on the communication network 150.
- the data packet decoder circuit 215 and the data packet encoder circuit 220 may each be implemented as state machines or other logic circuitry. Depending upon the selected embodiment, there may be separate core control circuitry 610 and separate core control and thread memory 615 for each processor core 605, or single core control circuitry 610 and single core control and thread memory 615 may be utilized for multiple processor cores 605.
- the programmable atomic operation request and read data is provided to the memory controller interface circuit 720 from the request selection multiplexer 305 and the data selection multiplexer 310, respectively, using communication bus or line 60, 60A.
- the request includes information on the source of the request (e.g., a source address), and an atomic memory operator identifier, which is a designation of the specific programmable atomic operation to be performed.
- Collectively, the source address, the atomic memory operator identifier, and the read data (as operand data) comprise a “work descriptor”.
- the programmable atomic operations allow the system 50, 50A user to define a set of atomic memory operations that are specific to a set of target applications. These programmable atomic operations comprise program instructions stored in the instruction cache 640 for execution by the processor core 605.
- the memory controller interface circuit 720 includes a set of registers 710 containing a translation table that translates the atomic memory operator identifier to a (virtual) instruction address, for selection of the instruction to begin execution of the programmable atomic operation by the processor core 605.
- the core control and thread memory 615 includes the set of registers 710 containing a translation table that translates the atomic memory operator identifier to a (virtual) instruction address, for selection of the instruction to begin execution of the programmable atomic operation by the processor core 605.
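The translation from an atomic memory operator identifier to a starting instruction address can be sketched as a simple lookup table. This Python model assumes the table is loaded ahead of time and that an unknown identifier yields no address; the class name `AtomicOpTable`, its methods, and the example addresses are hypothetical.

```python
class AtomicOpTable:
    """Illustrative translation table: operator identifier -> instruction address."""

    def __init__(self):
        self._table = {}

    def load(self, op_id, instruction_address):
        # Register the starting (virtual) instruction address for one operation.
        self._table[op_id] = instruction_address

    def start_address(self, op_id):
        # Return the address at which execution begins, or None if not registered.
        return self._table.get(op_id)


table = AtomicOpTable()
table.load(3, 0x8000)   # hypothetical identifier and address
table.load(4, 0x8040)

# A work descriptor carries the source address, the operator identifier,
# and the read data used as the operand.
work_descriptor = {"source": 0x22, "op_id": 4, "operand": 17}
start = table.start_address(work_descriptor["op_id"])
```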
- the control logic and thread selection circuitry 630 assigns an available thread identifier (ID) to the thread of the work descriptor, from thread ID pool registers (not separately illustrated), with the assigned thread ID used as an index to the other registers of the core control and thread memory 615 which are then populated with corresponding data from the work descriptor, and typically the program count and one or more arguments.
- the control logic and thread selection circuitry 630 initializes the remainder of the thread context state autonomously in preparation for starting the thread executing instructions for the programmable atomic operation, such as loading the data buffers 715, as needed.
- the data buffers 715 are utilized to minimize requests (such as requests to first memory circuit 125) by storing read request data and write data, pre-fetched data, and any interim results which may be generated during execution of the programmable atomic operation.
- That thread ID is given a valid status (indicating it is ready to execute), and the thread ID is pushed to the first priority queue 655 of the execution (ready-to-run) queue(s) 645, as threads for the programmable atomic operations are typically assigned the highest (first) priority available.
- when the programmable atomic operations circuitry 135 is embodied as an HTP 115, which may also be performing other operations, a dedicated queue for thread IDs for programmable atomic operations is provided, which again has the highest priority for rapid selection and execution.
- Selection circuitry of the control logic and thread selection circuitry 630, such as a multiplexer 685, selects the next thread ID in the execution (ready-to-run) queue(s) 645, which is used as an index into the core control and thread memory 615 (the program count registers and thread state registers), to select the instruction from the instruction cache 640 which is then provided to the execution pipeline 650 for execution.
- the execution pipeline then executes that instruction for the programmable atomic operation.
- the same triplet of information can be returned to the first priority queue 655 of the execution (ready-to-run) queue(s) 645, for continued selection for execution, to continue executing instructions for the programmable atomic operation, depending upon various conditions. For example, if the last instruction for a selected thread ID was a return instruction (indicating that thread execution for the programmable atomic operation was completed and resulting atomic operation data is being provided), the control logic and thread selection circuitry 630 will return the thread ID to the available pool of thread IDs in the thread ID pool registers, to be available for use by another, different thread.
- the valid indicator could change, such as changing to a pause state (such as while the thread may be waiting for information to be returned from or written to first memory circuit 125 or waiting for another event), and in which case, the thread ID (now having a pause status) is not returned to the execution (ready-to-run) queue(s) 645 until the status changes back to valid or a predetermined amount of time has elapsed (to avoid stalling or halting the programmable atomic operations circuitry 135 if the instructions provided by the user for the programmable atomic operation are problematic in any way).
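The thread ID lifecycle described above (assignment from a pool, placement in the ready-to-run queue, pausing, and return to the pool) can be sketched as follows. This Python model omits the timeout that also returns a paused thread to the queue, and uses a single queue in place of the prioritized queues; `ThreadManager` and its method names are hypothetical.

```python
from collections import deque


class ThreadManager:
    """Illustrative thread ID pool plus a single ready-to-run queue."""

    def __init__(self, num_threads):
        self.free_ids = deque(range(num_threads))  # stands in for the thread ID pool
        self.ready = deque()                       # stands in for the first priority queue
        self.status = {}                           # thread ID -> "valid" or "pause"
        self.context = {}                          # thread ID -> work descriptor

    def start(self, work_descriptor):
        # Assign an available thread ID and mark the thread ready to execute.
        if not self.free_ids:
            return None
        tid = self.free_ids.popleft()
        self.context[tid] = work_descriptor
        self.status[tid] = "valid"
        self.ready.append(tid)
        return tid

    def select_next(self):
        # Select the next thread ID to execute, if any is ready.
        return self.ready.popleft() if self.ready else None

    def after_instruction(self, tid, outcome):
        # outcome is "continue", "pause", or "return".
        if outcome == "return":
            del self.status[tid]
            del self.context[tid]
            self.free_ids.append(tid)   # thread ID becomes available for reuse
        elif outcome == "pause":
            self.status[tid] = "pause"  # not re-queued until it becomes valid again
        else:
            self.ready.append(tid)

    def resume(self, tid):
        self.status[tid] = "valid"
        self.ready.append(tid)


tm = ThreadManager(2)
t0 = tm.start({"op_id": 4, "operand": 17})
picked = tm.select_next()
tm.after_instruction(picked, "pause")
none_ready = tm.select_next()
tm.resume(t0)
picked_again = tm.select_next()
tm.after_instruction(picked_again, "return")
```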
- the return information (thread ID and return arguments) is then pushed by the execution pipeline 650 to the network command queue 690, which is typically implemented as first-in, first-out (FIFO).
- the thread ID is used as an index into the thread return registers to obtain the return information, such as the transaction ID and the source (caller) address (or other identifier), and the packet encoder circuit (in the network communication interface 170 of the first memory controller circuit 100 or in the network communication interface 170 of the processor 110A) then generates an outgoing return data packet (on the communication network 150).
- the resulting data is provided to the memory controller interface circuit 720, for the resulting data to be written to the second memory circuit 175.
- the set of programmable atomic operations supported by the system 50, 50A (and corresponding instructions) are established at boot time.
- the table in the registers 710 that translates the atomic memory operator identifier to a (virtual) instruction address is loaded by the operating system.
- Various constraints may also be placed upon the programmable atomic operations, such as limiting the corresponding write address space to areas of memory for which the given programmable atomic operation has write privileges, and limiting the number of instructions which may be executed in the programmable atomic operation, to ensure that the programmable atomic operation completes (and the hazard bit is subsequently cleared or reset when the resulting data is written to the second memory circuit 175).
- the hazard bit may be subsequently cleared or reset automatically after a predetermined period of time, to not allow the programmable atomic operation to reserve a cache line in the second memory circuit 175 indefinitely.
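The hazard-bit lifecycle described above (set on arrival, cleared on completion, or cleared automatically after a predetermined period) can be illustrated with a small model. This is a sketch under stated assumptions: `HazardTable`, its methods, and the integer `now` timestamps are hypothetical and stand in for whatever timing mechanism the hardware uses.

```python
class HazardTable:
    """Toy model of per-cache-line hazard bits with a timeout (hypothetical names)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.set_at = {}  # cache-line address -> time at which the hazard bit was set

    def is_locked(self, line, now):
        t = self.set_at.get(line)
        if t is None:
            return False
        if now - t >= self.timeout:
            del self.set_at[line]  # automatic clear after the predetermined period
            return False
        return True

    def acquire(self, line, now):
        """Set the hazard bit for a line; returns False if it is already held."""
        if self.is_locked(line, now):
            return False
        self.set_at[line] = now
        return True

    def clear(self, line):
        """Explicit clear, e.g. when the resulting data has been written back."""
        self.set_at.pop(line, None)
```

The timeout path is what keeps a misbehaving user-defined operation from reserving a cache line indefinitely.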
- the incoming work descriptor is also utilized to initialize a new thread (for the programmable atomic operation) in thread registers in the core control and thread memory 615, such as with information shown in Table 5, for example.
- the created thread executes standard and custom RISC-V instructions to perform the programmable (custom) atomic operation.
- the RISC-V instructions would perform the defined operation on A1 and A2, writing the result to the address provided in A0.
- the address in A0 would also be used for clearing the memory line hazard lock.
- a load non-buffered instruction (Load Non-Buffered (NB)) checks for a buffer hit in data buffers 715, but on a buffer miss will issue a memory request for just the requested operand and not put the obtained data in a buffer. Instructions of this type have an NB suffix (non-buffered).
- the NB load instructions are expected to be used in runtime libraries written in assembly.
- Example load instructions are listed in Table 6, which shows a Load Non-Buffered Instruction Format.
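The difference between a buffered load and the non-buffered (NB) load described above can be modeled as follows. This is an illustrative sketch only: the class `DataBuffers`, the counter `bytes_fetched`, and the 64-byte line size are assumptions (the text mentions 64 bytes only as an example), and the operand is assumed not to cross a line boundary.

```python
LINE_BYTES = 64  # assumed buffer (line) size


class DataBuffers:
    """Toy model of data buffers in front of a memory (hypothetical names)."""

    def __init__(self, memory):
        self.memory = memory    # bytearray standing in for the memory circuit
        self.lines = {}         # line base address -> buffered bytes of that line
        self.bytes_fetched = 0  # total bytes requested from memory

    def load(self, addr, size=8):
        """Buffered load: on a miss, fetch and keep the whole line."""
        base = addr - addr % LINE_BYTES
        if base not in self.lines:
            self.lines[base] = bytes(self.memory[base:base + LINE_BYTES])
            self.bytes_fetched += LINE_BYTES
        off = addr - base
        return self.lines[base][off:off + size]

    def load_nb(self, addr, size=8):
        """Non-buffered load: check for a hit, but never allocate a buffer on a miss."""
        base = addr - addr % LINE_BYTES
        if base in self.lines:                       # buffer hit: serve from the buffer
            off = addr - base
            return self.lines[base][off:off + size]
        self.bytes_fetched += size                   # miss: request just the operand
        return bytes(self.memory[addr:addr + size])
```

In this model a single 8-byte NB load costs 8 bytes of memory traffic rather than a full line, which is the saving the NB variant is meant to provide.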
- Programmable (custom) atomic operations set a hazard bit (as a “lock”) on the cache line of the second memory circuit 175 (as the provided address) when an atomic operation is received by the first or second memory controller circuit 100, 100A.
- the programmable atomic operations circuitry 135 indicates when the lock should be cleared. This should occur on the last store operation that the programmable atomic operations circuitry 135 performs for the programmable (custom) atomic operation (or on an atomic return instruction if no store is required).
- the programmable atomic operations circuitry 135 indicates that the hazard bit is to be cleared or reset by executing a hazard clear store operation.
- the Atomic Return (AR) instruction is used to complete the executing thread of a programmable atomic operation, optionally clear the hazard bit (lock), and optionally provide a response back to the source that issued the programmable atomic operation.
- the AR instruction can send zero, one, or two 8-byte argument values back to the issuing compute element.
- the number of arguments to send back is determined by the AC2 suffix (A1 or A2).
- No suffix means zero arguments.
- A1 implies a single 8-byte argument.
- A2 implies two 8-byte arguments.
- the arguments, if needed, are obtained from RISC-V X registers a1 and a2.
- the AR instruction is also able to clear the hazard bit previously set for the cache line of the second memory circuit 175 (as the provided address) associated with the atomic instruction.
- the AR instruction uses the value in the a0 register as the address to which the clear lock operation is sent.
- the clear lock operation is issued if the instruction contains the suffix CL.
- the clear lock operation uses the value in register A0 as the address to be used to clear the lock.
- the AR instruction has two variants, depending on whether the hazard bit lock associated with the atomic operation is to be cleared or not.
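The suffix rules for the Atomic Return instruction can be summarized in a few lines. The sketch below models only the effect described in the text; the function name `atomic_return`, the dictionary of registers, and the way the suffixes are passed as parameters are hypothetical, and the actual instruction encoding is not represented.

```python
def atomic_return(regs, ac2=None, clear_lock=False):
    """Model the effect of an Atomic Return.

    regs       -- dict with registers "a0", "a1", "a2"
    ac2        -- None (no arguments), "A1" (one argument) or "A2" (two arguments)
    clear_lock -- True when the CL suffix is present

    Returns (response_args, clear_lock_address); the address is None when
    no clear lock operation is issued.
    """
    counts = {None: 0, "A1": 1, "A2": 2}
    n = counts[ac2]                                   # KeyError on an unknown suffix
    args = [regs["a1"], regs["a2"]][:n]               # 8-byte values taken from a1, a2
    clear_addr = regs["a0"] if clear_lock else None   # a0 holds the locked address
    return args, clear_addr
```

For example, with `regs = {"a0": 0x1000, "a1": 7, "a2": 9}`, the A2 variant with CL yields both argument values and the address `0x1000` for the clear lock operation.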
- the proposed instructions are shown in Table 8, which is a diagram illustrating Atomic Return Instruction Formats.
- Table 9 shows the AC2 suffix options for the Atomic Return instruction.
- Example 1 shows an atomic fetch and XOR operation implemented as a programmable (custom) atomic operation.
- a1, a2 // a1 contains memory value, a2 contains value to be
- the first instruction, xor.d, performs the XOR operation on the atomic operand and accessed memory values.
- the second instruction, sd.cl, stores the 64-bit value in register a2 to the atomic operation address provided in register a0. Additionally, the store operation is used to clear the hazard bit previously set for the cache line of the second memory circuit 175 (as the provided address).
- the last instruction, ar, causes the thread to be terminated. It should be noted that the sd.cl instruction stores the 64-bit value to a thread write buffer but does not force the write buffer to memory.
- the ar instruction is used to force all dirty write buffers to memory. This implies that the sd.cl instruction writes its 64-bit value to a write buffer, then marks the write buffer as needing to clear the associated hazard bit when it is written to the second memory circuit 175 by the ar instruction.
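The three-step sequence of Example 1 (xor.d, sd.cl, ar) can be mirrored in a small software model. This is a sketch under stated assumptions, not the patent's instruction listing: `fetch_and_xor`, the dictionary used as memory, the set used for hazard bits, and the choice to return the previously stored value are all hypothetical.

```python
MASK64 = (1 << 64) - 1


def fetch_and_xor(memory, hazard, addr, operand):
    """Toy model of a fetch-and-XOR programmable atomic operation."""
    hazard.add(addr)                     # hazard bit set when the request arrives
    a0, a1, a2 = addr, memory[addr], operand

    a2 = (a1 ^ a2) & MASK64              # xor.d: combine memory value and operand

    write_buffer = {a0: a2}              # sd.cl: value goes to a write buffer only,
    clear_on_flush = True                # tagged so the flush also clears the lock

    memory.update(write_buffer)          # ar: force dirty write buffers to memory
    if clear_on_flush:
        hazard.discard(a0)               # lock released as part of the write-back
    return a1                            # assumed: the fetched (old) value is returned
```

The point the model makes explicit is ordering: the lock is released only when the buffered store actually reaches memory, not when sd.cl executes.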
- Example 2 shows a Double Compare and Swap (DCAS) operation implemented as a programmable atomic operation.
- Example 2 shows the use of the non-buffered load instruction, ld.nb.
- the non-buffered load is used to pull in from memory just the required 8-bytes.
- Using the non-buffered load avoids prefetching the full memory buffer (e.g., 64-bytes).
- Example 2 also shows the use of a sequence of store instructions, sd and sd.cl.
- the first instruction, sd, writes 64 bits to a write buffer.
- the second instruction, sd.cl, writes a second 64-bit value to the write buffer.
- the ar instruction writes the 16-bytes of dirty data in the write buffer back to memory as a single request, tagged with the need to clear the hazard bit.
- This implementation of the DCAS returns a single value indicating success or failure.
- the second ar instruction is used to clear the hazard bit since no previous store performed the operation.
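The control flow of Example 2 can be sketched as follows. The text does not give the full instruction listing or the argument-passing convention, so this model rests on assumptions: the function `dcas`, its parameters, the use of two adjacent 8-byte words, and the 1/0 return encoding for success/failure are all hypothetical.

```python
def dcas(memory, hazard, addr, expect_lo, expect_hi, new_lo, new_hi):
    """Toy model of a Double Compare and Swap on two adjacent 8-byte words."""
    hazard.add(addr)                      # hazard bit set when the request arrives
    lo = memory[addr]                     # non-buffered loads fetch only the
    hi = memory[addr + 8]                 # 8-byte operands that are needed

    if lo == expect_lo and hi == expect_hi:
        write_buffer = {addr: new_lo,     # sd: first 64-bit value
                        addr + 8: new_hi} # sd.cl: second value, tagged clear-lock
        memory.update(write_buffer)       # ar: 16 dirty bytes written as one request
        hazard.discard(addr)              # lock cleared by the tagged write-back
        return 1                          # success

    hazard.discard(addr)                  # failure path: no store happened, so the
    return 0                              # ar instruction itself clears the lock
```

Both exits release the lock, but by different means: the success path relies on the tagged store, while the failure path needs the clear-lock variant of the return because nothing was written.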
- the representative apparatus, system and method provide for a memory controller 100, 100A which has high performance and is energy efficient.
- Representative embodiments of the memory controller 100, 100A provide support for compute intensive kernels or operations which require considerable and highly frequent memory accesses, such as for performing Fast Fourier Transform (“FFT”) operations, finite impulse response (“FIR”) filtering, and other compute intensive operations typically used in larger applications such as synthetic aperture radar, 5G networking and 5G base station operations, machine learning, AI, stencil code operations, and graph analytic operations such as graph clustering using spectral techniques, for example and without limitation.
- Representative embodiments of the memory controller 100, 100A are optimized for high throughput and low latency, including high throughput and low latency for atomic operations, including a wide range of atomic operations, both predetermined atomic operations and also programmable or user-defined atomic operations.
- Representative embodiments of the memory controller 100, 100A produced dramatic results compared to a state-of-the-art X86 server platform. For example, representative embodiments of the memory controller 100, 100A provided over a three-fold (3.48x) better atomic update performance using a standard GDDR6 DRAM memory, and provided a seventeen-fold (17.6x) better atomic update performance using a modified GDDR6 DRAM memory (having more memory banks).
- the representative embodiments of the memory controller 100, 100A also provided for very low latency and high throughput memory read and write operations, generally only limited by the memory bank availability, error correction overhead, and the bandwidth (Gb/s) available over communication network 150 and the memory 125, 175 devices themselves, resulting in a flat latency until maximum bandwidth is achieved.
- Representative embodiments of the memory controller 100, 100A also provide very high performance (high throughput and low latency) for programmable or user-defined atomic operations, comparable to the performance of predetermined atomic operations.
- Additional, direct data paths provided for the programmable atomic operations circuitry 135 executing the programmable or user-defined atomic operations allow for additional write operations without any limitations imposed by the bandwidth of the communication network 150 and without increasing any congestion of the communication network 150.
- a “processor core” 605 may be any type of processor core, and may be embodied as one or more processor cores configured, designed, programmed or otherwise adapted to perform the functionality discussed herein.
- a “processor” 110, 110A may be any type of processor, and may be embodied as one or more processors configured, designed, programmed or otherwise adapted to perform the functionality discussed herein.
- a processor 110, 110A may include use of a single integrated circuit (“IC”), or may include use of a plurality of integrated circuits or other components connected, arranged or grouped together, such as controllers, microprocessors, digital signal processors (“DSPs”), array processors, graphics or image processors, parallel processors, multiple core processors, custom ICs, application specific integrated circuits (“ASICs”), field programmable gate arrays (“FPGAs”), adaptive computing ICs, associated memory (such as RAM, DRAM and ROM), and other ICs and components, whether analog or digital.
- processor or controller should be understood to equivalently mean and include a single IC, or arrangement of custom ICs, ASICs, processors, microprocessors, controllers, FPGAs, adaptive computing ICs, or some other grouping of integrated circuits which perform the functions discussed herein, with associated memory, such as microprocessor memory or additional RAM, DRAM, SDRAM, SRAM, MRAM, ROM, FLASH, EPROM or E²PROM.
- a processor 110, 110A, with associated memory may be adapted or configured (via programming, FPGA interconnection, or hard wiring) to perform the methodology of the invention, as discussed herein.
- the methodology may be programmed and stored, in a processor 110, 110A with its associated memory (and/or memory 125) and other equivalent components, as a set of program instructions or other code (or equivalent configuration or other program) for subsequent execution when the processor 110, 110A is operative (i.e., powered on and functioning).
- the processor 110, 110A may be implemented in whole or part as FPGAs, custom ICs and/or ASICs; the FPGAs, custom ICs or ASICs also may be designed, configured and/or hard-wired to implement the methodology of the invention.
- the processor 110, 110A may be implemented as an arrangement of analog and/or digital circuits, controllers, microprocessors, DSPs and/or ASICs, collectively referred to as a “processor” or “controller”, which are respectively hard-wired, programmed, designed, adapted or configured to implement the methodology of the invention, including possibly in conjunction with a memory 125.
- the first memory circuit 125 and the second memory circuit 175, which may include a data repository (or database), may be embodied in any number of forms, including within any computer or other machine-readable data storage medium, memory device or other storage or communication device for storage or communication of information, currently known or which becomes available in the future, including, but not limited to, a memory integrated circuit (“IC”), or memory portion of an integrated circuit (such as the resident memory within a processor 110, 110A or processor IC), whether volatile or non-volatile, whether removable or non-removable, including without limitation RAM, FLASH, DRAM, SDRAM, SRAM, MRAM, FeRAM, ROM, EPROM or E²PROM, or any other form of memory device, such as a magnetic hard drive, an optical drive, a magnetic disk or tape drive, a hard disk drive, other machine-readable storage or memory media such as a floppy disk, a CDROM, a CD-RW, digital versatile disk (DVD) or other optical memory, or any other type of memory, storage medium
- the processor 110, 110A is hard-wired or programmed, using software and data structures of the invention, for example, to perform the methodology of the present invention.
- the system and related methods of the present invention may be embodied as software which provides such programming or other instructions, such as a set of instructions and/or metadata embodied within a non-transitory computer readable medium, discussed above.
- metadata may also be utilized to define the various data structures of a look up table or a database.
- Such software may be in the form of source or object code, by way of example and without limitation. Source code further may be compiled into some form of instructions or object code (including assembly language instructions or configuration information).
- the software, source code or metadata of the present invention may be embodied as any type of code, such as C,
- a “construct”, “program construct”, “software construct” or “software”, as used equivalently herein, means and refers to any programming language, of any kind, with any syntax or signatures, which provides or can be interpreted to provide the associated functionality or methodology specified (when instantiated or loaded into a processor or computer and executed, including the processor 110, 110A, for example).
- the software, metadata, or other source code of the present invention and any resulting bit file may be embodied within any tangible, non-transitory storage medium, such as any of the computer or other machine-readable data storage media, as computer-readable instructions, data structures, program modules or other data, such as discussed above with respect to the memory 125, e.g., a floppy disk, a CDROM, a CD-RW, a DVD, a magnetic hard drive, an optical drive, or any other type of data storage apparatus or medium, as mentioned above.
- the communication interface(s) 130 are utilized for appropriate connection to a relevant channel, network or bus; for example, the communication interface(s) 130 may provide impedance matching, drivers and other functions for a wireline or wireless interface, may provide demodulation and analog to digital conversion for a wireless interface, and may provide a physical interface, respectively, for the processor 110, 110A and/or memory 125, with other devices.
- the communication interface(s) 130 are used to receive and transmit data, depending upon the selected embodiment, such as program instructions, parameters, configuration information, control messages, data and other pertinent information.
- the communication interface(s) 130 may be implemented as known or may become known in the art, to provide data communication between the system 50, 50A and any type of network or external device, such as wireless, optical, or wireline, and using any applicable standard (e.g., one of the various PCI, USB, RJ 45, Ethernet (Fast Ethernet, Gigabit Ethernet, 100Base-TX, 100Base-FX, etc.), IEEE 802.11, Bluetooth, WCDMA, WiFi, GSM,
- the communication interface(s) 130 may also be configured and/or adapted to receive and/or transmit signals externally to the system 50, 50A, such as through hard-wiring or RF or infrared signaling, for example, to receive information in real-time for output on a display, for example.
- the communication interface(s) 130 may provide connection to any type of bus or network structure or medium, using any selected architecture.
- such architectures include Industry Standard Architecture (ISA) bus, Enhanced ISA (EISA) bus, Micro Channel Architecture (MCA) bus, Peripheral Component Interconnect (PCI) bus, SAN bus, or any other communication or signaling medium, such as Ethernet, ISDN, T1, satellite, wireless, and so on.
- each intervening number there between with the same degree of precision is explicitly contemplated.
- the numbers 7 and 8 are contemplated in addition to 6 and 9, and for the range 6.0-7.0, the numbers 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, and 7.0 are explicitly contemplated.
- every intervening sub-range within the range is contemplated, in any combination, and is within the scope of the disclosure.
- the sub-ranges 5 - 6, 5 - 7, 5 - 8, 5 - 9, 6 - 7, 6 - 8, 6 - 9, 6 - 10, 7 - 8, 7 - 9, 7 - 10, 8 - 9, 8 - 10, and 9 - 10 are contemplated and within the scope of the disclosed range.
- Figures can also be implemented in a more separate or integrated manner, or even removed or rendered inoperable in certain cases, as may be useful in accordance with a particular application. Integrally formed combinations of components are also within the scope of the invention, particularly for embodiments in which a separation or combination of discrete components is unclear or indiscernible.
- use of the term “coupled” herein, including in its various forms such as “coupling” or “couplable”, means and includes any direct or indirect electrical, structural or magnetic coupling, connection or attachment, or adaptation or capability for such a direct or indirect electrical, structural or magnetic coupling, connection or attachment, including integrally formed components and components which are coupled via or through another component.
- a metric is a measure of a state of at least part of the regulator or its inputs or outputs.
- a parameter is considered to represent a metric if it is related to the metric directly enough that regulating the parameter will satisfactorily regulate the metric.
- a parameter may be considered to be an acceptable representation of a metric if it represents a multiple or fraction of the metric.
- any signal arrows in the drawings/Figures should be considered only exemplary, and not limiting, unless otherwise specifically noted. Combinations of components or steps will also be considered within the scope of the present invention, particularly where the ability to separate or combine is unclear or foreseeable.
- the disjunctive term “or”, as used herein and throughout the claims that follow, is generally intended to mean “and/or”, having both conjunctive and disjunctive meanings (and is not confined to an “exclusive or” meaning), unless otherwise indicated.
- “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise.
- the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Human Computer Interaction (AREA)
- Computing Systems (AREA)
- Microelectronics & Electronic Packaging (AREA)
- Multimedia (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
A memory controller circuit (100) is coupleable to a first memory circuit (125), such as a DRAM, and includes: a first memory control circuit (155) to read from or write to the first memory circuit; a second memory circuit (175), such as an SRAM; a second memory control circuit (160) adapted to read from the second memory circuit in response to a read request when the requested data is stored in the second memory circuit, and otherwise to transfer the read request to the first memory control circuit; predetermined atomic operations circuitry (185); and programmable atomic operations circuitry (135) adapted to perform at least one programmable atomic operation. The second memory control circuit also transfers a received programmable atomic operation request to the programmable atomic operations circuitry and sets a hazard bit for a cache line of the second memory circuit.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201980010364.9A CN111656335B (zh) | 2018-01-29 | 2019-01-28 | Memory controller |
EP19705276.4A EP3746902B1 (fr) | 2018-01-29 | 2019-01-28 | Memory controller |
EP23199323.9A EP4276625A3 (fr) | 2018-01-29 | 2019-01-28 | Memory controller |
KR1020227018968A KR20220083849A (ko) | 2018-01-29 | 2019-01-28 | Memory controller |
KR1020207024833A KR102407128B1 (ko) | 2018-01-29 | 2019-01-28 | Memory controller |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862623331P | 2018-01-29 | 2018-01-29 | |
US62/623,331 | 2018-01-29 | ||
US16/259,862 | 2019-01-28 | ||
US16/259,862 US10956086B2 (en) | 2018-01-29 | 2019-01-28 | Memory controller |
US16/259,879 | 2019-01-28 | ||
US16/259,879 US10915271B2 (en) | 2018-01-29 | 2019-01-28 | Memory controller with programmable atomic operations |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019148129A1 true WO2019148129A1 (fr) | 2019-08-01 |
Family
ID=65411950
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2019/015463 WO2019148129A1 (fr) | 2018-01-29 | 2019-01-28 | Memory controller |
PCT/US2019/015467 WO2019148131A1 (fr) | 2018-01-29 | 2019-01-28 | Memory controller with programmable atomic operations |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2019/015467 WO2019148131A1 (fr) | 2018-01-29 | 2019-01-28 | Memory controller with programmable atomic operations |
Country Status (2)
Country | Link |
---|---|
KR (2) | KR20220083849A (fr) |
WO (2) | WO2019148129A1 (fr) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070005908A1 (en) * | 2005-06-29 | 2007-01-04 | Sridhar Lakshmanamurthy | Method and apparatus to enable I/O agents to perform atomic operations in shared, coherent memory spaces |
CN104516831A (zh) * | 2013-09-26 | 2015-04-15 | Imagination Technologies Ltd (想象技术有限公司) | Atomic memory update unit and method |
US20170153975A1 (en) * | 2015-11-27 | 2017-06-01 | Arm Limited | Apparatus and method for handling atomic update operations |
-
2019
- 2019-01-28 KR KR1020227018968A patent/KR20220083849A/ko active IP Right Grant
- 2019-01-28 KR KR1020227018954A patent/KR20220083848A/ko active IP Right Grant
- 2019-01-28 WO PCT/US2019/015463 patent/WO2019148129A1/fr active Search and Examination
- 2019-01-28 WO PCT/US2019/015467 patent/WO2019148131A1/fr active Search and Examination
Also Published As
Publication number | Publication date |
---|---|
KR20220083849A (ko) | 2022-06-20 |
WO2019148131A1 (fr) | 2019-08-01 |
KR20220083848A (ko) | 2022-06-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11461048B2 (en) | Memory controller with programmable atomic operations | |
US11880687B2 (en) | System having a hybrid threading processor, a hybrid threading fabric having configurable computing elements, and a hybrid interconnection network | |
US20220276794A1 (en) | On-chip Atomic Transaction Engine | |
KR101753913B1 (ko) | 기계 비전 알고리즘을 위한 멀티프로세서 시스템온칩 | |
US5586294A (en) | Method for increased performance from a memory stream buffer by eliminating read-modify-write streams from history buffer | |
US5388247A (en) | History buffer control to reduce unnecessary allocations in a memory stream buffer | |
US8683140B2 (en) | Cache-based speculation of stores following synchronizing operations | |
US10572179B2 (en) | Speculatively performing memory move requests with respect to a barrier | |
US11023410B2 (en) | Instructions for performing multi-line memory accesses | |
US10140052B2 (en) | Memory access in a data processing system utilizing copy and paste instructions | |
US20180052788A1 (en) | Memory move supporting speculative acquisition of source and destination data granules | |
US12019920B2 (en) | Memory controller with programmable atomic operations | |
WO2019148129A1 (fr) | Memory controller | |
US9996298B2 (en) | Memory move instruction sequence enabling software control |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19705276 Country of ref document: EP Kind code of ref document: A1 |
|
DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | ||
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 20207024833 Country of ref document: KR Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 2019705276 Country of ref document: EP Effective date: 20200831 |