EP3977296B1 - Method and apparatus for facilitating write miss caching in a cache system - Google Patents

Method and apparatus for facilitating write miss caching in a cache system

Info

Publication number
EP3977296B1
EP3977296B1
Authority
EP
European Patent Office
Prior art keywords
cache
data
write
address
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP20815004.5A
Other languages
English (en)
French (fr)
Other versions
EP3977296A4 (de)
EP3977296A1 (de)
Inventor
Naveen Bhoria
Timothy David ANDERSON
Pete Michael HIPPLEHEUSER
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Texas Instruments Inc
Original Assignee
Texas Instruments Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Texas Instruments Inc filed Critical Texas Instruments Inc
Priority to EP25193762.9A (EP4636601A2)
Publication of EP3977296A1
Publication of EP3977296A4
Application granted
Publication of EP3977296B1
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • G06F12/121Replacement control using replacement algorithms
    • G06F12/128Replacement control using replacement algorithms adapted to multidimensional cache systems, e.g. set-associative, multicache, multiset or multilevel
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1008Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices
    • G06F11/1064Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's in individual solid state devices in cache or content addressable memories
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0215Addressing or allocation; Relocation with look ahead addressing means
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023Free address space management
    • G06F12/0238Memory management in non-volatile memory, e.g. resistive RAM or ferroelectric memory
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/0223User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/0292User address space allocation, e.g. contiguous or non contiguous base addressing using tables or multilevel address translation means
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0804Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with main memory updating
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815Cache consistency protocols
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815Cache consistency protocols
    • G06F12/0817Cache consistency protocols using directory methods
    • G06F12/082Associative directories
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0844Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F12/0853Cache with multiport tag or data arrays
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0844Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F12/0855Overlapped cache accessing, e.g. pipeline
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0864Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using pseudo-associative means, e.g. set-associative or hashing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • G06F12/0884Parallel mode, e.g. in parallel with main memory or CPU
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0888Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using selective caching, e.g. bypass
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0891Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using clearing, invalidating or resetting means
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0895Caches characterised by their organisation or structure of parts of caches, e.g. directory or tag array
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure
    • G06F12/0897Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation
    • G06F12/1027Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • G06F12/121Replacement control using replacement algorithms
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • G06F12/121Replacement control using replacement algorithms
    • G06F12/126Replacement control using replacement algorithms with special data handling, e.g. priority of data or instructions, handling errors or pinning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12Replacement control
    • G06F12/121Replacement control using replacement algorithms
    • G06F12/126Replacement control using replacement algorithms with special data handling, e.g. priority of data or instructions, handling errors or pinning
    • G06F12/127Replacement control using replacement algorithms with special data handling, e.g. priority of data or instructions, handling errors or pinning using additional replacement algorithms
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F13/1642Handling requests for interconnection or transfer for access to memory bus based on arbitration with request queuing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1668Details of memory controller
    • G06F13/1673Details of memory controller using buffers
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1668Details of memory controller
    • G06F13/1689Synchronisation and timing concerns
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • G06F15/8061Details on data memory access
    • G06F15/8069Details on data memory access using a cache
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30043LOAD or STORE instructions; Clear instruction
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047Prefetch instructions; cache control instructions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/546Message passing systems or structures, e.g. queues
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/38Response verification devices
    • G11C29/42Response verification devices using error correcting codes [ECC] or parity check
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44Indication or identification of errors, e.g. for repair
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/52Protection of memory contents; Detection of errors in memory contents
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/70Masking faults in memories by using spares or by reconfiguring
    • G11C29/72Masking faults in memories by using spares or by reconfiguring with optimized replacement algorithms
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/70Masking faults in memories by using spares or by reconfiguring
    • G11C29/76Masking faults in memories by using spares or by reconfiguring using address translation or modifications
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C5/00Details of stores covered by group G11C11/00
    • G11C5/06Arrangements for interconnecting storage elements electrically, e.g. by wiring
    • G11C5/066Means for reducing external access-lines for a semiconductor memory clip, e.g. by multiplexing at least address and data signals
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C7/00Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C7/00Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C7/1015Read-write modes for single port memories, i.e. having either a random port or a serial port
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C7/00Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C7/1051Data output circuits, e.g. read-out amplifiers, data output buffers, data output registers, data output level conversion circuits
    • G11C7/106Data output latches
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C7/00Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C7/1075Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers for multiport memories each having random access ports and serial ports, e.g. video RAM
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C7/00Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C7/1078Data input circuits, e.g. write amplifiers, data input buffers, data input registers, data input level conversion circuits
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C7/00Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/10Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C7/1078Data input circuits, e.g. write amplifiers, data input buffers, data input registers, data input level conversion circuits
    • G11C7/1087Data input latches
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C7/00Arrangements for writing information into, or reading information out from, a digital store
    • G11C7/22Read-write [R-W] timing or clocking circuits; Read-write [R-W] control signal generators or management 
    • G11C7/222Clock generating, synchronizing or distributing circuits within memory device
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1021Hit rate improvement
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1016Performance improvement
    • G06F2212/1024Latency reduction
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1041Resource optimization
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10Providing a specific technical effect
    • G06F2212/1041Resource optimization
    • G06F2212/1044Space efficiency improvement
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/30Providing cache or TLB in specific location of a processing system
    • G06F2212/301In special purpose processing node, e.g. vector processor
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45Caching of specific data in cache memory
    • G06F2212/454Vector or matrix data
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60Details of cache memory
    • G06F2212/603Details of cache memory of operating mode, e.g. cache mode or local memory mode
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60Details of cache memory
    • G06F2212/6032Way prediction in set-associative cache
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60Details of cache memory
    • G06F2212/6042Allocation of cache space to multiple users or processors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60Details of cache memory
    • G06F2212/608Details relating to cache mapping
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/62Details of cache specific to multiprocessor cache arrangements
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0409Online test
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C2029/0411Online error correction
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44Indication or identification of errors, e.g. for repair
    • G11C29/4401Indication or identification of errors, e.g. for repair for self repair
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Computing systems include one or more processing cores to execute instructions by accessing data stored in memory.
  • The amount of time it takes for a processing core to access data from the memory can be significant.
  • To reduce this latency, most computing systems include a cache, which stores an amount of data from the memory (e.g., frequently smaller than the total amount of data in the memory) that has a high probability of being accessed by the processing core in the future.
  • The cache can provide the data to the processing core faster than the processing core could retrieve it from the memory, thereby increasing the speed and efficiency of the computing system.
  • Document "Adaptive placement and migration policy for an STT-RAM-based hybrid cache” discloses a system structure comprising a core cache and a Last Level Cache (LLC).
  • the LLC has a hybrid STT-RAM and SRAM based architecture. Dirty data are evicted from the core cache and written back to the LLC.
  • a core-write is a write from a core.
  • a selected core cache miss and a core-write request trigger a prediction from a prediction table.
  • Stating that any part (e.g., a layer, film, area, region, or plate) is in any way on another part indicates that the referenced part is either in contact with the other part, or that the referenced part is above the other part with one or more intermediate part(s) located therebetween.
  • Some local memory devices include one or more victim caches.
  • a victim cache is an additional storage included in or connected to a cache.
  • Victim caches improve (e.g., reduce) the cache miss rate, and particularly reduce conflict misses, by storing data recently evicted from the corresponding cache.
  • the addition of a victim cache can have a similar impact on cache performance. The benefit is most evident when a victim cache is added to a direct mapped cache, because a direct mapped cache has a relatively high rate of conflict misses (see the illustrative sketch below).
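  • Purely as an illustration (not the patent's implementation), the C sketch below shows one way a direct-mapped cache backed by a small fully-associative victim cache could service a lookup; the sizes and the names cache_lookup, main_cache, and victim are assumptions.
```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES    64u
#define MAIN_LINES   256u   /* direct-mapped main cache          */
#define VICTIM_LINES   8u   /* small fully-associative victim    */

typedef struct { bool valid; uint64_t tag; uint8_t data[LINE_BYTES]; } line_t;

static line_t main_cache[MAIN_LINES];
static line_t victim[VICTIM_LINES];

/* Look up an address: main cache first, then the victim cache.
 * On a victim hit, the victim line and the conflicting main line are
 * swapped so the most recently used copy lives in the main cache.   */
static bool cache_lookup(uint64_t addr, uint8_t out[LINE_BYTES])
{
    uint64_t tag = addr / LINE_BYTES;
    unsigned set = (unsigned)(tag % MAIN_LINES);

    if (main_cache[set].valid && main_cache[set].tag == tag) {   /* main hit */
        memcpy(out, main_cache[set].data, LINE_BYTES);
        return true;
    }
    for (unsigned i = 0; i < VICTIM_LINES; i++) {                /* victim hit */
        if (victim[i].valid && victim[i].tag == tag) {
            line_t tmp = main_cache[set];                        /* swap lines */
            main_cache[set] = victim[i];
            victim[i] = tmp;
            memcpy(out, main_cache[set].data, LINE_BYTES);
            return true;
        }
    }
    return false;   /* miss: caller fetches from the next level */
}
```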
  • FIG. 1 illustrates an example computing system 100.
  • the example computing system 100 includes an example CPU 102, example processing cores 104a-104n, an example extended memory 106, and an example data cache 108.
  • the example data cache 108 includes an example level one (L1) cache 110, an example level two (L2) cache 112, and an example level three (L3) cache 114.
  • the example computing system 100 of FIG. 1 includes N processing cores and three levels of cache.
  • the example computing system 100 may include any number of processing cores and/or levels of cache.
  • one or more of the example components of the computing system 100 may be implemented on the same die and/or different dies in the same chip
  • a scalar read operation may be transmitted via the scalar interface while the data corresponding to the scalar read may be transmitted over the vector interface.
  • either the scalar interface and/or the vector interface may be used if the other interface is busy.
  • the CPU 102 may be connected to the data cache 108 using a different number and/or different types of interfaces.
  • the example core 104a transmits the read and/or write instructions to the example data cache 108.
  • If the data cache 108 includes the data corresponding to the instructions from the core 104a (e.g., corresponding to a cache hit), the data cache 108 fulfills the request and/or instructions from the processing core 104a. If the data cache 108 does not include the data corresponding to the instructions from the core 104a (e.g., corresponding to a cache miss), the data cache 108 interfaces with the example extended memory 106 to perform the transaction from the core 104a.
  • the CPU 102 transmits memory operations to the example L1 cache 110 and if the memory operation cannot be served by the L1 cache 110, the L1 cache 110 transmits the memory operation to L2 cache 112, and so on.
  • the L3 cache 114 (e.g., the highest level cache) interacts with the extended memory 106 to read or write the corresponding data to the memory address.
  • the extended memory 106 may be on chip or off chip memory (e.g., DDR) and the interface to the extended memory may be 2 N bits, where N depends on the type of extended memory used.
  • Any of the data caches can pull data from the example extended memory 106 before execution of a program so that the data is stored locally at the cache before the CPU 102 executes any instructions; for example, the memory 106 provides copies of the data stored in the memory to the example data cache 108.
  • the data cache 108 may request additional information and/or instruct the extended memory 106 to adjust the stored data in the extended memory 106 periodically, aperiodically, and/or based on a trigger, based on instructions from the CPU 102.
  • the example data cache 108 of FIG. 1 stores blocks of data (e.g., a cached subset of the data stored in the extended memory 106) from the example extended memory 106 to reduce the time needed for the example CPU 102 to access the cached subset, thereby improving system performance. For best performance, attempts are made so that the data in the data cache 108 corresponds to the data most likely to be used by the CPU 102.
  • the data cache 108 provides access to the cached data when called upon by the CPU 102 during a cache hit (e.g., when the requested data is stored in the data cache 108). If the CPU 102 requests data that is not included in the data cache 108 (e.g., a cache miss), the data cache 108 retrieves the corresponding data from the extended memory 106.
  • the example data cache 108 includes the example L1 cache 110, the example L2 cache 112, and the example L3 cache 114.
  • the levels of the cache may be based on speed and/or size.
  • the example L1 cache 110 may be the fastest cache and smallest, followed by L2 112 (e.g., slower than L1 110 but larger) and L3 114 (e.g., slower than L2 112 but larger).
  • the instruction from the CPU 102 is first sent to the L1 cache 110 and, if the corresponding data is not stored in the L1 cache 110, then the instruction is sent to the L2 cache 112. If the corresponding data is not stored in the L2 cache 112, the instruction is sent to the L3 cache 114. If the corresponding data is not stored in the L3 cache 114, the example data cache 108 accesses the data from the extended memory 106.
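  • For illustration only (not the patent's implementation), the C sketch below models this level-by-level lookup cascade; the structures and the names cached_read and extended_memory_read are assumptions.
```c
#include <stdbool.h>
#include <stdint.h>

enum { NUM_LEVELS = 3 };           /* L1, L2, L3 */
enum { LINES_PER_LEVEL = 64 };

typedef struct { bool valid; uint64_t addr; uint32_t value; } entry_t;
static entry_t cache[NUM_LEVELS][LINES_PER_LEVEL];

/* Stand-in for the off-chip extended memory (e.g., DDR). */
static uint32_t extended_memory_read(uint64_t addr) { return (uint32_t)addr; }

/* Walk the hierarchy: L1, then L2, then L3, then extended memory.
 * On a miss at every level, fetch from extended memory and allocate
 * the value into each level on the way back.                        */
uint32_t cached_read(uint64_t addr)
{
    for (int lvl = 0; lvl < NUM_LEVELS; lvl++) {
        entry_t *e = &cache[lvl][addr % LINES_PER_LEVEL];
        if (e->valid && e->addr == addr)
            return e->value;                       /* hit at this level */
    }
    uint32_t value = extended_memory_read(addr);   /* miss everywhere   */
    for (int lvl = NUM_LEVELS - 1; lvl >= 0; lvl--) {
        entry_t *e = &cache[lvl][addr % LINES_PER_LEVEL];
        *e = (entry_t){ .valid = true, .addr = addr, .value = value };
    }
    return value;
}
```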
  • the CPU interface 202 When the CPU interface 202 obtains instructions corresponding to particular data stored at a particular address, the CPU interface 202 interfaces with the cache controller 220 and the main tag RAM access 204 to determine whether the corresponding data is stored in the main storage 214 and/or the victim storage 218 to perform the transaction. Also, for some types of transactions (e.g., read transactions) the example CPU interface 202 returns corresponding data to the example CPU 102.
  • the main components e.g., the example main tag RAM access 204, the example tag RAM 208, the example main cache store queue 212, the example main storage 214, and the example main cache controller 222
  • the victim components e.g., the example tag RAM access 206, the example tag RAM 210, the example victim cache store queue 216, the example victim storage 218, and the example victim cache controller 224
  • the example main tag RAM access 204 of FIG. 2 is coupled to the tag RAM 208 and the cache controller 220.
  • the victim tag RAM access 206 is coupled to the tag RAM 210 and the cache controller 220.
  • the main tag RAM access 204 accesses the tag RAM 208 to determine whether the data from a memory address corresponding to the instructions from the CPU 102 is present in the main storage 214.
  • the example victim tag RAM access 206 accesses the tag RAM 210 to determine whether the data from a memory address corresponding to the instructions from the CPU 102 is present in the victim storage 218 in parallel with the main tag RAM access 204.
  • the main tag RAM access 204 is implemented in the tag RAM 208 and the victim tag RAM access 206 is implemented in the tag RAM 210.
  • When the main tag RAM access 204 and/or the victim tag RAM access 206 determines that the address(es) corresponding to the instructions from the CPU 102 is/are present in the respective tag RAM 208, 210, the main tag RAM access 204 and/or the victim tag RAM access 206 transmits the results (e.g., the determination and/or any corresponding data) to the example cache controller 220.
  • the example tag RAM 208 of FIG. 2 is coupled to the example cache controller 220 and the example main storage 214.
  • the example tag RAM 208 stores a table that records the entries in the example main storage 214 that correspond to memory addresses in the extended memory 106. In this manner, the example main tag RAM access 204 can review the table to determine if data corresponding to instructions from the CPU 102 is available in the main storage 214.
  • the example tag RAM 210 is coupled to the example cache controller 220 and the example victim storage 218.
  • the example tag RAM 210 stores a table that records the entries in the example victim storage 218. In this manner, the example victim tag RAM access 206 can review the table to determine if data corresponding to instructions from the CPU 102 is available in the victim storage 218.
  • the example victim-side tag RAM 210 may be a content addressable memory (CAM).
  • the victim storage 218 is fully-associative (e.g., any location of the victim storage 218 can be used to store data from any CPU address).
  • the example victim tag RAM 210 compares the provided memory address to all the entries of the tag RAM 210. If there is a match between the provided address and the entries stored in the tag RAM 210, then the address of the corresponding location in the victim storage 218 is output by the tag RAM 210. The address is used to obtain the data from the victim storage 218 that corresponds to the CPU instruction.
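  • As an illustration of such a fully-associative (CAM-style) lookup only, the C sketch below compares a requested tag against every victim tag entry and returns the matching victim-storage location; the names victim_tag and victim_cam_lookup are hypothetical.
```c
#include <stdbool.h>
#include <stdint.h>

#define VICTIM_ENTRIES 16u

/* One tag entry per victim-storage location. */
typedef struct { bool valid; uint64_t tag; } victim_tag_t;
static victim_tag_t victim_tag[VICTIM_ENTRIES];

/* CAM-style lookup: hardware compares the requested tag against every
 * entry in parallel; software models it as a full scan. On a hit, the
 * matching index is the address used to read the victim storage.      */
static bool victim_cam_lookup(uint64_t addr_tag, unsigned *location)
{
    for (unsigned i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim_tag[i].valid && victim_tag[i].tag == addr_tag) {
            *location = i;
            return true;
        }
    }
    return false;   /* victim cache miss */
}
```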
  • While in some examples the victim cache store queue 216 may process read, modify, and/or write operations from the cache controller 220 that were transmitted in response to a retirement point being met (e.g., when one or more cache lines is removed from the L1 cache 110 to the L2 cache 112), in other examples described herein the victim cache store queue 216 may process read, modify, and/or write operations from the cache controller 220 that were transmitted directly from the CPU 102.
  • the example victim cache store queue 216 is further described below.
  • the example cache controller 220 of FIG. 2 is coupled to the components of the L1 cache 110 to control how data is read and/or written in the example storages 214, 218, and/or how data is updated in the example storages 214, 218. For example, when a read request, a write request, an atomic request, a read-modify-write request, etc. is received at the example CPU interface 202, the cache controller 220 obtains the request and instructs the other components accordingly. For example, during a read request for data at a particular location of the extended memory 106, the example cache controller 220 instructs the main tag RAM access 204 to access the tag RAM 208 to determine if the main storage 214 is storing the data corresponding to the location of the extended memory 106 from the read request.
  • the example cache controller 220 may transmit the old victim to the L2 cache 112 via the L2 interface 228 to be stored in the L2 cache.
  • FIGS. 3A-3D illustrate an example circuit implementation of the L1 cache 110 of the example computing system 100 of FIG. 1 .
  • the example implementation of FIGS. 3A-3D includes the example CPU interface 202, the example tag RAMs 208, 210, the example main cache store queue 212, the example main storage 214, the example victim cache store queue 216, the example victim storage 218, and the example cache controller 220 of FIG. 2 .
  • the example CPU interface 202 includes two interfaces (e.g., one scalar and one vector interface, both interfaces having two parts, one for input data from the CPU 102 and one for output data to the CPU 102).
  • the input CPU interface 202 of FIGS. 3A-3D includes an elastic buffer to buffer incoming data from the CPU 102, a multiplexer to select between the buffered data from the elastic buffer (in case there are pending CPU instructions in the elastic buffer) and instructions coming directly from the CPU 102 (in case the elastic buffer queue is empty), and logic that breaks the incoming instructions into the corresponding address, operation (e.g., read, write, etc.), and write data (e.g., if the instructions correspond to a write operation).
  • the output CPU interface 202 of FIGS. 3A-3D transmits data back to the CPU 102.
  • the example main cache store queue 212 of FIGS. 3A-3D includes blocks that correspond to operations of the main cache store queue 212.
  • the main cache store queue 212 includes blocks to implement a read-modify-write operation, write merging, write data forwarding, writing operation, complete parity block write data, weighted histogram operations, load and increment operations, and compare and swap operations.
  • the example main cache store queue 212 is further described below in conjunction with FIG. 4A .
  • the example main cache store queue 212 operates in conjunction with the example main storage 214.
  • the main storage 214 is data RAM (DRAM).
  • Exclusive is when the cache line contains data that is not stored in any other similar-level cache and the data is clean (e.g., matches the data in the extended memory 106).
  • Shared indicates that the cache line contains data that may be stored in other caches and is clean (e.g., the line may be discarded because it is present in another cache).
  • Invalid indicates that the cache line is invalid or unused.
  • the MESI RAM 300 may be called upon when updates to the main storage 214 and/or the extended memory 106 occur.
  • the example MESI RAM 300 for victim cache is implemented in conjunction with the example tag RAM 210.
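  • As a minimal illustration of MESI state tracking (not taken from the patent), the C sketch below models the per-line state field the MESI RAM 300 could store; mesi_state_t, mesi_on_write_hit, and needs_writeback are hypothetical names.
```c
/* The four MESI states tracked per cache line. */
typedef enum {
    MESI_INVALID,    /* line is invalid or unused                      */
    MESI_SHARED,     /* clean; may also be present in other caches     */
    MESI_EXCLUSIVE,  /* clean; present in no other similar-level cache */
    MESI_MODIFIED    /* dirty; must be written back before eviction    */
} mesi_state_t;

/* On a write hit, a clean line becomes Modified so the cache knows it
 * must eventually write the data back to the next level or memory.    */
static mesi_state_t mesi_on_write_hit(mesi_state_t s)
{
    return (s == MESI_INVALID) ? MESI_INVALID : MESI_MODIFIED;
}

/* On eviction, only a Modified line requires a write-back. */
static int needs_writeback(mesi_state_t s)
{
    return s == MESI_MODIFIED;
}
```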
  • the example address processing components 302a-c of FIGS. 3A-3D are connected to the CPU interface 202, the example main storage 214, the example main cache store queue 212 (e.g., via the MUX 318), the example victim storage 218 (e.g., via the example MUX 320) and each other.
  • the example address processing components 302a-c include an example first address processing component 302a, a second address processing component 302b, and a third address processing component 302c.
  • the first address processing component 302a performs address translation
  • the second address processing component 302b performs data rotation
  • the third address processing component 302c facilitates bank organization.
  • the address processing components 302a-c may use a memory address from a CPU operation to determine which banks of the main cache store queue 212, the main storage 214, the victim cache store queue 216, and/or the victim storage 218 (each of which is broken up into multiple banks) would be needed for the given CPU operation.
  • the example bank processing logic 303 is coupled to the CPU interface 202, the example main storage 214, the example main cache store queue 212 (e.g., via the MUX 318), and the example victim storage 218 (e.g., via the example MUX 320).
  • the bank processing logic 303 is configured to analyze read, modify, and/or write instructions from the CPU interface 202. In this manner, the bank processing logic 303 is configured to determine the nature of the read, modify, and/or write instructions to facilitate efficient partial bank read, modify, and/or write instructions.
  • the bank processing logic 303 detects whether incoming write instructions indicate a write of an entire bank or a write of a partial bank. In this manner, the bank processing logic 303 can indicate whether a read-modify-write operation is needed or whether the read instruction can be omitted (e.g., when an entire bank is written). Example operation of the bank processing logic 303 is described below and sketched after this item.
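  • The following C sketch is an illustration under assumed parameters, not the patent's implementation: a write that covers whole banks can be written directly, while a write that starts or ends mid-bank needs a read-modify-write; BANK_BYTES and needs_read_modify_write are hypothetical.
```c
#include <stdbool.h>
#include <stdint.h>

#define BANK_BYTES 8u   /* assumed width of one storage bank */

/* A read-modify-write is needed only when some bank is partially
 * covered: the untouched bytes of that bank must first be read so they
 * can be merged with the new data before the full bank is written back. */
static bool needs_read_modify_write(uint64_t addr, uint32_t num_bytes)
{
    /* If the write starts or ends in the middle of a bank, at least one
     * bank is only partially covered.                                   */
    return (addr % BANK_BYTES) != 0 || ((addr + num_bytes) % BANK_BYTES) != 0;
}
```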
  • the example hit/miss comparison logic 304 of FIGS. 3A-3D is connected to the input CPU interface 202, the tag RAM 208, the main storage 214, the main cache store queue 212, the cache controller 220, and/or the example MUX circuit 314 (e.g., via a data forward latch).
  • the hit/miss comparison logic 304 obtains the address from the tag RAM 208 and an address of the instruction from the CPU 102 and compares the two (e.g., using exclusive nor (XNOR) logic) to determine whether the address from the instruction hit or missed (e.g., the data corresponding to the address is stored in the example DRAM 214 or not).
  • the example hit-miss comparison logic 304 includes TAG compare logic to output the result of the comparison to the example main cache store queue 212, the example cache controller 220, and/or to the example MUX circuit 314.
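  • For illustration only, the C sketch below mimics the tag compare: in hardware the two tags feed a bank of XNOR gates whose outputs are reduced to a single hit signal, which in C reduces to comparing the XOR of the tags against zero; tag_compare is a hypothetical name.
```c
#include <stdbool.h>
#include <stdint.h>

/* All tag bits equal (every XNOR output high) means a hit. */
static bool tag_compare(uint64_t stored_tag, uint64_t request_tag)
{
    return (stored_tag ^ request_tag) == 0;   /* all bits equal -> hit */
}
```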
  • the example hit/miss comparison logic 306 of FIGS. 3A-3D is connected to the input CPU interface 202, the tag RAM 210, the victim cache store queue 216, and/or the example replacement policy component 308.
  • the hit/miss comparison logic 306 obtains the entry number of the victim cache (e.g., location) from the tag RAM 210 and an address from the instruction from the CPU interface 202 and compares the two to determine if the access (e.g., the instruction from the CPU interface 202) is a hit or miss (e.g., the data corresponding to the address is stored in the example victim storage 218 or not).
  • the example hit-miss comparison logic 306 outputs the result to the replacement policy component 308, the address encoder 326, the multiplexer 330, and/or the victim cache store queue 216.
  • the example flush engine 309 (e.g., the flush engine component) is coupled to the replacement policy component 308.
  • the flush engine 309 is used and/or otherwise invoked to flush out write misses stored inside the victim storage 218 at a pre-defined periodicity.
  • a read allocate is when the L1 cache 110 stores the data in the main storage 214, updates the tag RAM 208, etc., to identify that the data for the address is now stored in the main data storage.
  • the L1 cache 110 may return the data to the CPU 102 and/or wait for the CPU 102 to send out a subsequent read request for the same address. If the CPU 102 sends out a subsequent read request for the same address, the tag RAM 208 will identify that the data for the address is now present in the main storage 214, thereby resulting in a read hit. If the CPU 102 does a write to the same address, the tag RAM 208 will identify a write hit because the address is stored in the main storage 214. For a write hit, the CPU 102 will provide data to write, and the L1 cache 110 will write the data into the main storage 214 corresponding to the address.
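  • A hedged C sketch of this read-allocate flow follows (illustrative only; read_allocate, fetch_from_next_level, and the single combined tag/data structure are assumptions): a read miss fetches the line, allocates it into the main storage, and records the tag so a later read or write to the same address hits.
```c
#include <stdbool.h>
#include <stdint.h>

#define MAIN_LINES 256u

typedef struct { bool valid; uint64_t tag; uint32_t data; } main_line_t;
static main_line_t main_storage[MAIN_LINES];   /* models tag RAM + data RAM */

/* Stand-in for the higher level cache / extended memory. */
static uint32_t fetch_from_next_level(uint64_t tag) { return (uint32_t)tag; }

/* Read with allocation: on a miss, the fetched data is installed in the
 * main storage and the tag is recorded, so a subsequent read to the same
 * address is a read hit and a subsequent write is a write hit.           */
static uint32_t read_allocate(uint64_t tag)
{
    main_line_t *line = &main_storage[tag % MAIN_LINES];
    if (!(line->valid && line->tag == tag)) {        /* read miss */
        line->valid = true;
        line->tag   = tag;
        line->data  = fetch_from_next_level(tag);    /* allocate  */
    }
    return line->data;                               /* hit path  */
}
```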
  • the L1 cache 110 can perform a write miss.
  • the L1 cache 110 sends the write miss out to the higher level cache (e.g., L2 cache 112, L3 cache 114, etc.) and/or extended memory 106 to retrieve the data from the memory address, stores the data in the main storage 214, and then writes the data from the CPU 102 in the main storage 214 at a location corresponding to the memory address.
  • the CPU may only write a small number of bytes per write instruction while the interface between the L1 cache and the higher level caches and/or the extended memory is capable of sending a larger number of bytes (e.g., a 64 byte bandwidth). Accordingly, transmitting only a few bytes per cycle on a wide interface is inefficient.
  • the example victim storage 218 is a victim cache and a write miss buffer.
  • this section of the victim storage 218 is called the write miss cache.
  • the write miss cache may be additionally or alternatively implemented in the main storage 214.
  • the write miss cache is 128 bytes (e.g., a cache line). The write miss cache stores all the write miss data until the write miss cache is full and/or there is more than a first threshold number of bytes that can be sent to higher level cache and/or extended memory.
  • the victim storage 218 combines a second threshold amount of the write miss data in the write miss cache into one signal that is sent to the higher level cache (e.g., via the example L2 interface 228) to be written in the address stored in the higher level cache (e.g., the L2 cache 112) and/or the extended memory 106. In this manner most or all of the bandwidth of the interface can be utilized in a particular cycle.
  • the second threshold may be the same as or different than the first threshold.
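  • As an illustrative sketch only (the thresholds and the names write_miss_entry_t, write_miss_merge, and l2_wide_write are assumptions), accumulating small write misses and draining them to the higher level cache once a threshold amount has been gathered might look like this:
```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES      128u   /* assumed size of one write-miss entry      */
#define DRAIN_THRESHOLD  64u   /* assumed: drain once 64 bytes are gathered */

typedef struct {
    bool     in_use;
    uint64_t line_addr;          /* line-aligned address of the entry       */
    uint8_t  data[LINE_BYTES];   /* merged write-miss data                  */
    uint32_t byte_count;         /* bytes gathered (overlap tracking omitted
                                    to keep the sketch simple)              */
} write_miss_entry_t;

static write_miss_entry_t wm;    /* a single entry, for brevity */

/* Stub standing in for a wide write over the L2 interface. */
static void l2_wide_write(uint64_t addr, const uint8_t *data, uint32_t n)
{
    (void)addr; (void)data; (void)n;
}

/* Merge a small CPU write miss into the entry and drain it to the higher
 * level cache once enough bytes have accumulated, so the wide interface
 * is used efficiently. Assumes the write falls within one entry.         */
static void write_miss_merge(uint64_t addr, const uint8_t *bytes, uint32_t n)
{
    uint64_t line = addr & ~(uint64_t)(LINE_BYTES - 1u);
    if (!wm.in_use) {
        wm.in_use = true;
        wm.line_addr = line;
        wm.byte_count = 0;
    }
    for (uint32_t i = 0; i < n; i++)
        wm.data[(addr - line) + i] = bytes[i];
    wm.byte_count += n;
    if (wm.byte_count >= DRAIN_THRESHOLD) {   /* threshold reached: flush */
        l2_wide_write(wm.line_addr, wm.data, LINE_BYTES);
        wm.in_use = false;
    }
}
```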
  • the write data is stored locally in the main storage 214 or the victim storage 218.
  • the cache controller 220 can read and/or write the data to the corresponding address within the write miss cache before it gets transmitted to higher level cache and/or the extended memory 106.
  • the structure of the write miss cache in the victim storage 218 includes a byte enable register file that represents the value bytes (e.g., the bytes to be written) of the write miss information. For example, if a write miss corresponding to writing data for a first byte and a third byte of a memory address is stored in the write miss cache, the victim storage 218 stores the write miss data for the first and third byte in conjunction with the memory address and populates the corresponding entry of byte enable register file with a first value (e.g., '1') for the elements of the entry that correspond to the first and third byte and a second value (e.g., '0') for the remaining elements of the entry.
  • the byte enable bits of the entry are included in the transmission so that the higher level cache knows which data is valid (e.g., which bytes are to be written to) and which data is invalid (e.g., which bytes should not be written to).
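  • Illustrative only (the 64-byte granule, the 64-bit mask, and the names wm_entry_t and set_byte_enables are assumptions): a byte enable register file can be modeled as a bitmask per write-miss entry, with one bit set for each byte that carries valid write data.
```c
#include <stdint.h>

/* Bit i == 1 means byte i of the entry holds valid write-miss data and
 * should be written by the higher level cache; bit i == 0 means the byte
 * must be left untouched.                                                */
typedef struct {
    uint64_t byte_enable;     /* assumed 64-byte granule -> 64 bits */
    uint8_t  data[64];
} wm_entry_t;

/* Record that `len` bytes starting at byte `offset` were written. */
static void set_byte_enables(wm_entry_t *e, unsigned offset, unsigned len)
{
    for (unsigned i = offset; i < offset + len && i < 64; i++)
        e->byte_enable |= (1ull << i);
}

/* Example from the text: a miss writing the first and third byte of an
 * entry ends up with byte_enable == 0b101, so only those bytes apply.  */
static void example(wm_entry_t *e)
{
    set_byte_enables(e, 0, 1);   /* first byte */
    set_byte_enables(e, 2, 1);   /* third byte */
}
```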
  • the memory address of the read request of the first datapath maps to a single location in the main storage 214. If there is already cached data in the single location, the already cached data is evicted from the main storage 214 to the victim storage 218 to the pre-generated location within the victim storage 218. If this pre-generated location is the same location the cache write of the second datapath is a hit on, a conflict occurs. This conflict may be detected by the cache controller 220.
  • address generation for a location within the victim storage 218 occurs before it is known whether the address of the cache request is a hit or a miss; thus, there is an address generated for a second location within the victim storage 218 for the cache write of the second datapath before the determination that the cache write is a hit. Based on the detection of the conflict, this second location within the victim cache may be used to store the data evicted from the main storage 214 by the read miss.
  • a request to obtain the memory address of the read request is issued to a higher level cache or memory and the already cached data is evicted from the main storage 214 to the victim storage 218 to a pre-generated location, here location A, within the victim storage 218.
  • the cache write of the second datapath hits on location A within the victim storage 218 as well, resulting in a set conflict.
  • One possible solution to such a conflict is to load the requested read miss from the higher level cache or memory directly to the victim cache in a separate location.
  • a cache read may be received on the first datapath for the victim storage 218 and a cache write may be received on the second datapath for the victim storage 218.
  • the cache read and cache write proceed in parallel without conflicts.
  • the cache read and cache write also proceed in parallel without conflicts.
  • the cache read may use an address generated for a location within the victim storage 218 for the cache write as described above.
  • both the cache read and the cache write use addresses generated for locations within the victim storage 218.
  • the cache read and the cache write may proceed in parallel without conflicts.
  • the cache read may be a miss for a first address of a set of addresses stored in the victim storage 218.
  • the cache write may be a hit for a second address of the same set of addresses stored in the victim storage 218.
  • the cache read may be stalled until after the cache write of the second datapath completes to the location in the victim storage 218 and is evicted to a higher level cache or memory.
  • the cache read then proceeds to read the set of addresses from the higher level cache or memory into the victim storage 218.
  • the cache read may be a miss for a first address of a set of addresses stored in the victim storage 218.
  • the cache write may also be a miss for a second address of the same set of addresses stored in the victim storage 218. In such a case, the cache read and the cache write may proceed in parallel without conflicts.
  • a cache read may be received on the first datapath for the victim storage 218 and a cache write may be received on the second datapath for the victim storage 218.
  • the cache read may be a hit for an address stored in the victim storage 218.
  • the cache write may also be a hit for the same address stored in the victim storage 218.
  • the cache read may proceed first and the cache write may be stalled until after the cache read completes.
  • the order of the cache write and cache read may be based on the datapath on which the cache write and cache read are received, with the cache operation arriving on a lower (or higher) numbered datapath being completed before the other cache operation.
  • the cache read may be a miss for an address stored in the victim storage 218.
  • the cache write may also be a miss for the same address stored in the victim storage 218.
  • the cache write operation may be forwarded to a higher level cache or memory and then the cache read may obtain the data from the higher level cache or memory after the cache write operation completes for storage into the victim storage 218.
  • a first cache read may be received on the first datapath for the victim storage 218 and a second cache read may be received on the second datapath for the victim storage 218. If the first cache read and the second cache read are for different memory addresses, then there are no conflicts for either hits or misses. In certain cases, the first cache read may be a miss for a first address of a set of addresses. The second cache read may also be a miss for a second address of the same set of addresses. If the first cache read and the second cache read have different priority levels, a higher level cache or memory is accessed based on the higher of the different priority levels. Otherwise, the higher level cache or memory is accessed and the set of memory addresses is obtained for storage in the victim storage 218. The case where the first cache read and the second cache read are for the same address is handled identically.
  • FIG. 4A is an example circuit implementation of the main cache store queue 212 of FIGS. 2 and/or 3.
  • the main cache store queue 212 includes example latches 402a, 402b, 402c, 402d, 402e, example merge circuits 403a-c, an example arithmetic component 404, an example atomic compare component 406, an example read-modify-write merge component 408, an example select multiplexer 410, an example ECC generator 412, an example arbitration manager 414, an example pending store address data store 416, an example priority multiplexer 418, an example read port 424, and an example write port 426.
  • the example merge circuits 403a-c include example comparator(s) 420 and example switches 422.
  • FIG. 4A illustrates a single pipeline of the main cache store queue 212.
  • the main storage element 214 may be arranged to support more than one independent copy of the pipeline with respect to different banks as indicated by the dashed box 400. Accordingly, the pipeline of FIG. 4A may be reproduced multiple times for different banks, as further described below.
  • the example latch 402c is coupled to the latch 402b, the priority multiplexer 418, the arithmetic component 404, the atomic compare component 406, and the read-modify-write merge component 408. This coupling enables the latch 402c to transmit the value obtained from the read, modify, and/or write instruction (e.g., the byte value, the bit value, etc.) to the arithmetic component 404, the atomic compare component 406, and/or the read-modify-write merge component 408 in response to a subsequent clock cycle of the cache controller 220.
  • the example merging circuit 403a is coupled to the latch 402d, the merging circuit 403b, the arithmetic component 404, the atomic compare component 406, and the read-modify-write merge component 408.
  • the example merging circuit 403b is coupled to the merging circuit 403a, the priority multiplexer 418, and the merging circuit 403c.
  • the example merging circuit 403c is coupled to the merging circuit 403b and the latch 402b.
  • the example merging circuits 403a-c facilitate the comparison of read operations in different sections of the main cache store queue 212 to potentially reroute write operations to be merged with write operations corresponding to the same memory address location, as further described below.
  • although the merging circuits 403a-c include three merging circuits, there may be additional merging circuits to merge write operations from other sections of the main cache store queue 212 (e.g., a merging circuit coupling the output of the latch 402d to the output of latch 402b and/or latch 402a, etc.).
  • the merging circuits 403a-c are combined into a single circuit that compares the write operations from the different latches 402b-d and reroutes based on matching memory addresses in any two or more of the different latches 402b-d.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc. Operation of the example arithmetic component 404 is further described below.
  • the atomic compare component 406 is coupled to the latch 402c, the first multiplexer 410, and to the ECC logic 310 to compare data at a memory address to a key and, in the event the data at the memory address matches the key, replace the data.
  • the example atomic compare component 406 of the illustrated example of FIG. 4A is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc. Operation of the example atomic compare component 406 is further described below.
  • the read-modify-write merge component 408 is coupled to the latch 402c, the first multiplexer 410, and to the ECC logic 310 to facilitate the read, modify, and/or write instruction(s) sent by the cache controller 220.
  • the read-modify-write merge component 408 is coupled to the ECC logic 310 to obtain the currently stored word that is to be affected by the read, modify, and/or write instruction(s).
  • the read-modify-write merge component 408 is configured to update the currently stored word obtained from the ECC logic 310 with the new bit(s), byte(s), etc., obtained from the latch 402c. Additional description of the read-modify-write merge component 408 is described below.
  • the example read-modify-write merge component 408 of the illustrated example of FIG. 4A is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc.
  • the example first multiplexer 410 is coupled to the arithmetic component 404, the atomic compare component 406, and the read-modify-write merge component 408 to transmit, based on an indication from the cache controller 220, the output of either the arithmetic component 404, the atomic compare component 406, or the read-modify-write merge component 408 to the latch 402d.
  • the cache controller 220 indicates to perform a write function (e.g., the cache controller transmits a write request to the latch 402b)
  • an indication is sent by the cache controller 220 to the first multiplexer 410 to select the input connected to the read-modify-write merge component 408 to be transmitted to the latch 402d.
  • the ECC generator 412 is coupled to the latch 402d and to the merging circuit 403a to facilitate error detection and correction in the value (e.g., byte(s), bit(s), etc.) stored in the latch 402d.
  • the ECC generator 412 is configured to regenerate the ECC value (e.g., generate the error detection code) which will be stored with the data (e.g., the merged word output from the read-modify-write merge component 408).
  • the ECC value is used by the error detection and correction circuit to determine whether the error occurred during a read and/or write operation, as further described above.
  • the example ECC generator 412 of the illustrated example of FIG. 4A is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc.
  • the example arbitration manager 414 is coupled to the latch 402a, the latch 402b, the pending store address datastore 416, and the main storage 214 to facilitate the read, modify, and/or write instructions obtained from the cache controller 220.
  • the arbitration manager 414 is configured to transmit a read instruction of the corresponding currently stored word to the main storage 214.
  • the arbitration manager 414 is coupled to the main storage 214 to arbitrate between conflicting accesses of the main storage 214. When multiple operations attempt to access the main storage 214 in the same cycle, the arbitration manager 414 may select which operation(s) are permitted to access the main storage 214 according to a priority scheme.
  • the arbitration prioritizes read operations over write operations because write data that is in the main cache store queue 212 is available for use by subsequent operations even before it is written to the main storage 214.
  • as the main cache store queue 212 fills with write data that has not yet been written back, the priority of the write operations may increase until they are prioritized over competing read operations.
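A minimal C sketch of such an arbitration policy, assuming a simple fill-level check; the function and parameter names are illustrative and not taken from the described implementation.

    #include <stdbool.h>

    enum access { GRANT_READ, GRANT_WRITE };

    /* Reads win by default because queued write data can still be forwarded to
     * later operations; once the store queue is nearly full, pending writes are
     * drained first so the queue does not back-pressure the CPU. */
    static enum access arbitrate(bool read_pending, bool write_pending,
                                 unsigned queue_fill, unsigned queue_depth)
    {
        bool queue_nearly_full = (queue_fill + 1 >= queue_depth);

        if (read_pending && write_pending)
            return queue_nearly_full ? GRANT_WRITE : GRANT_READ;
        return write_pending ? GRANT_WRITE : GRANT_READ;
    }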
  • the example arbitration manager 414 of the illustrated example of FIG. 4A is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc.
  • the pending store address data store 416 is configured to store the address of the read, modify, and/or write instruction obtained from the cache controller 220. In this manner, the pending store address datastore 416 maintains a log of the addresses associated with each value stored in any of the latches 402a, 402b, 402c, 402d, 402e, and/or the merging circuits 403a, 403b, and/or 403c.
  • the example pending store address datastore 416 of the illustrated example of FIG. 4A may be implemented by any device for storing data such as, for example, flash memory, magnetic media, optical media, etc.
  • the data stored in the pending store address datastore 416 may be in any data format such as, for example, binary data, comma delimited data, tab delimited data, structured query language (SQL) structures, etc.
  • the example priority multiplexer 418 is coupled to the latch 402b, the latch 402c, the latch 402d, and the merging circuit 403a to facilitate read operations in the event any of the latch 402b, the latch 402c, the latch 402d, or the merging circuit 403a is storing a value corresponding to a write instruction.
  • the cache controller may initiate the following four write instructions regarding a four byte word having addresses A3, A2, A1, and A0: write address A0 with the byte 0x11, write address A1 with the byte 0x22, write address A3 with the byte 0x23, and write address A0 with the byte 0x44.
  • the priority multiplexer 418 is configured to obtain the byte value 0x11 stored in the merging circuit 403a, the byte value 0x22 stored in the latch 402d, the byte value 0x23 stored in the latch 402c, and the byte value 0x44 stored in the latch 402b. Also, the pending store address data store 416 transmits an instruction to the priority multiplexer 418 indicating which address value is associated with the byte value stored in the latch 402b, the latch 402c, the latch 402d, and the merging circuit 403a.
  • the priority multiplexer 418 is configured to transmit a packet to the latch 402e indicating that address A0 is 0x44 (e.g., the most recent write instruction associated with the address A0), address A1 is 0x22, and address A3 is 0x23.
  • the MUX circuit 314 is configured to update the value of the currently stored word with the byte values obtained from the priority multiplexer 418. Such an operation ensures that a read instruction transmitted by the main cache store queue 212 properly indicates the correct word, even though the write instructions may not have fully propagated through the main cache store queue 212.
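The latest-value-wins behavior described above can be illustrated with a small C sketch that walks the pending stores from oldest to newest. The numeric values 0xA0, 0xA1, and 0xA3 simply stand in for the symbolic addresses A0, A1, and A3, and the function name forward_byte is illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /* A pending (not yet written back) store: one byte at one address, ordered
     * oldest to newest, mirroring the store queue stages. */
    struct pending { uint32_t addr; uint8_t value; };

    /* Return the newest pending value for addr, or the byte already held in
     * storage (fallback) if no pending store matches. */
    static uint8_t forward_byte(const struct pending *q, int n,
                                uint32_t addr, uint8_t fallback)
    {
        uint8_t v = fallback;
        for (int i = 0; i < n; i++)       /* later entries override earlier ones */
            if (q[i].addr == addr)
                v = q[i].value;
        return v;
    }

    int main(void)
    {
        /* The four writes from the example: A0=0x11, A1=0x22, A3=0x23, A0=0x44. */
        struct pending q[] = {
            { 0xA0, 0x11 }, { 0xA1, 0x22 }, { 0xA3, 0x23 }, { 0xA0, 0x44 },
        };
        printf("A0=%02X A1=%02X A3=%02X\n",
               forward_byte(q, 4, 0xA0, 0x00),   /* 0x44: most recent write wins */
               forward_byte(q, 4, 0xA1, 0x00),   /* 0x22 */
               forward_byte(q, 4, 0xA3, 0x00));  /* 0x23 */
        return 0;
    }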
  • An example read path (e.g., the read input to the tag RAM 208) may run in parallel with the main cache store queue 212. Because a read operation (e.g., a read instruction) may refer to data in a write operation (e.g., a write instruction) that may not have completed yet, the main cache store queue 212 may include write forwarding functionality that allows the read path to obtain data from the main cache store queue 212 that has not yet been written back to the main storage 214.
  • the main cache store queue 212 includes a read-invalidate functionality that forwards in-flight data (e.g., data of the store queue 212 not yet stored in the main storage element 214) to the victim storage element 218 and/or the L2 cache 112 and invalidates the in-flight data remaining in the store queue 212.
  • the example read port 424 is coupled to the read path and the data store 416.
  • the read port 424 may be implemented by an interface that interfaces with the main cache controller 222 whenever a read-miss occurs.
  • the read port 424 is utilized to receive victim addresses and read-invalidate commands from the main cache controller 222.
  • the read port 424 is to send the victim addresses to the data store 416 to be compared against the pending addresses stored in the data store 416.
  • the ECC generator 412 operates on word granularity. Accordingly, the ECC generator 412 calculates the ECC syndrome for a block of data.
  • the block of data may be four bytes (e.g., a word).
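As a simplified stand-in for word-granularity ECC generation (the actual code used by the described hardware is not specified here), the following C sketch computes Hamming-style check bits over one 32-bit word; XOR-ing the coded positions of the set data bits yields the per-check-bit parities.

    #include <stdint.h>

    /* Compute single-error-correcting Hamming check bits for a 32-bit word.
     * Data bits occupy the non-power-of-two positions of a code word;
     * check bit i is the parity of all positions whose bit i is set. */
    static uint8_t ecc_checkbits(uint32_t word)
    {
        uint8_t  check = 0;
        unsigned pos   = 1;                    /* 1-based code-word position */
        for (unsigned d = 0; d < 32; d++) {
            while ((pos & (pos - 1)) == 0)     /* skip power-of-two (check) slots */
                pos++;
            if ((word >> d) & 1)
                check ^= (uint8_t)pos;         /* accumulate parity per check bit */
            pos++;
        }
        return check;                          /* 6 check bits for 32 data bits */
    }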
  • the main cache store queue 212 processes the write instruction as follows: at a first example cycle (e.g., to complete the first write request of replacing address A3 with the byte 0x33), because the ECC generator 412 operates on word granularity (e.g., a 4-byte or 32-bit word), the cache controller 220 initiates a read request of the currently stored byte in address A3 of the currently stored word.
  • the byte and address in the first write request (e.g., 0x33 and A3) are stored in the latch 402b.
  • the cache controller 220 transmits a read request of the entire currently stored word to the main storage 214.
  • a read request of the entire currently stored word is transmitted to the main storage 214 and the byte 0x33 is stored in the first latch 402b.
  • the byte from the first write request is transmitted to the latch 402c, the entire currently stored word is transmitted from the main storage 214 to the ECC logic 310, and the second write request (e.g., to replace address A1 with the byte 0x22) is transmitted by the cache controller 220 to be stored in the latch 402b.
  • the read-modify-write merge component 408 obtains the byte stored in the latch 402c and the entire currently stored word transmitted by the ECC logic 310. In this manner, the read-modify-write merge component 408 identifies the address of the byte in the currently stored word to be updated.
  • the read-modify-write merge component 408 identifies and/or otherwise obtains (a) the value (e.g., byte value, bit value, etc.) of the portion of the currently stored word to be updated from the latch 402c and (b) the currently stored word from the ECC logic 310.
  • the read-modify-write merge component 408 writes (e.g., replaces, merges, etc.) the portion of the currently stored word with the value of the portion of the currently stored word obtained from the latch 402c.
  • the read-modify-write merge component 408 writes the value of the portion of the word to an address value corresponding to the portion of the word in the word.
  • Such an example written portion output by the read-modify-write merge component 408 may be referred to herein as the merged word.
  • such a merged word is provided by the read-modify-write merge component 1108 for writing to the victim storage 218.
  • the select multiplexer 410 transmits the merged word from the read-modify-write merge component 408 to be stored in the latch 402d.
  • the ECC generator 412 obtains the merged word from the latch 402d and generates the corresponding ECC syndrome bits.
  • the ECC generator 412 transmits the merged word through the merging circuits 403a, 403b, and 403c to be handled by the arbitration manager 414 to be stored in the main storage 214.
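A compact C sketch of the read-modify-write merge step described above: the stored word is read, only the enabled byte lanes are replaced with the incoming write data, and the merged word is then passed on for ECC regeneration and write-back. The byte-enable encoding (bit i selects byte lane i) and the function name rmw_merge are assumptions for illustration.

    #include <stdint.h>

    /* Merge incoming bytes into the currently stored 32-bit word; only byte
     * lanes whose enable bit is set are overwritten. */
    static uint32_t rmw_merge(uint32_t stored_word, uint32_t write_data,
                              unsigned byte_enable /* bits 0..3 */)
    {
        uint32_t merged = stored_word;
        for (unsigned lane = 0; lane < 4; lane++) {
            if (byte_enable & (1u << lane)) {
                uint32_t mask = 0xFFu << (8 * lane);
                merged = (merged & ~mask) | (write_data & mask);
            }
        }
        return merged;   /* e.g., writing 0x33 into one lane leaves the others intact */
    }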
  • the example address line 462 (RD_ADDR) is coupled to the cache controller 220 to obtain an example read instruction from the CPU 102.
  • such an example address line 462 obtains the read instruction for the main cache store queue 212.
  • the main cache store queue 212 could forward any data from in-flight write transactions when executing the CPU 102 read instruction.
  • the four addresses of the address stores 464a may include one or more valid bytes (e.g., bytes that are a logic high).
  • the main cache store queue 212 is 64 bits (e.g., 8 bytes) and, thus, the main cache store queue 212 may perform a write on any number of bytes, such as, from one to eight.
  • bits set to a logic high (e.g., 1) in any of the output lines 468 indicate that the corresponding byte of the corresponding address of the address stores 464a has valid data to be forwarded. For example, if the first output line of the output lines 468 includes dram_hit_dp0[0], the byte value for the corresponding read instruction may be obtained from any of the addresses in the address store 464a.
  • the example of FIG. 4B includes example select logic 470 and example select lines 472.
  • the select lines 472 include eight, 2-byte outputs. Accordingly, there is one select signal of the select signals for each of the corresponding byte enables of the byten stores 464b.
  • the example select logic 470 selects the most recent data.
  • Such an output of the select logic 470 (e.g., the select lines 472) controls example multiplexers 474a-474h (multiplexers 474e-h not shown), respectively.
  • the multiplexers 474a-h include 8 1-byte input terminals.
  • any of the address line 462, the address stores 464a, the byten stores 464b, the data stores 464c, the compare logic 466, the output lines 468, the select logic 470, the select lines 472, and/or the multiplexers 474a-h may be implemented in the victim cache store queue 216.
  • the output terminals of the multiplexers 474a-h are coupled to an example cache multiplexer 476.
  • the cache multiplexer 476 is also coupled to similar multiplexers implemented in this manner in association with the victim cache store queue 216.
  • the cache multiplexer 476 obtains a select signal from the cache controller (e.g., the main cache controller 222 or the victim cache controller 224) that transmitted the read instruction. In this manner, the cache multiplexer 476 facilitates data forwarding to the CPU 102.
  • any of the address line 462, the address stores 464a, the byten stores 464b, the data stores 464c, the compare logic 466, the output lines 468, the select logic 470, the select lines 472, and/or the multiplexers 474a-h may be implemented by the example write data forwarding component of the main cache store queue 212, and any of the address line 462, the address stores 464a, the byten stores 464b, the data stores 464c, the compare logic 466, the output lines 468, the select logic 470, the select lines 472, and/or the multiplexers 474a-h, as implemented in association with the victim storage 218, may be implemented by the example write data forwarding component of the victim cache store queue 216.
  • the topology of FIG. 4B may correspond to the example pending store address data store 416 and the example priority multiplexer 418.
  • the address stores 464a, the byten stores 464b, and/or the data stores 464c may be implemented by the example pending store address data store 416.
  • any of the address line 462, the compare logic 466, the output lines 468, the select logic 470, the select lines 472, and/or the multiplexers 474a-h may be implemented by the example priority multiplexer 418.
  • the topology of FIG. 4B is utilized for each bank of the main storage 214 and the victim storage 218. For example, if the main storage 214 has 8 banks, the topology of FIG. 4B would be replicated 8 times, one for each bank.
  • Example methods, apparatus, systems, and articles of manufacture to facilitate fully pipelined read-modify-write support in level 1 data cache using store queue and data forwarding are described herein. Further examples and combinations thereof include the following:
  • such a write instruction may be transmitted with a corresponding read instruction, regardless of the size of the write instruction, in an attempt to execute a full read-modify-write cycle of such a write instruction.
  • a write instruction may be obtained from a CPU indicating to write 128 bits across two 64-bit memory banks, starting at address A0 of the first memory bank.
  • such an application maintains a read instruction to read the data currently stored in the two example memory banks.
  • such an approach is inefficient as twice the processing power (e.g., a write and a read instruction) is used.
  • such an approach does not provide any control logic and/or processing circuitry to analyze the write instruction.
  • the write instruction can be implemented without initiating a read instruction.
  • the bank processing logic 303 may detect that such a write of the entirety of multiple banks is to be performed and, thus, indicate to the cache controller 220 to initiate the read-modify-write operation, forgoing transmission of the read instruction.
  • the cache controller 220 may transmit a write instruction to write 130 bits of data (or any write instruction indicating to write to a subset of the memory banks).
  • 64 bits of data may be written to a first bank
  • 64 bits of data may be written to a second bank
  • 2 bits of data may be written to a third bank of the main storage (e.g., a write instruction indicating to write a 130 bit word starting with the first address of the first bank and ending with the second address of the third bank).
  • the bank processing logic 303 detects that all addresses of the first bank and the second bank of the main storage 214 are to be written entirely and, thus, indicates to the cache controller to initiate the read-modify-write operations for the first bank and the second bank of the main storage, forgoing transmission of the read instruction.
  • the bank processing logic 303 may detect (e.g., determine) that a subset of the memory banks of the main storage 214 (e.g., the third bank of the memory storage) is to be partially written (e.g., two addresses of the 64 addresses are to be written), and, thus, indicate to the cache controller 220 to initiate a full read-modify-write operation of the third bank of the main storage 214.
  • the bank processing logic 303 determines whether to cause a read operation to be performed (e.g., whether to initiate a full read-modify-write operation) in response to the write operation based on whether a number of addresses in the subset of the plurality of memory banks to write satisfies a threshold.
  • the threshold is not satisfied when the number of addresses in the subset of the plurality of memory banks is greater than 0 and/or less than the number of addresses in the memory bank.
  • the bank processing logic 303 generates an indication to the CPU 102 to execute the write instruction as a full read-modify-write transaction.
  • the threshold is satisfied when the number of addresses in the subset of the plurality of memory banks is equal to the number of addresses in the memory bank.
  • the bank processing logic 303 generates an indication to the CPU 102 to execute the write instruction as a partial read-modify-write transaction (e.g., forgoing the read). An example read-modify-write operation is described above.
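The per-bank decision can be sketched in C as follows, working in bytes rather than bits for brevity: a bank that the write fully covers needs no read, while a partially covered bank triggers a read-modify-write for that bank only. The names bank_needs_read and BANK_BYTES are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define BANK_BYTES 8u   /* 64-bit banks, as in the described main storage */

    /* Decide whether the bank starting at bank_base must be read before the
     * write [write_start, write_start + write_len) can be applied. */
    static bool bank_needs_read(uint64_t write_start, uint64_t write_len,
                                uint64_t bank_base)
    {
        uint64_t write_end = write_start + write_len;   /* exclusive */
        uint64_t bank_end  = bank_base + BANK_BYTES;    /* exclusive */

        bool touches_bank = (write_start < bank_end) && (write_end > bank_base);
        bool covers_bank  = (write_start <= bank_base) && (write_end >= bank_end);

        return touches_bank && !covers_bank;   /* partial coverage => read first */
    }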
  • Example methods, apparatus, systems, and articles of manufacture to reduce read-modify-write cycles for non-aligned writes are described herein. Further examples and combinations thereof include the following:
  • the example main cache store queue 212 stores a number of write operations at different sections of the main cache store queue 212 (e.g., at the example latches 402a-e). For example, when the CPU 102 transmits three separate write operations in a row, the first write operation that the CPU 102 provided is stored at the first latch 402b and moved to the second latch 402c when the second operation is received at the first latch 402b.
  • the first latch 402b will store and/or output the last write operation with respect to time (e.g., which is last to be stored in the main storage 214), the second latch 402c will have the second write operation (e.g., which is second to be stored in the main storage 214), and the third latch 402d will have the first write operation (e.g., which was the first to be stored in the example main storage 214).
  • the example arbitration manager 414 reserves a cycle for the data to be written into the example main storage 214. Accordingly, during the reserved cycle, the main storage 214 may not be available to perform read operations.
  • the data operations stored in two or more of the latches 402b, 402c, 402d correspond to the same memory address
  • the data can be merged in order to write the data into the memory address of the main storage 214 once, instead of two or three times. For example, if the write operation stored in the latch 402d corresponds to writing a byte of the memory address and the write operation stored in the latch 402c corresponds to writing the same byte to the memory address, the second write will overwrite the first write.
  • the main cache store queue 212 merges the two writes into one write, so that only one cycle is used to write the second transaction (e.g., to avoid reserving a cycle for the first write).
  • Such an aggressive merge reduces the number of cycles reserved for write operations. In this manner, the main storage 214 will have extra cycles to perform read operations, thereby decreasing the latency of the overall system.
  • the output of the example latches 402b-402d are coupled to the example merging circuits 403a-403c.
  • the output of the third latch 402d may be coupled to the merging circuit 403a
  • the output of the second latch 402c may be coupled to the merging circuit 403b
  • the output of the first latch 402b may couple to the merging circuit 403c.
  • the output of the merging circuit 403a may additionally be coupled to the output of the second latch 402c and the merging circuit 403b
  • the merging circuit 403b may be coupled to the merging circuit 403c
  • the merging circuit 403c may be coupled to the input of the first latch 402b.
  • the example merging circuits 403a-c include example comparator(s) 420 and example switches 422.
  • the comparator(s) 420 compare the memory address locations for each write operation that is stored in the respective latches 402b-402d to determine whether any of the write operations in the example store queue correspond to the same memory address.
  • the example comparator 420 may be one comparator to compare all the write operations of the latches 402b-402d or may be separate comparators 420 to compare two of the latches 402b-d (e.g., a first comparator to compare the memory address of latch 402b to the memory address of latch 402c, a second comparator to compare the memory address of latch 402b to the memory address of latch 402d, etc.).
  • the example switch(es) 422 reroute the write operations in the example latches 402b-402d based on the comparison. For example, if the memory address of the write operation stored in the example latch 402d is the same as the memory address stored in the latch 402c, the example switch(es) 422 enable and/or disable in order to reroute the output of the latch 402d to the latch 402c, instead of routing to the example arbitration manager 414. In this manner, the two write operations are combined and written into the main storage 214 in a subsequent cycle as a single write operation instead of two write operations.
  • the switch(es) 422 may be electrical switches, transistors (e.g., MOSFETS), demultiplexers, and/or any other component that can reroute a signal in a circuit.
  • the example merging circuit 403a merges the two write operations to keep the write data stored in latch 402c (e.g., the write to byte0 and byte2) and include the write data from latch 402d that doesn't overlap (e.g., byte2). In this example, the merging circuit 403a discards the write data of byte 0 from the latch 402d as part of the merging operation because the data to be written at byte 0 from the latch 402d will be overwritten by the write instructions of the latch 402c.
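A minimal C sketch of this merge, assuming each queue entry carries write data plus a per-byte enable mask (the structure and function names here are illustrative): the newer entry's bytes win, and only the older entry's non-overlapping bytes are carried over, so a single arbitration cycle and a single storage write cover both stores.

    #include <stdint.h>

    #define ENTRY_BYTES 16u   /* assumed width of one store-queue entry */

    struct sq_entry {
        uint64_t addr;                 /* target memory address            */
        uint8_t  data[ENTRY_BYTES];    /* write data                       */
        uint16_t byten;                /* bit i set => byte i is valid     */
    };

    /* Merge an older entry into a newer one that targets the same address. */
    static void sq_merge(struct sq_entry *newer, const struct sq_entry *older)
    {
        uint16_t carry = older->byten & (uint16_t)~newer->byten;  /* non-overlapping bytes */
        for (unsigned i = 0; i < ENTRY_BYTES; i++)
            if (carry & (1u << i))
                newer->data[i] = older->data[i];   /* keep older bytes not overwritten */
        newer->byten |= carry;                     /* overlapping older bytes are dropped */
    }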
  • FIG. 4D illustrates a hardware implementation of the merging circuit 403a of FIG. 4A and/or 1103a of FIG. 11A (e.g., to merge data from the latch 402d to the latch 402c). Similar hardware setups can be implemented to merge data between any two latches.
  • the example of FIG. 4D includes the example latches (e.g., stores) 402b-402d and hardware components for the comparator 420 and the switch 422 of FIG. 4A .
  • the example comparator circuit 420 includes comparators and/or logic gates 480a-480f and the switch circuit 422 includes an OR gate 482 and a MUX 484.
  • the latches 1102a-d, the example comparator 1120, and the example switch 1122 of FIG. 11A could be used.
  • the example latch 402d outputs the stored data to the example storage (e.g., the main storage 214 or the victim storage 218 via the arbitration manager 414, 1114), which locks its bank for a first cycle.
  • FIG. 4D illustrates the write-merge locking old data in the bank of the storage when the old data is preceded by another store/latch including new write data to the same address.
  • although FIG. 4D illustrates a merge between the example latch 402d and the example latch 402c, a similar structure may be used to merge data between any of the latches 402a-402d.
  • the latch 402c can merge its data with the data at latch 402b.
  • the data at three or more latches may be merged into a single latch if the data at the three or more latches correspond to the same address.
  • the data at the particular latch is marked invalid (e.g., by setting a bit to a value corresponding to invalid) or discarded so that arbitration is not performed for that data to be locked in a bank in the storage.
  • Example 5 includes the apparatus of example 4, wherein the part of the first memory operation are bytes that the second memory operation is to write to.
  • Example 6 includes the apparatus of example 4, wherein the part is a first part, the store queue operable to merge the first memory operation and the second memory operation by maintaining a second part of the first memory operation.
  • Example 7 includes the apparatus of example 6, wherein the second part of the first memory operation are bytes that the second memory operation is not to write to.
  • Example 8 includes the apparatus of example 1, wherein the first cache storage is a main cache storage and the second cache storage is a victim cache storage.
  • Example 9 includes a system comprising a central processing unit coupled in parallel to a first cache storage and a second cache storage, a store queue coupled to at least one of the first cache storage and the second cache storage and operable to process a first memory operation from the central processing unit, the first memory operation for storing the first set of data in at least one of the first cache storage and the second cache storage, before storing the first set of data in the at least one of the first cache storage and the second cache storage, merge the first memory operation and a second memory operation corresponding to a same memory address.
  • Example 10 includes the system of example 9, wherein the first memory operation specifies a first set of data, the second memory operation specifies a second set of data, and the store queue is operable to before storing the first set of data in the at least one of the first cache storage and the second cache storage, merge the first set of data and the second set of data to produce a third set of data, and provide the third set of data for storing in at least one of the first cache storage and the second cache storage.
  • Example 11 includes the apparatus of example 10, further including a store queue to store the third set of data in the at least one of the first cache storage or the second cache storage in one cycle.
  • Example 14 includes the system of example 12, wherein the part is a first part, the store queue operable to merge the first memory operation and the second memory operation by maintaining a second part of the first memory operation.
  • Example 15 includes the system of example 14, wherein the second part of the first memory operation are bytes that the second memory operation is not to write to.
  • Example 16 includes the system of example 9, wherein the first cache storage is a main cache storage and the second cache storage is a victim cache storage.
  • Example 19 includes the method of example 18, further including storing the third set of data in the at least one of the first cache storage or the second cache storage in one cycle.
  • Atomic operations are further examples of multi-part memory operations.
  • an atomic compare and swap operation manipulates a value stored at a memory location based on the results of a comparison of the existing value stored at the memory location.
  • the CPU 102 may want to replace the data stored in the L1 cache 110 with a new value if the existing value stored in the L1 cache 110 matches a specific value.
  • in some example systems, when a CPU wanted to perform an atomic operation, the CPU sent a read operation to a memory address, performed the manipulation on the read data, and then executed a write operation to the same memory address to store the manipulated data. Also, in such example systems, the L1 cache paused, rejected, blocked, and/or halted any transactions from other devices (e.g., other cores of the CPU, higher level cache, the extended memory, etc.) until the atomic operation was complete (e.g., to avoid manipulation of the memory address corresponding to the atomic operation during the atomic operation). Accordingly, such example techniques required significant effort on behalf of the CPU and many reserved cycles, which increased latency.
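For illustration, the compare-and-swap behavior that the cache carries out near the storage can be sketched in C as follows; in the described system the comparison and conditional write happen while the line is held by the store queue, so no separate CPU read/write pair and no global blocking are needed. The function name and signature are assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    /* Compare the stored word against a key and replace it only on a match;
     * the old value is returned so the CPU can observe the outcome. */
    static uint32_t atomic_compare_swap(uint32_t *stored, uint32_t key,
                                        uint32_t new_value, bool *swapped)
    {
        uint32_t old = *stored;          /* read of the currently stored word  */
        *swapped = (old == key);
        if (*swapped)
            *stored = new_value;         /* conditional write back on a match  */
        return old;
    }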
  • FIG. 4C illustrates an example circuit diagram of parts of the main cache store queue 212 of FIG. 4A and/or parts of the victim cache store queue 216 of FIG. 11A .
  • FIG. 4C illustrates a detailed circuit diagram of the arithmetic unit 404, 1104.
  • the arithmetic unit 404 may be used for other types of memory transactions such as histogram operations.
  • a histogram operation retrieves a value stored in memory that may represent a bin of a histogram; the histogram operation then modifies the value before storing it back to the same memory address or an alternative address.
  • a first data set contains the values [0, 0, 2, 0, 0, 3]
  • a second data set contains bins representing the number of occurrences of respective values within the first data set.
  • the CPU reads each value in the first data set and increments the second data set for each value.
  • to count the occurrences of one value (e.g., 0s) in a 10-byte data set, the CPU may perform 10 reads. Then, to determine how many 1s are in the same data set, the CPU will perform an additional 10 reads.
  • accordingly, the total number of read operations is (N)(M), where N is the size of the section of memory (e.g., 10 bytes) being read and M is the number of values that could be stored in each byte.
  • the L1 SRAM may have to block, pause, halt, discard, etc. all other read and/or write operations until the histogram operation is complete.
  • the arithmetic unit 404 may be used to perform the same operation with a single transaction from the CPU.
  • Component 451 of FIG. 4C selects a bin read out of the ECC component 310 for the bank illustrated in FIG. 4C .
  • Component 452 selects the weight to be added to the bin from the vector of weights provided by the CPU 102.
  • Cnt_value is the sum of the bin value from component 451 and the weight provided by the CPU 102.
  • Component 453, component 454 and component 458 are used as part of the saturation circuit.
  • Component 453 receives the histogram size (byte, halfword, or word) and the count value (the sum of the outputs of components 451, 452) and determines if a signed bin will saturate.
  • the CPU 102 instructs the main storage 214 to perform the histogram operation, thereby changing the number of cycles that the CPU 102 has to reserve for the operation from (N)(M) to 1. Also, because the atomic operation protocol is already implemented in the store queue, the histogram operation can be performed using the arithmetic component 404 by performing N reads for the N size of the memory and incrementing a count for each value in the example main cache store queue 212, thereby reducing the number of read operations from (N)(M) operations to N operations.
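The effect of performing the histogram near the storage can be sketched in C as a single pass over the addressed section: each byte is read once and the matching bin is bumped by the weight supplied for that element, replacing the (N)(M) per-value read loops discussed above. The function name histogram_op and the per-element weight vector layout are illustrative assumptions.

    #include <stdint.h>
    #include <stddef.h>

    /* One pass over n_bytes of data: N reads total, one weighted bin update per byte. */
    static void histogram_op(const uint8_t *section, size_t n_bytes,
                             const int32_t *weights, int32_t *bins, size_t n_bins)
    {
        for (size_t i = 0; i < n_bytes; i++) {
            uint8_t value = section[i];          /* bin index comes from the data value */
            if (value < n_bins)
                bins[value] += weights[i];       /* weight selected for this element    */
        }
    }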
  • the CPU 102 transmits a histogram operation corresponding to a section (e.g., a SRAM line) of the main storage 214
  • the histogram operation is stored in the example latch 402a while the tag RAM 208 verifies whether the memory address corresponding to the histogram operation is available in the main storage 214.
  • the example cache controller 220 facilitates the read operation for each byte of the section identified in the histogram operation (e.g., where histogram bins are accessed in parallel by reading up to 128 Bytes at the same time).
  • the tag RAM 208 instructs the main storage 214 to output the data at a first byte of the section of the main storage 214 while the histogram operation is output by the example latch 402a to the example latch 402b.
  • the example main storage 214 outputs the data that has been read from the memory address to the example latch 322a
  • the latch 402b outputs the histogram operation to the example latch 402c.
  • after the ECC logic 310 performs the error detection and correction functionality, the data read at the byte is sent to the example arithmetic component 404.
  • the L1 cache 110 supports functionality where a histogram bin can saturate after the histogram bin includes more than a threshold limit of the bin size (e.g., a byte, a halfword, a word, etc.).
  • Table 1 illustrates an example of saturation values. Using this functionality, the histogram bin values will not roll over once they reach the maximum value.
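A small C sketch of the saturation behavior, assuming unsigned bins (the described hardware also handles signed bins, which is omitted here): once a bin of the configured size reaches its maximum value, further additions clamp instead of rolling over.

    #include <stdint.h>

    /* Saturating add for a bin of bin_bytes bytes (1 = byte, 2 = halfword, 4 = word). */
    static uint32_t saturating_add(uint32_t bin, uint32_t weight, unsigned bin_bytes)
    {
        uint64_t max = (bin_bytes >= 4) ? 0xFFFFFFFFull
                                        : ((1ull << (8 * bin_bytes)) - 1);
        uint64_t sum = (uint64_t)bin + weight;
        return (uint32_t)(sum > max ? max : sum);   /* clamp at the bin's maximum */
    }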
  • Example methods, apparatus, systems, and articles of manufacture to facilitate optimized atomic and histogram operations are described herein. Further examples and combinations thereof include the following:
  • Example 1 includes a system comprising a cache storage coupled to an arithmetic component, and a cache controller coupled to the cache storage, wherein the cache controller is operable to receive a memory operation that specifies a set of data, retrieve the set of data from the cache storage, utilize the arithmetic component to determine a set of counts of respective values in the set of data, generate a vector representing the set of counts, and provide the vector.
  • Example 3 includes the system of example 1, wherein the cache controller is operable to provide the vector to a processor.
  • Example 6 includes the system of example 1, wherein the arithmetic component is to obtain (a) the set of data from the cache storage via an error detection and correction circuit and (b) the memory operation from a central processing unit via a latch.
  • Example 7 includes the system of example 1, wherein the cache controller is operable to provide the vector to a central processing unit based on a single instruction from the central processing unit at a single cycle.
  • Example 8 includes a system comprising a cache storage, and a cache controller coupled to the cache storage and an arithmetic component, wherein the cache controller is operable to receive a memory operation specifying a first set of data and an arithmetic operation, retrieve the first set of data from the cache storage, utilize the arithmetic component to perform the arithmetic operation on the first set of data to produce a second set of data, and provide the second set of data.
  • Example 9 includes the system of example 8, wherein the cache controller is operable to provide the second set of data for storing in the cache storage.
  • Example 10 includes the system of example 8, wherein the cache controller is operable to provide the second set of data to a processor.
  • Example 11 includes the system of example 8, further including a store queue coupled to the cache controller, the store queue including the arithmetic component.
  • Example 12 includes the system of example 8, wherein the cache storage is at least one of a main cache storage or a victim cache storage.
  • Example 13 includes the system of example 8, wherein the arithmetic component is to obtain (a) the first set of data from the cache storage via an error detection and correction circuit and (b) the memory operation from a central processing unit via a latch.
  • Example 14 includes the system of example 8, wherein the cache controller is operable to provide the second set of data to a central processing unit based on a single instruction from the central processing unit at a single cycle.
  • Example 15 includes a method comprising obtaining a memory operation that specifies a set of data, obtaining the set of data from a cache storage, determining a set of counts of respective values in the set of data, generating a vector representing the set of counts, and providing the vector.
  • Example 16 includes the method of example 15, wherein the vector is provided to the cache storage.
  • Example 17 includes the method of example 15, wherein the vector is provided to a processor.
  • Example 14 includes the storage queue of example 13, further including an arbitration manager to, if the first data matches the key, store the first set of data at the memory address after the exclusive state request has been granted from the other cache.
  • the read port 424 of the store queue 212 obtains the read-invalidate operation and obtains the address of the victim.
  • the read port 424 sends the address of the victim to the data store 416 to be compared to all of the addresses stored in the latches 402a-d. If the data store 416 determines any of the addresses stored in the latches 402a-d match the address of the victim, the data store 416 outputs an operation to the priority multiplexer 418 to send the data corresponding to the victim address to the latch 402e.
  • the latch 402e forwards the data to the MUX circuit 314 to send to the victim storage element 218 and/or the L2 cache 112.
  • the example scalar interface 502 is an interface coupling the L1 cache 110 of the data cache 108 of FIG. 1 to the example processing core 104a.
  • the scalar interface 502 is an interface corresponding to a first data path (DPO) in the dual data path victim cache system.
  • the vector interface 504 is an interface corresponding to a second data path (DP1) in the dual data path cache system.
  • the example scalar interface 502 is a 64-bit wide bidirectional and/or unidirectional interface.
  • the example scalar interface 502 may support a different quantity of bits (e.g., 32 bits, 128 bits, etc.).
  • the vector interface 504 sends data from the victim storage 218 to the core 104a.
  • the example vector interface 504 is coupled to the example tag RAM 210, the snoop address component 506, and comparison logic 306b to compare an address from the CPU 102 to addresses from the tag RAM 210.
  • the scalar interface 502 and the vector interface 504 are implemented by the CPU interface 202 ( FIG. 2 ).
  • the scalar interface 502 and the vector interface 504 can be included in the CPU interface 202.
  • the example replacement policy component 308 is coupled to the comparison logic 306a, 306b.
  • the example replacement policy component 308 is control/decision making logic.
  • the example replacement policy component 308 dictates the entries (e.g., the data) of the example victim storage 218 based on a plurality of inputs. For example, the replacement policy component 308 can determine whether the cache controller 220 ( FIG. 2 ) is to remove and/or enter entries to/from the victim storage 218.
  • the control logic of the replacement policy component 308 is configured to resolve address conflicts between the 2 addresses (e.g., scalar and vector) in such a way that data-consistency is maintained.
  • FIG. 6 illustrates the control logic of the example replacement policy component 308.
  • the example victim storage 218 is a fully associative cache.
  • the fully associative victim storage 218 can place data, when data is fetched (e.g., victimized) from the main storage 214, in any unused block of the cache.
  • the placement of the data in the victim storage 218 is based on the replacement policy component 308.
  • the replacement policy component 308 can determine when and where a line of data from the main storage 214 should be placed in the victim storage 218.
  • the victim storage 218 outputs a response.
  • the example snoop address component 506 is implemented by a snoop data path and/or otherwise interface.
  • the L1 cache 110 includes the snoop data path to add coherency to the L1 cache 110.
  • the example snoop address component 506 is connected to the tag RAM 210 and comparison logic 306c.
  • the snoop address component 506 obtains an example snoop request address issued by a higher-level data cache (e.g., the L2 data cache 112) that issues an address read to the tag RAM 210.
  • the snoop address component 506 attempts to read a memory address from the tag RAM 210.
  • the multi-bank structure of the victim cache store queue 216, the victim storage 218, and/or, more generally, the first encapsulated data cache system 700 can service read and write operations that are sent to the banks in parallel.
  • each bank arbitrates its own processes in response to the read and/or write operations.
  • operation of the first encapsulated data cache system 700 is more efficient since an entire cache line is not locked up when a request is received. Rather, only the portion of the cache line allocated to the bank that received such a request would be locked.
  • the victim cache multi-bank structure 720 is a data or memory structure that includes 16 example banks (Banks 0-15) 722, with each of the banks 722 having a data width of 64 bits (e.g., bytes 0-7).
  • the victim cache multi-bank structure 720 is independently addressable by bank. For example, the first row of the rows 724 has a starting row address of 0 and an ending row address of 127, a second row of the rows 724 has a starting row address of 128 and an ending row address of 255, etc.
  • a cache line can be 128 bytes of data that fits in a width of memory (e.g., DRAM) or storage unit (e.g., the main storage 214, the victim storage 218, etc.).
  • a cache line can consume an entire row of the victim cache bank structure 720.
  • a cache line can use one of the rows 724 of 16 banks, where each bank is 8 bytes wide.
  • the victim cache bank structure 720 can enable 16 different cache lines to access data stored therein.
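The bank/row decomposition implied by this layout (16 banks of 8 bytes, 128 bytes per row) can be sketched in C as a simple split of a byte address; the structure and function names are illustrative.

    #include <stdint.h>

    struct bank_slot { uint32_t row; uint32_t bank; uint32_t offset; };

    /* Split a byte address into row, bank, and byte offset for a structure with
     * 16 banks that are each 8 bytes wide (128 bytes per row). */
    static struct bank_slot map_address(uint32_t byte_addr)
    {
        struct bank_slot s;
        s.offset = byte_addr & 0x7;          /* bytes 0-7 within an 8-byte bank */
        s.bank   = (byte_addr >> 3) & 0xF;   /* banks 0-15                      */
        s.row    = byte_addr >> 7;           /* one 128-byte cache line per row */
        return s;
    }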
  • FIG. 8A illustrates a schematic illustration of example victim cache tag (VCT) random access memory (RAM) 800.
  • the VCT RAM 800 can be an example implementation of the tag ram 210 of FIG. 2 .
  • the VCT RAM 800 can store addresses of data stored in the victim cache store queue 216, the victim storage 218, etc., of FIG. 2 .
  • the VCT RAM 800 is a multi-bank VCT RAM.
  • the VCT RAM 800 can include a plurality of banks (e.g., data banks, memory banks, etc.), such as 16 banks, although the VCT RAM 800 can have a different quantity of banks.
  • the VCT RAM 800 includes example allocation ports 802, 804, 806 including a first example allocation port (AP0) 802, a second example allocation port (AP1) 804, and a third example allocation port (AP2) 806.
  • the VCT RAM 800 includes example read ports 808, 810, 812 including a first example read port (RP0) 808, a second example read port (RP1) 810, and a third example read port (RP2) 812.
  • the VCT RAM 800 includes an example LRU read port 814.
  • the VCT RAM 800 includes example output ports 816, 818, 820, 822 including a first example output port (OP0) 816, a second example output port (OP1) 818, a third example output port (OP2) 820, and a fourth example output port (OP3) 822.
  • the VCT RAM 800 may have fewer or more allocation ports, read ports, LRU read ports, and/or output ports than depicted in FIG. 8A.
  • the control signal can be generated to inform the VCT RAM 800 that the CPU interface 202 has a cache line to be moved from the CPU interface 202 to the victim storage 218 and, thus, the CPU interface 202 has an address to be moved from the CPU interface 202 to the tag ram 210.
  • the first data 824 includes VTAG_WR_TAG0 data, which can be representative of an address (e.g., a tag address) of the VCT RAM 800 that can correspond to an address of data to be stored in the victim cache 218.
  • the first data 824 includes VTAG_WR_SET0 data, which can be representative of the address of the victim cache 218 of where to store the data (e.g., the victim cache tag for DP0).
  • the second allocation port 804 is configured to receive second example data 826.
  • the second allocation port 804 can receive the second data 826 from the write state machine associated with the vector data path (DP1).
  • the second data 826 includes WRM_TAG_UPDATE1 data, which can be representative of a control signal generated from the CPU interface 202 of FIG. 2 (e.g., the vector data path (DP1)).
  • the control signal can be generated to inform the VCT RAM 800 that the CPU interface 202 has a cache line to be moved from the CPU interface 202 to the victim storage 218 and, thus, the CPU interface 202 has an address to be moved from the CPU interface 202 to the tag ram 210.
  • the third allocation port 806 is configured to receive third example data 828.
  • the third data 828 includes ARB_EVT_TAG_UPDATE data, which can be representative of a control signal generated from the main storage 214.
  • the control signal is an arbitration (ARB) evict (EVT) tag update control signal, which can be generated to inform the VCT RAM 800 that the main storage 214 has a cache line to be moved from the main storage 214 to the victim storage 218 and, thus, the main storage 214 has an address to be moved from the tag ram 208 to the tag ram 210.
  • the third data 828 includes ADP_EVT_WR_TAG data, which can be representative of an address (e.g., a tag address) of the VCT RAM 800 that can correspond to an address of data to be stored in the victim cache 218.
  • the third data 828 includes ADP_EVT_WR_SET data, which can be representative of the address of the victim cache 218 of where to store the data (e.g., the victim cache tag for the line moved from the main cache to the victim cache).
  • ADP_EVT_WR_TAG and ADP_EVT_WR_SET data can be referred to as address datapath (ADP) data.
  • the first data 824, the second data 826, and/or the third data 828 can be one or more data packets, one or more signals based on a communication protocol (e.g., an inter-integrated circuit (I2C) protocol), etc.
  • the second read port 810 is configured to receive fifth example data 832.
  • the second read port 810 can receive the fifth data 832 from the vector interface 504 of the CPU 102.
  • the fifth data 832 includes ADP_ADDR_E2_DP1 data, which can be representative of an address of the victim storage 218 that the vector interface 504 requests access to.
  • the LRU read port 814 is configured to receive seventh example data 836.
  • the LRU read port 814 can receive the seventh data 836 from the replacement policy component 308 of FIGS. 3A-3D .
  • the seventh data 836 includes LRU_SET_DP0 and LRU_SET_DP1, which can be representative of respective addresses associated with the least recently used (LRU) cache lines of the victim storage 218.
  • the LRU read port 814 can be a victim least recently used (VLRU) read port configured to receive LRU data from the replacement policy component 308.
  • the VCT RAM 800 includes the output ports 816, 818, 820, 822 to transmit outputs to external hardware (e.g., the CPU 102, the main storage 214, etc.) in response to a read request or a write request (e.g., an allocation request) associated with the victim storage 218.
  • the first output port 816 is configured to transmit first example output data 838.
  • the first output port 816 can transmit the first output data 838 to the scalar interface 502.
  • the first output data 838 includes VTAG_HIT_DP0 data, which can indicate that data requested by the scalar interface 502 is stored in the victim storage 218.
  • the first output data 838 includes VTAG_MISS_DP0 data, which can indicate that the data requested by the scalar interface 502 is not stored in the victim storage 218.
  • the first output data 838 includes VTAG_SET_DP0 data, which can be representative of the address in the victim storage 218 where the data requested by the scalar interface 502 is stored.
  • the VCT RAM 800 is victim cache tag storage configured to store addresses (e.g., tag addresses) that correspond to the sets 846.
  • Each of the sets 846 is coupled to a respective one of first example comparators 850 and a respective one of second example comparators 852.
  • the first comparators 850 can be an example implementation of the comparison logic 306a of FIGS. 3 and/or 5.
  • the second comparators 852 can be an example implementation of the comparison logic 306b of FIGS. 3 and/or 5.
  • the first comparators 850 and the second comparators 852 are coupled to respective example address encoder logic circuits 854, 856 including a first example address encoder logic circuit 854 and a second example address encoder logic circuit 856.
  • the first comparators 850 are coupled to the first address encoder logic circuit 854.
  • the second comparators 852 are coupled to the second address encoder logic circuit 856.
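  • As a rough, non-limiting illustration of the parallel tag comparison described above, the following Python sketch compares a requested address against all 16 victim cache sets at once and produces a per-set hit vector for each data path; the names (e.g., tag_lookup, vct_tags) are assumptions for illustration and do not appear in the example hardware.

```python
# Illustrative sketch (not the example hardware) of the parallel tag
# compare: each of the 16 victim cache sets is compared against both
# requested addresses in the same cycle, yielding a one-hot hit vector
# per data path (HIT_DP0 / HIT_DP1 style outputs).

NUM_SETS = 16

def tag_lookup(vct_tags, valid, addr_dp0, addr_dp1):
    hit_dp0 = [1 if valid[s] and vct_tags[s] == addr_dp0 else 0
               for s in range(NUM_SETS)]
    hit_dp1 = [1 if valid[s] and vct_tags[s] == addr_dp1 else 0
               for s in range(NUM_SETS)]
    return hit_dp0, hit_dp1

# Set 3 holds tag 0x1A40; DP0 requests that tag (hit), DP1 misses.
tags, valid = [0] * NUM_SETS, [False] * NUM_SETS
tags[3], valid[3] = 0x1A40, True
print(tag_lookup(tags, valid, 0x1A40, 0x2B00))
```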
  • the first multiplexer 858A has a first input to receive ADP_ADDR_E2_DP0, which is representative of an address requested by the DP0 interface from the E2 Arbitration stage of FIGS. 3A-3D .
  • the first multiplexer 858A has a second input to receive SNP_ADDR_E2_DP0, which is representative of a snoop address requested by the snoop interface from the E2 Arbitration stage of FIGS. 3A-3D .
  • the first multiplexer 858A has a select input to receive SNP_ADDR_EN_DP0, which is representative of an enable signal from the snoop interface that, when asserted, can invoke the first multiplexer 858A to select the second input.
  • Outputs of the first multiplexer 858A are coupled to a first input of the third comparator 870A.
  • An output of an example DP0 read finite-state machine (FSM) (READ_FSM_DP0) 873 and/or an output of an example DP0 write finite-state machine (WRITE_FSM_DP0) 874 is coupled to a second input of the third comparator 870A.
  • the DP0 read finite-state machine 873 and the DP0 write finite-state machine 874 are hardware implemented finite-state machines that execute logic on data from the scalar interface 502 of FIG. 5 .
  • the first decoder 860A is a 4x16 decoder.
  • the first decoder 860A has an input to receive VTAG_WR_SET0 data, which can be representative of an in-flight address from the scalar interface 502 to the victim storage 218.
  • the first decoder 860A has an output coupled to an input of the first inverter 862A.
  • the first decoder 860A can convert the in-flight address to a bit vector where each bit is inverted by one of the 16 instances of the first inverter 862A.
  • An output of the first inverter 862A is coupled to a first input of the first AND gate 864A.
  • a second input of the first AND gate 864A is coupled to the result bit of the tag comparison from the first comparators 850 with set 0 (e.g., VCT_ADDR[0]) and the output of the first multiplexer 858A.
  • the second input of the first AND gate 864A can be configured to receive HIT_DP0 data, which can be representative of a 16-bit vector, where each of the bits can correspond to whether the ADP_ADDR_E2_DP0 data is a hit (e.g., a bit value of 1) or a miss (e.g., a bit value of 0) in the victim storage 218.
  • the second decoder 860B is a 4x16 decoder.
  • the second decoder 860B has an input to receive VTAG_WR_SET1 data, which can be representative of an in-flight address from the vector interface 504 to the victim storage 218.
  • the second decoder 860B has an output coupled to an input of the second inverter 862B.
  • the second decoder 860B can convert the in-flight address to a bit vector where each bit is inverted by one of the 16 instances of the second inverter 862B.
  • An output of the second inverter 862B is coupled to a first input of the second AND gate 864B.
  • a second input of the second AND gate 864B is coupled to the result bit of the tag comparison from the second comparators 852 with set 0 (e.g., VCT_ADDR[0]) and ADP_ADDR_E2_DP1.
  • the second input of the second AND gate 864B can be configured to receive HIT_DP1 data, which can be representative of a 16-bit vector, where each of the bits can correspond to whether the ADP_ADDR_E2_DP1 data is a hit (e.g., a bit value of 1) or a miss (e.g., a bit value of 0) in the victim storage 218.
  • An output of the second AND gate 864B is coupled to a first input of the second OR gate 866B.
  • An output of the fifth comparator 870B is coupled to a second input of the second OR gate 866B.
  • An output of the sixth comparator 872B is coupled to a third input of the second OR gate 866B.
  • An output of the second OR gate 866B is coupled to an input of the second encoder 868B.
  • the second encoder 868B is a 16x4 encoder.
  • the second encoder 868B can generate HIT_ADDR1 data, which can be representative of VTAG_SET_DP1 of FIG. 8A .
  • HIT_ADDR1 can correspond to the second output data 840 of FIG. 8A .
  • the first AND gate 864A can assert a logic one in response to VTAG_WR_SET0 not matching the address in HIT_DP0 and, thus, does not convert the cache hit to a cache miss. In other examples, the first AND gate 864A can output a logic zero in response to VTAG_WR_SET0 matching the address in HIT_DP0 and, thus, converts the cache hit to a cache miss because the address requested in ADP_ADDR_E2_DP0 has been overwritten and is no longer available at that address.
  • the fourth comparator 872A and the sixth comparator 872B can be configured to convert a cache miss to a cache hit.
  • the fourth comparator 872A can determine that the first read address (ADP_ADDR_E2_DP0) in the VCT RAM 800 requested during the E2 pipeline stage is getting written in the E3 pipeline stage by the vector interface 504, which is represented by VTAG_WR_TAG1.
  • the fourth comparator 872A can assert a logic one in response to ADP_ADDR_E2_DP0 matching VTAG_WR_TAG1 and, thus, convert the cache miss to a cache hit and HIT_ADDR0 can be updated with VTAG_WR_SET1 because the data will be available when the ADP_ADDR_E2_DP0 address is read during the E3 pipeline stage.
  • the first OR gate 866A and the second OR gate 866B can be configured to generate an output to a corresponding one of the first encoder 868A or the second encoder 868B.
  • the first OR gate 866A can transmit a 16-bit vector representative of a cache miss (e.g., 16 bit values of 0) or a cache hit (e.g., a 16-bit value of an address of the cache hit).
  • the first encoder 868A can encode the 16-bit value from the first OR gate 866A as a 4-bit address and, thus, generate HIT_ADDR0.
  • Such example operations can be applicable to the second OR gate 866B, the second encoder 868B, and/or, more generally, the second address encoder logic circuit 856.
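  • As a rough software model of the address encoder logic circuits 854, 856 described above (with invented names such as encode_hit_addr), the following Python sketch squashes a hit whose set is being overwritten by an in-flight write, promotes a miss to a hit when the requested tag is currently being written by the other data path, and encodes the surviving 16-bit one-hot vector into a 4-bit set address.

```python
# Illustrative model (not the patented circuit) of the address encoder
# logic: an in-flight write to the same set converts a hit into a miss,
# an in-flight write of the requested tag by the other data path converts
# a miss into a hit, and the surviving one-hot vector is encoded 16x4.

NUM_SETS = 16

def encode_hit_addr(hit_vec, req_tag, wr_set_same_dp,
                    wr_tag_other_dp, wr_set_other_dp):
    vec = list(hit_vec)                      # do not mutate the caller's vector

    # Decoder + inverter + AND gate: clear the bit for the set that the
    # same data path is overwriting in the E3 stage (hit becomes miss).
    if wr_set_same_dp is not None:
        vec[wr_set_same_dp] = 0

    # Comparator: if the other data path is writing the requested tag,
    # treat the access as a hit at that in-flight set (miss becomes hit).
    if wr_tag_other_dp is not None and wr_tag_other_dp == req_tag:
        vec[wr_set_other_dp] = 1

    # OR gate + 16x4 encoder: collapse the one-hot vector to a set address.
    hit = any(vec)
    set_addr = vec.index(1) if hit else None
    return hit, set_addr

# DP0 hit set 5, but set 5 is being overwritten by DP0's own in-flight
# write; meanwhile DP1 is writing tag 0x77 (the requested tag) into set 9,
# so the access resolves to a hit at set 9.
vec = [0] * NUM_SETS
vec[5] = 1
print(encode_hit_addr(vec, 0x77, wr_set_same_dp=5,
                      wr_tag_other_dp=0x77, wr_set_other_dp=9))   # (True, 9)
```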
  • Example methods, apparatus, systems, and articles of manufacture for multi-banked victim cache with dual datapath are described herein. Further examples and combinations thereof include the following:
  • Data cache architectures including a victim cache system enable the main cache (e.g., the main storage 214) to allocate data to a victim cache (e.g., the victim storage 218) when the main cache needs to create a victim.
  • the main cache may be a direct mapped cache such that the read-miss can only be stored in one location, indicated by the address of the read-miss.
  • the main cache may allocate data of the read-miss location to be moved to the victim cache when the data is dirty and evict data of the read-miss location to be sent out to higher level memory locations when the data of the location is clean.
  • An allocation policy of the main storage may instruct the main cache controller to elect to victimize a modified line because the data for the memory address is not located in higher level cache or is located in higher level cache but is outdated. Such an allocation policy may instruct the main cache controller to not allocate/victimize a clean and/or shared line in the main storage because that line includes data at the memory address that is already located in the higher level cache (e.g., L2 cache, L3 cache, extended memory, etc.).
  • such an allocation policy creates latency (e.g., increases the time it would take for the CPU to retrieve the requested data) when only allocating dirty and/or modified lines in the L1 cache 110.
  • the latency is a result of using extra clock cycles to retrieve from higher level memory. For example, due to the parallel connection of the main storage 214 and the victim storage 218, retrieving data from the higher level memories takes more time than retrieving data from the victim storage 218.
  • the main cache controller 222 determines if the instruction is a read operation and if the read operation is a miss (e.g., determined based on the main tag RAM access 204 results). If the read operation is a miss, the main cache controller 222 determines that the main storage 214 needs to allocate the line, way, block, slot, etc. of data for allocation in the victim storage 218.
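  • The allocation decision described above can be approximated in software as follows; this is a minimal sketch under the assumption of a direct-mapped main cache with a dirty bit per line, and names such as handle_read_miss are illustrative only.

```python
# Minimal sketch of the read-miss allocation decision for a direct-mapped
# main cache backed by a victim cache; MainCacheLine and handle_read_miss
# are illustrative names and not taken from the example hardware.

from dataclasses import dataclass

@dataclass
class MainCacheLine:
    valid: bool = False
    dirty: bool = False
    tag: int = 0
    data: bytes = b""

def handle_read_miss(main_cache, victim_cache, index, new_tag, new_data):
    line = main_cache[index]
    if line.valid and line.dirty:
        # Dirty victim: allocate it into the victim cache so the only
        # up-to-date copy of the data is not lost.
        victim_cache.append((line.tag, line.data))
    # Clean victims are simply dropped: the higher level memory already
    # holds an up-to-date copy, so no write-back is required.
    main_cache[index] = MainCacheLine(True, False, new_tag, new_data)

main = [MainCacheLine() for _ in range(128)]
victims = []
main[7] = MainCacheLine(True, True, 0x40, b"old")
handle_read_miss(main, victims, 7, 0x90, b"new")
print(victims)   # the dirty line (tag 0x40) was moved to the victim cache
```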
  • Example methods, apparatus, systems, and articles of manufacture for allocation of data are described herein. Further examples and combinations thereof include the following:
  • the snoop address component 506 is utilized to store the MESI state of every cache line in the victim storage 218 in the MESI RAM 300. By storing the MESI state of every cache line in the MESI RAM 300, the victim cache system supports coherency.
  • the corresponding store queue (e.g., the victim cache store queue 216) may be processing a write instruction to the address that is being read via the snoop address. Accordingly, while the victim storage 218 is servicing a snoop request (e.g., while the snoop request is being processed in response to the snoop address component 506 obtaining the snoop request), the victim cache store queue 216 forwards the data from the victim cache store queue 216 (e.g., the data stored in latch 402e) to the response multiplexer 508. In this manner, any state change obtained by the vector interface 504 due to the snoop address and any recently updated address obtained from the victim cache store queue 216 is forwarded to the higher-level data cache (e.g., the L2 data cache 112).
  • the coherency pipeline is longer than the victim cache pipeline to provide enough time for the victim cache controller 224 to properly order a potential snoop response and/or subsequent CPU 102 operation in the event such a snoop response and/or subsequent CPU 102 operation is issued to a higher level memory controller.
  • the victim storage 218 of the L1 data cache 110 is capable of issuing tag-updates to a higher level cache controller in the event that tracking of cache lines is requested. In this manner, the victim storage 218 can facilitate tracking of cache lines to distinguish between exclusive and modified cache elements.
  • any of the above-mentioned operations and/or elements may be implemented on any of the L2 data cache 112, the L3 data cache 114, and/or any additional level data cache in the data cache 108.
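  • The snoop handling described above can be summarized by the following Python sketch, in which a snoop that hits the victim storage returns the newest data, forwarded from the victim cache store queue when a write to the snooped address is still in flight; the function and field names are assumptions for illustration.

```python
# Illustrative sketch of servicing a snoop request against the victim
# storage while the victim cache store queue may still hold a newer,
# in-flight write to the same address; all names are assumptions.

def snoop_response(victim_storage, mesi_ram, store_queue, snoop_addr):
    if snoop_addr not in victim_storage:
        return None                       # snoop miss: nothing to return
    data = victim_storage[snoop_addr]
    # If the store queue holds a pending write to the snooped address,
    # forward that newer data instead (the response multiplexer path).
    for pending_addr, pending_data in reversed(store_queue):
        if pending_addr == snoop_addr:
            data = pending_data
            break
    state = mesi_ram.get(snoop_addr, "I")  # MESI state kept per cache line
    return data, state

storage = {0x1000: b"stale"}
mesi = {0x1000: "M"}
stq = [(0x1000, b"fresh")]                 # write still in flight
print(snoop_response(storage, mesi, stq, 0x1000))   # (b'fresh', 'M')
```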
  • Example methods, apparatus, systems, and articles of manufacture to facilitate read-modify-write support in a coherent victim cache with parallel data paths are described herein. Further examples and combinations thereof include the following:
  • the main cache (e.g., the main storage 214) victimizes (e.g., allocates) cache lines to the victim cache (e.g., the victim storage 218) when the main cache needs to store new data.
  • the replacement policy (e.g., the replacement policy component 308) determines where the victim can be stored in the victim cache (e.g., the victim storage 218).
  • the victim cache is full and thus needs to evict data to the higher level cache memories (e.g., L2 cache 112, L3 cache 114, extended memory 106).
  • the victim cache (e.g., the victim storage 218) also evicts data to the higher level cache memories when a write-miss occurs.
  • the victim storage 218 includes a write-miss buffer that buffers write-miss data.
  • the replacement policy may utilize fixed schemes to determine what data to evict from the victim cache, such as a First In First Out (FIFO) scheme, a Random scheme, or a Least Recently Used (LRU) scheme. However, such eviction schemes are not configured to efficiently manage the eviction of data from the victim cache when there are two or more data paths.
  • the replacement policy component 308 speculatively locks (e.g., reserves) first and second victim cache lines (e.g., sets) that are specifically for eviction.
  • the first and second victim cache lines may be locked for specific data paths (e.g., first victim cache line locked for DP0 and second victim cache line locked for DP1).
  • the eviction logic implemented by the replacement policy component 308 is described in further detail below in connection with FIG. 6 .
  • the example replacement policy component 308 disregards the transaction of the first data path DP0 in the example third data path scenario 610 because the transaction is invalid.
  • DP0 Hit Way indicates a portion of the victim storage 218 and/or the write miss cache that should be accessed (e.g., read from, evicted, written to, etc.) by the first data path DP0 when the instruction is a hit.
  • DP1 Hit Way is a portion in the victim storage 218 and/or the write miss cache that should be accessed by the second data path DP1 when the instruction is a hit.
  • the variable 'Y' indicates the location of the current way selected as the LRU and, thus, where the first data path DP0 should remove data from.
  • Y is assigned to the DP0 pointer. For example, when DP0 needs to evict a portion in the victim storage 218, then DP0 pointer points to the location Y (e.g., the LRU way) for eviction.
  • the replacement policy component 308 is to store an indicator of the LRU way of the victim storage 218 to be replaced by DP0.
  • the replacement policy component 308 keeps an indicator, that can be accessed by the cache controller 220, that a particular way, not recently accessed, is available for eviction by the first data path DP0.
  • the terms “pointer” and “indicator” may be used interchangeably.
  • the first data path DP0 is to evict data from a portion (e.g., way) in the victim storage 218. Therefore, the DP0 pointer points to the location Y in the victim storage 218 that is to be evicted.
  • the LRU value and the next LRU value are incremented based on which location was evicted. For example, if DP0 evicted data from location Y+1 (e.g., the DP1 Hit Way matches the location of the DP0 pointer), the LRU value is incremented twice and the next LRU value is incremented twice. Otherwise, if DP0 evicted data from location Y (e.g., DP1 Hit Way did not match the location of DP0 pointer) the LRU value is incremented once and the next LRU value is incremented once.
  • both data paths include valid transactions (e.g., indicated in first data path scenario 606), the first data path DP0 is a hit, and the second data path DP1 is a miss (e.g., the hit-miss action 616).
  • the first comparison logic 306a returns a "hit" result to the replacement policy component 308 and the second comparison logic 306b returns a "miss” result to the replacement policy component 308.
  • the DP0 Hit Way points to the way in the victim storage 218 that includes the hit/matching data.
  • the miss causes the second data path DP1 to evict a way to make room in the victim storage 218. Therefore, the DP1 pointer points to location Y+1 in the victim storage 218 that is to be evicted.
  • the replacement policy component 308 determines if the DP0 Hit Way matches the address of the next LRU way (e.g., location Y+1). If the replacement policy component 308 determines the DP0 Hit Way matches the address of the next LRU way (e.g., Y+1), the DP1 pointer points to the location of the DP0 pointer (e.g., location Y) so that the DP1 can evict data without conflicting with DP0 Hit Way. If the DP0 Hit Way does not match the address of the next LRU way, then the DP1 evicts data from location Y+1.
  • the LRU value and the next LRU value are incremented based on which location was evicted. For example, if DP1 evicted data from location Y (e.g., the DP0 Hit Way matches the location of the DP1 pointer), the LRU value is incremented once and the next LRU value is incremented once. Otherwise, if DP1 evicted data from location Y+1 (e.g., DP0 Hit Way did not match the location of DP1 pointer) the LRU value is incremented twice and the next LRU value is incremented twice.
  • both data paths include valid transactions (e.g., indicated in first data path scenario 606) and both data paths are flagged as misses (e.g., column 618).
  • the comparison logic 306a and 306b returns "miss" results to the replacement policy component 308 when both addresses in the data paths DP0 and DP1 are not found and/or matched with the addresses in the tag RAMs 208, 210.
  • both data paths DP0 and DP1 are to evict ways in the victim storage 218. Therefore, DP0 pointer points to location Y and DP1 pointer points to location Y+1.
  • the LRU value is incremented by two (e.g., Y+2) and the next LRU value is incremented by two (e.g., (Y+1) +2).
  • DP0 and DP1 are misses, DP0 Way points to the new LRU value (e.g., Y+2) and DP1 Way points to the next LRU value (e.g., (Y+1)+2).
  • the first data path DP0 is a valid transaction and the second data path DP1 is an invalid transaction (e.g., indicated in second data path scenario 608).
  • the first data path DP0 is a hit (e.g., indicated in the DP0 hit action 620).
  • the comparison logic 306a returns a "hit" result to the replacement policy component 308.
  • the DP0 Hit Way points to the way in the victim storage 218 that includes the matching data.
  • the LRU value (Y) remains the same because no data is to be evicted in the clock cycle.
  • the first data path DP0 is a valid transaction and the second data path DP1 is an invalid transaction (e.g., indicated in second data path scenario 608).
  • the first data path DP0 is a miss (e.g., indicated in the DP0 miss action 622).
  • the comparison logic 306a returns a "miss" result to the replacement policy component 308.
  • the first data path DP0 is to evict data from the victim storage 218.
  • the example DP0 pointer points to the location Y (e.g., the LRU way). After eviction, the LRU value is incremented (e.g., Y+1).
  • the first data path DP0 is an invalid transaction and the second data path DP1 is a valid transaction (e.g., indicated in third data path scenario 610).
  • the second data path DP1 is a hit (e.g., indicated in the DP1 hit action 624).
  • the comparison logic 306b returns a "hit" result to the replacement policy component 308.
  • the DP1 Hit Way points to the way in the victim storage 218 that includes the matching data.
  • the LRU value (Y) remains the same because no data is to be evicted in the clock cycle.
  • the first data path DP0 is an invalid transaction and the second data path DP1 is a valid transaction (e.g., indicated in third data path scenario 610).
  • the second data path DP1 is a miss (e.g., indicated in the DP1 miss action 626).
  • the comparison logic 306b returns a "miss" result to the replacement policy component 308.
  • the second data path DP1 is to evict data from the victim storage 218.
  • the DP1 pointer points to the location Y (e.g., the LRU way).
  • the DP1 pointer does not point to location Y+1 because of the invalid transaction of DP0.
  • otherwise, when DP0 includes a valid transaction, DP1 points to Y+1 (e.g., unless switched when the DP0 Hit Way matches Y+1).
  • the LRU value is incremented (e.g., Y+1).
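  • The decision behavior described above can be condensed into the following Python sketch of the pointer logic, where Y is the LRU way and Y+1 the next LRU way; the function choose_evictions and its arguments are illustrative names, and the sketch only models the increment rules summarized above.

```python
# Condensed software model of the dual data path eviction choices: Y is
# the LRU way, Y+1 the next LRU way, and a missing data path dodges the
# way that the other data path hit. Names are illustrative only.

def choose_evictions(lru, num_ways,
                     dp0_valid, dp0_hit, hit_way_dp0,
                     dp1_valid, dp1_hit, hit_way_dp1):
    """Return (dp0_evict_way, dp1_evict_way, new_lru)."""
    y, y1 = lru % num_ways, (lru + 1) % num_ways
    dp0_evict = dp1_evict = None
    inc = 0
    if dp0_valid and not dp0_hit and dp1_valid and not dp1_hit:
        dp0_evict, dp1_evict, inc = y, y1, 2          # miss/miss
    elif dp0_valid and not dp0_hit:                   # DP0 misses
        if dp1_valid and dp1_hit and hit_way_dp1 == y:
            dp0_evict, inc = y1, 2                    # dodge DP1's hit way
        else:
            dp0_evict, inc = y, 1
    elif dp1_valid and not dp1_hit:                   # DP1 misses
        if dp0_valid and dp0_hit and hit_way_dp0 == y1:
            dp1_evict, inc = y, 1                     # dodge DP0's hit way
        elif dp0_valid:
            dp1_evict, inc = y1, 2
        else:
            dp1_evict, inc = y, 1                     # DP0 invalid: use Y
    # Hits and invalid transactions evict nothing and leave the LRU alone.
    return dp0_evict, dp1_evict, (lru + inc) % num_ways

# DP0 misses while DP1 hits exactly the LRU way (Y=4), so DP0 evicts Y+1
# and the LRU value advances by two.
print(choose_evictions(4, 16, True, False, None, True, True, 4))   # (5, None, 6)
```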
  • second table 604 illustrates the incrementation of the LRU value when the first data path DP0 and/or the second data path DP1 is allocating data into the victim storage 218. For example, when a read-miss occurs, the main storage 214 allocates a line of data to the victim storage 218 utilizing one of the data paths.
  • the second table 604 includes a first valid column 626, a second valid column 628, a first allocate column 630, a second allocate column 632, a first LRU interference column 634, a second LRU interference column 636, and an LRU increment column 638.
  • the example first valid column 626 corresponds to the validity of the second data path transaction. For example, a zero (0) indicates the DP1 transaction is invalid and a one (1) indicates that the DP1 transaction is valid.
  • the example second valid column 628 corresponds to the validity of the first data path transaction. For example, a zero (0) indicates the DP0 transaction is invalid and a one (1) indicates that the DP0 transaction is valid.
  • the example first allocate column 630 indicates the allocation status of the second data path DP1.
  • the allocation status corresponds to allocation of data from the main storage 214 to the victim storage 218 in a clock cycle. For example, a zero (0) indicates that the second data path DP1 is not allocating data into the victim storage 218 and a one (1) indicates that the second data path DP1 is allocating data into the victim storage 218.
  • the example second allocate column 632 indicates the allocation status of the first data path DP0. For example, a zero (0) indicates that the first data path DP0 is not allocating data into the victim storage 218 and a one (1) indicates that the first data path DP0 is allocating data into the victim storage 218.
  • When a data path is allocating data into the victim storage 218, the data path evicts a way (e.g., slot, block, etc.) to make room for the data being allocated.
  • data is allocated to the victim storage 218 when a read-miss occurs in the main storage 214.
  • the first LRU interference column 634 indicates whether the first data path DP0 hits the same location in the victim storage 218 as the location of the second data path allocate pointer. For example, the address of the first data path DP0 is located in the least recently used location of the victim storage 218.
  • the first LRU interference column 634 includes a one (1) to indicate that the first data path DP0 hit location equals the location of the second data path DP1 allocate pointer.
  • the second LRU interference column 636 indicates whether the second data path DP1 hits the same location in the victim storage 218 as the location of the first data path allocate pointer. For example, the address of the second data path DP1 is located in the least recently used location of the victim storage 218.
  • the second LRU interference column 636 includes a one (1) to indicate that the second data path DP1 hit location equals the location of the first data path allocate pointer.
  • the first data path allocate pointer points to the location Y (LRU value) when DP0 is to allocate and the second data path allocate pointer points to the location Y+1 (next LRU value) when the DP1 is to allocate.
  • the pointers notify the cache controller 220 to evict a portion of the victim storage 218 to the higher level caches (e.g., L2 112, L3 114, extended memory 106).
  • the example replacement policy component 308 may initialize the first data path allocate pointer to point to location Y (LRU portion) and initialize the second data path allocate pointer to point to Y+1 (next LRU portion).
  • the LRU increment column 638 indicates the incrementation of the LRU value, Y.
  • the replacement policy component 308 increments the LRU value by one (e.g., Y+1), by two (e.g., Y+2), or by nothing (e.g., Y).
  • the incrementation of the LRU value depends on the status of the data paths DP0 and DP1.
  • both the first data path DP0 and the second data path DP1 include valid transactions.
  • the example replacement policy component 308 determines if any of the data paths are allocating. For example, the cache controller 220 sends information to the replacement policy component 308 when the main storage 214 needs to allocate data.
  • the first data path DP0 is allocating data (e.g., moving data from the main storage 214 to the victim storage 218)
  • the first data path DP0 evicts data (e.g., indicated by the first data path allocate pointer) from the victim storage 218.
  • the replacement policy component 308 determines whether the second data path DP1 was a hit and where the hit location is. For example, the replacement policy component 308 analyzes the location of the address of the second data path DP1 and determines if that location matches the location of the first data path allocate pointer.
  • the replacement policy component 308 updates the first data path allocate pointer to point to the next LRU value (Y+1) (e.g., notifies the cache controller 220 to evict data of next LRU value).
  • the second data path DP1 reads/writes from the hit location Y and the first data path DP0 evicts data of the LRU location Y+1.
  • the first data path DP0 does not evict the read/write data of DP1.
  • the replacement policy component 308 increments the first data path allocate pointer by two and the second data path allocate pointer by two. For example, the replacement policy component 308 increments LRU value (Y) by two and the next LRU value (Y+1) by two because DP0 just evicted location Y+1, and therefore, the new LRU value will be Y+2. This operation is illustrated at row 640.
  • the replacement policy component 308 If the second data path hit location is not equal to the location of the first data path allocated pointer (e.g., DP1 hit location does not equal Y), the replacement policy component 308 notifies the cache controller 220 that location Y is to be evicted. In this manner, the cache controller 220 evicts data from the location Y in the victim storage 218. After eviction has occurred (e.g., eviction of data from Y in the victim storage 218), the replacement policy component 308 increments the first data path allocate pointer by one and the second data path allocate pointer by one.
  • the second data path DP1 is allocating data (e.g., moving data from the main storage 214 to the victim storage 218) and the second data path DP1 evicts data (e.g., indicated by the second data path allocate pointer) from the victim storage 218.
  • the replacement policy component 308 determines whether the first data path DP0 was a hit and where the hit location is. For example, the replacement policy component 308 analyzes the location of the address of the first data path DP0 and determines if that location matches the location of the second data path allocate pointer.
  • the replacement policy component 308 updates the second data path allocate pointer to point to the LRU value (Y) (e.g., notifies the cache controller 220 to evict data of LRU value).
  • the first data path DP0 reads/writes from the hit location Y+1 and the second data path DP1 evicts data of the LRU location Y.
  • the second data path DP1 does not evict the read/write data of DP0.
  • the replacement policy component 308 increments the first data path allocate pointer by one and the second data path allocate pointer by one. For example, the replacement policy component 308 increments LRU value (Y) by one and the next LRU value (Y+1) by one because DP1 just evicted location Y, and therefore, the new LRU value will be Y+1. This operation is illustrated at row 644.
  • the replacement policy component 308 If the first data path hit location is not equal to the location of the second data path allocated pointer (e.g., DP0 hit location does not equal Y+1), the replacement policy component 308 notifies the cache controller 220 that location Y+1 is to be evicted. In this manner, the cache controller 220 evicts data from the location Y+1 in the victim storage 218. After eviction has occurred (e.g., eviction of data from Y+1 in the victim storage 218), the replacement policy component 308 increments the first data path allocate pointer by two and the second data path allocate pointer by two.
  • the replacement policy component 308 increments the LRU value (Y) by two and the next LRU value (Y+1) by two because DP1 just evicted location Y+1, and therefore, the new LRU value will be Y+2. This operation is illustrated at row 646.
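  • A minimal sketch of the allocation-path pointer updates summarized in the second table 604 follows, under the assumption that only one data path allocates in a given cycle; the name allocate_evictions and its arguments are illustrative.

```python
# Minimal sketch of the allocation-path pointer updates: the allocating
# data path evicts the way its allocate pointer names, dodging the way
# the other data path hit; only one allocating data path is modeled.

def allocate_evictions(lru, num_ways, dp0_alloc, dp1_alloc,
                       dp0_hit_way=None, dp1_hit_way=None):
    y, y1 = lru % num_ways, (lru + 1) % num_ways
    evict, inc = None, 0
    if dp0_alloc:
        if dp1_hit_way == y:       # DP1 hit the LRU way: skip past it
            evict, inc = y1, 2
        else:
            evict, inc = y, 1
    elif dp1_alloc:
        if dp0_hit_way == y1:      # DP0 hit the next LRU way: fall back to Y
            evict, inc = y, 1
        else:
            evict, inc = y1, 2
    return evict, (lru + inc) % num_ways

# DP1 allocates while DP0 hits way Y+1 (=7), so DP1 evicts Y (=6) and the
# LRU value advances by one.
print(allocate_evictions(lru=6, num_ways=16, dp0_alloc=False,
                         dp1_alloc=True, dp0_hit_way=7))   # (6, 7)
```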
  • Example methods, apparatus, systems, and articles of manufacture for eviction in a victim storage are described herein. Further examples and combinations thereof include the following:
  • Example 1 includes an apparatus comprising a cache storage, a cache controller operable to receive a first memory operation and a second memory operation concurrently, comparison logic operable to identify if the first and second memory operations missed in the cache storage, and a replacement policy component operable to, when at least one of the first and second memory operations corresponds to a miss in the cache storage, reserve an entry in the cache storage to evict based on the first and second memory operations.
  • Example 2 includes the apparatus of example 1, wherein the replacement policy component is to speculatively lock the entry in the cache storage for eviction.
  • Example 3 includes the apparatus of example 1, wherein the replacement policy component is operable to store an indicator of a first way of the cache storage to be replaced, in response to the first memory operation missing in the cache storage and the second memory operation hitting in the cache storage determine whether the second memory operation is directed to the first way of the cache storage indicated by the indicator, and increment the indicator to indicate a second way of the cache storage based on the second memory operation being directed to the first way of the cache storage, and causing the second way of the cache storage to be evicted based on the first memory operation and the incremented indicator.
  • Example 4 includes the apparatus of example 1, wherein the replacement policy component is operable to store an indicator of a second way of the cache storage to be replaced, in response to the second memory operation missing in the cache storage and the first memory operation hitting in the cache storage determine whether the first memory operation is directed to the second way of the cache storage indicated by the indicator, and decrement the indicator to indicate a first way of the cache storage based on the first memory operation being directed to the second way of the cache storage, and causing the first way of the cache storage to be evicted based on the second memory operation and the decremented indicator.
  • Example 5 includes the apparatus of example 1, wherein the replacement policy component is operable to store a first indicator of a first way and a second indicator of a second way of the cache storage to be replaced, in response to the first memory operation missing in the cache storage and the second memory operation missing in the cache storage causing the first way of the cache storage to be evicted based on the first memory operation and the second way of the cache storage to be evicted based on the second memory operation.
  • Example 6 includes the apparatus of example 5, wherein the replacement policy component is operable to increment the first indicator by two locations and the second indicator by two locations after the first way and the second way of the cache storage are evicted.
  • Example 7 includes the apparatus of example 1, wherein the cache storage is a victim cache storage.
  • Example 8 includes the apparatus of example 1, further including a first interface and a second interface, the first interface to obtain the first memory operation from a central processing unit and the second interface to obtain the second memory operation from the central processing unit, the first interface and the second interface coupled to the comparison logic and the cache controller.
  • Example 9 includes the apparatus of example 8, wherein the first interface is a vector interface and the second interface is a scalar interface.
  • Example 10 includes a method comprising receiving a first memory operation and a second memory operation concurrently, identifying if the first and second memory operations missed in a cache storage, and when at least one of the first and second memory operations corresponds to a miss in the cache storage, reserving an entry in the cache storage to evict based on the first and second memory operations.
  • Example 11 includes the method of example 10, further including speculatively locking the entry in the cache storage for eviction.
  • Example 12 includes the method of example 10, further including storing an indicator of a first way of the cache storage to be replaced, in response to the first memory operation missing in the cache storage and the second memory operation hitting in the cache storage determining whether the second memory operation is directed to the first way of the cache storage indicated by the indicator, and incrementing the indicator to indicate a second way of the cache storage based on the second memory operation being directed to the first way of the cache storage, and causing the second way of the cache storage to be evicted based on the first memory operation and the incremented indicator.
  • Example 14 includes the method of example 10, further including storing a first indicator of a first way and a second indicator of a second way of the cache storage to be replaced, in response to the first memory operation missing in the cache storage and the second memory operation missing in the cache storage causing the first way of the cache storage to be evicted based on the first memory operation and the second way of the cache storage to be evicted based on the second memory operation.
  • Example 15 includes the method of example 14, further including incrementing the first indicator by two locations and the second indicator by two locations after the first way and the second way of the cache storage are evicted.
  • Example 16 includes a system comprising a central processing unit to concurrently output a first memory operation and a second memory operation, a cache coupled to the central processing unit, the cache further including a cache storage, a cache controller operable to receive a first memory operation and a second memory operation concurrently, comparison logic operable to identify if the first and second memory operations missed in the cache storage, and a replacement policy component operable to, when at least one of the first and second memory operations corresponds to a miss in the cache storage, reserve an entry in the cache storage to evict based on the first and second memory operations.
  • Example 17 includes the system of example 16, wherein the cache storage is a first cache storage, the cache further including a second cache storage coupled in parallel with the first cache storage.
  • Example 18 includes the system of example 16, wherein the cache storage is a victim cache storage.
  • Example 19 includes the system of example 16, wherein the cache further includes a first interface and a second interface, the first interface is a 64-bit wide bidirectional scalar interface and the second interface is a 512-bit wide vector interface.
  • Example 20 includes the system of example 16, wherein the replacement policy component is operable to adjust the entry reservations in the cache storage based on 1) a validity of the first and second memory operations, 2) whether the cache storage stores data for the first and second memory operations, and 3) whether the first and second memory operations are to allocate data to the cache storage or write data to the cache storage.
  • FIG. 11A is an example circuit implementation of the victim cache store queue 216 of FIGS. 2 and/or 3. In FIG. 11A , the victim cache store queue 216 includes example latches 1102a, 1102b, 1102c, 1102d, 1102e, example merge circuits 1103a-c, an example arithmetic component 1104, an example atomic compare component 1106, an example read-modify-write merge component 1108, an example select multiplexer 1110, an example ECC generator 1112, an example arbitration manager 1114, an example pending store address data store 1116, an example priority multiplexer 1118, and an example write port 1126.
  • the example merge circuits 1103a-d include an example comparator(s) 1120, and example switches 1122.
  • FIG. 11A illustrates a single pipeline of the victim cache store queue 216. However, the victim cache store queue 216 may be arranged to support more than one independent copy of the pipeline with respect to different banks as indicated by the dashed box 1100. Accordingly, the pipeline of FIG. 11A may be reproduced multiple times for different banks, as further described below.
  • Some monolithic storage devices do not support multiple accesses by a processor (e.g., a CPU) during the same clock cycle.
  • a request to access data in a single main storage can lock up the entire single main storage.
  • there is a single register file capable of supporting one full cache line access per clock cycle.
  • an entire cache line associated with the single main storage can be locked to service the request because the single register file is allocated to the storage data bank that received such a request.
  • FIG. 7B is a schematic illustration of a second example encapsulated data cache system 710.
  • the second encapsulated data cache system 710 can be an example circuit implementation of the L1 cache 110 of FIG. 1 or portion(s) thereof, and/or, more generally, the data cache 108 of FIG. 1 or portion(s) thereof.
  • the second encapsulated data cache system 710 is encapsulated to provide a unified storage view to an external system (e.g., one or more CPUs, one or more processors, external hardware, etc.).
  • the multi-bank structure of the main cache store queue 212, the main storage 214, and/or, more generally, the second encapsulated data cache system 710 can service read and write operations that are sent to the banks in parallel.
  • each bank arbitrates its own processes in response to the read and/or write operations.
  • operation of the second encapsulated data cache system 710 is more efficient since an entire cache line is not locked up when a request is received. Rather, only the portion of the cache line allocated to the bank that received such a request would be locked.
  • FIG. 7D depicts an example main cache multi-bank structure 730.
  • the L1 cache 110, the L2 cache 112, and/or the L3 cache 114 of FIG. 1 can have the main cache multi-bank structure 730.
  • the main cache store queue 212 of FIG. 2 and/or the main storage 214 of FIG. 2 can have the main cache multi-bank structure 730.
  • the main cache multi-bank structure 730 can be an example implementation of the main cache store queue 212 and/or the main storage 214.
  • the main cache multi-bank structure 730 is a data or memory structure that includes 16 example banks (Banks 0-15) 732, with each of the banks 732 having a data width of 64 bits (e.g., bytes 0-7).
  • the main cache multi-bank structure 730 is independently addressable by bank. For example, the first row of the rows 734 has a starting row address of 0 and an ending row address of 127, a second row of the rows 734 has a starting row address of 128 and an ending row address of 255, etc.
  • a cache line can be 128 bytes of data that fits in a width of memory (e.g., DRAM) or storage unit (e.g., the main storage 214, the victim storage 218, etc.).
  • a cache line can consume an entire row of the main cache bank structure 730.
  • a cache line can use one of the rows 734 of 16 banks, where each bank is 8 bytes wide.
  • the main cache bank structure 730 can enable 16 different cache lines to access data stored therein.
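  • As an illustration of the bank addressing described above, the following Python sketch splits a byte address into a row, a bank (0-15), and a byte offset within the 8-byte-wide bank; the helper name split_address is an assumption for illustration.

```python
# Illustrative address split for the 16-bank, 8-bytes-per-bank layout:
# the low three bits select the byte within a bank, the next four bits
# select the bank, and the remaining bits select the 128-byte row.

BANK_WIDTH_BYTES = 8      # bytes 0-7 per bank
NUM_BANKS = 16            # Banks 0-15, so one row spans 128 bytes

def split_address(addr):
    byte_in_bank = addr & (BANK_WIDTH_BYTES - 1)
    bank = (addr >> 3) & (NUM_BANKS - 1)
    row = addr >> 7
    return row, bank, byte_in_bank

# Address 0x8C falls in row 1, bank 1, byte 4 of that bank.
print(split_address(0x8C))   # (1, 1, 4)
```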
  • FIG. 7E depicts an example unified cache multi-bank structure 740.
  • the L1 cache 110, the L2 cache 112, and/or the L3 cache 114 of FIG. 1 can have the unified cache bank structure 740.
  • the main cache store queue 212 of FIG. 2 , the main storage 214 of FIG. 2 , the victim cache store queue 216 of FIG. 2 , and/or the victim storage 218 of FIG. 2 can have the unified cache multi-bank structure 740.
  • the unified cache multi-bank structure 740 can be an example implementation of the main cache store queue 212, the main storage 214, the victim cache store queue 216, and/or the victim storage 218.
  • the unified cache multi-bank structure 740 is a data or memory structure that includes 16 example banks (Banks 0-15) 742, with each of the banks 742 having a data width of 64 bits (e.g., bytes 0-7).
  • the unified cache multi-bank structure 740 is independently addressable by bank. For example, the first row of the rows 744 has a starting row address of 0 and an ending row address of 127, a second row of the rows 744 has a starting row address of 128 and an ending row address of 255, etc.
  • the address processing components 302a-c of FIGS. 3A-3D can use a memory address from the store instruction to determine which banks of the main cache store queue 212, the main storage 214, the victim cache store queue 216, and/or the victim storage 218 of FIG. 2 are needed for the first store instruction 902. For example, the address processing components 302a-c can determine that Addr 0 of the first store instruction 902 is indicative of 8 entire banks (e.g., 8 of the banks 722 of FIG. 7C ) to be read from and written to. In such examples, the address processing components 302a-c can determine that the number or quantity of banks to read from is 0x00FF and the number or quantity of banks to write to is 0x00FF.
  • the address processing components 302a-c can determine that Banks 0-7 of FIG. 7C need to be accessed, where each of the banks has a corresponding bit (e.g., a first bit for Bank 7, a second bit for Bank 6, a third bit for Bank 5, etc.).
  • the corresponding bit position has a bit value of 1 when the respective bank needs to be accessed and a bit value of 0 otherwise.
  • the address processing components 302a-c can generate an address for the number of banks read of 0x00FF, which is 11111111 in binary, based on each of the bits for Banks 0-7 having a 1 value (e.g., Bank 7 is 1, Bank 6 is 1, etc.) indicative of that respective bank needed to be accessed for the first store instruction 902.
  • the bank processing logic 303 of FIGS. 3A-3D detects whether incoming store instructions, such as the first store instruction 902, indicate a write of an entire bank, or a write of a partial bank.
  • the bank processing logic 303 can determine that, since all of the needed banks are to be completely overwritten, none of the banks need to be read from first. For example, the bank processing logic 303 can determine that the number of banks read is 0x0000, which is all zeros in binary and is indicative of none of the banks needing to be read from.
  • the bank processing logic 303 can reduce the number of banks to read from and, thus, improve efficiency and/or otherwise optimize operation of the main cache store queue 212, the main storage 214, and/or, more generally, the encapsulated data cache system 700 of FIG. 7 by executing fewer read operations compared to previous implementations of cache systems, as sketched below.
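  • A rough sketch of this bank-mask computation follows: a store covering a span of bytes sets the write mask for every bank it touches, and only banks that are written partially (fewer than all 8 bytes) are added to the read mask for a read-modify-write; the function name bank_masks is an assumption for illustration.

```python
# Hedged sketch of the bank-mask computation: a store covering a span of
# bytes sets the write mask for every bank it touches, and only banks it
# writes partially need a prior read for the read-modify-write.

NUM_BANKS, BANK_BYTES = 16, 8

def bank_masks(start_byte, size_bytes):
    write_mask = read_mask = 0
    end_byte = start_byte + size_bytes                  # exclusive
    for bank in range(NUM_BANKS):
        lo, hi = bank * BANK_BYTES, (bank + 1) * BANK_BYTES
        overlap = max(0, min(end_byte, hi) - max(start_byte, lo))
        if overlap:
            write_mask |= 1 << bank
            if overlap < BANK_BYTES:                    # partial write
                read_mask |= 1 << bank
    return write_mask, read_mask

# A 64-byte store aligned to byte 0 fully covers Banks 0-7: write mask
# 0x00FF with no banks to read (0x0000).
print([hex(m) for m in bank_masks(0, 64)])   # ['0xff', '0x0']
# The same store starting at byte 3 spills into Bank 8 and only partially
# covers Banks 0 and 8, so those two banks must be read first.
print([hex(m) for m in bank_masks(3, 64)])   # ['0x1ff', '0x101']
```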
  • the address processing components 302a-c can determine that Addr 3 of the second store instruction 912 is indicative of 8 entire banks (e.g., 8 of the banks 702 of FIG. 7 ) to be read from and written to. In such examples, the address processing components 302a-c can determine that the number or quantity of banks to read from is 0x01FF and the number or quantity of banks to write to is 0x01FF. For example, the address processing components 302a-c can determine that Banks 0-8 of FIG. 7 need to be accessed, where each of the banks has a corresponding bit (e.g., a first bit for Bank 8, a second bit for Bank 7, a third bit for Bank 6, etc.).
  • the address processing components 302a-c can generate an address for the number of banks read of 0x01FF, which is 111111111 in binary, based on each of the bits for Banks 0-8 having a 1 value (e.g., Bank 8 is 1, Bank 7 is 1, etc.) indicative of that respective bank needed to be accessed for the second store instruction 912.
  • the address processing components 302a-c can generate an address for the number of banks read of 0xC07F, which is 1100000001111111 in binary, based on each of the bits for Banks 0-6 and 14-15 having a 1 value (e.g., Bank 15 is 1, Bank 14 is 1, Bank 6 is 1, etc.) indicative of that respective bank needed to be accessed for the third store instruction 922.
  • the data cache system 1000 includes example arbitration logic 1008, 1010, and example multiplexer logic 1012, 1014, 1016.
  • the arbitration logic 1008, 1010 includes first example arbitration logic (e.g., a first arbiter) 1008 and second example arbitration logic (e.g., a second arbiter) 1010.
  • the first arbitration logic 1008 is a main storage read/write arbiter (MS R/W ARB[i]) and the second arbitration logic 1010 is a main cache store queue write arbiter (STQ WRITE ARB[i]).
  • the example arbitration logic 1008, 1010 of the illustrated example of FIG. 10A is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc.
  • the multiplexer logic 1012, 1014, 1016 includes a first example multiplexer (MUX1[i]) (e.g., a first multiplexer logic circuit) 1012, a second example multiplexer (MUX2[i]) (e.g., a second multiplexer logic circuit) 1014, and a third example multiplexer (MUX3[i]) (e.g., a third multiplexer logic circuit) 1016.
  • the multiplexers 1012, 1014, 1016 have a select input (SEL[i]), data inputs (1-3), and an output.
  • the first data inputs (data input 1) of the multiplexers 1012, 1014, 1016 are coupled to the first address processing logic 1001 and the second address processing logic 1003.
  • the second data inputs (data input 2) of the multiplexers 1012, 1014, 1016 are coupled to the first address processing logic 1001 and the second address processing logic 1003.
  • the third data inputs (data input 3) of the multiplexers 1012, 1014, 1016 are coupled to the first address processing logic 1001 and the second address processing logic 1003.
  • the select input of the first multiplexer 1012 is coupled to an output of the second arbitration logic 1010.
  • the select input of the second multiplexer 1014 and the select input of the third multiplexer 1016 are coupled to outputs of the first arbitration logic 1008.
  • the output of the first multiplexer 1012 is coupled to an example write port (WRITE PORT[i]) 1024 of the first main cache store bank 1004.
  • the output of the second multiplexer 1014 is coupled to an example read port (READ PORT[i]) 1026 of the first main cache store bank 1004.
  • the output of the third multiplexer 1016 is coupled to an example read/write port (READ/WRITE PORT[i]) 1028 of the first main storage bank 1006.
  • the first arbitration logic 1008 is coupled to the first address processing logic 1001, the second address processing logic 1003, the second arbitration logic 1010, and outputs of the first main cache store queue bank 1004.
  • STQ[i] of FIG. 10A is representative of a single bank of a multi-bank implementation of the main cache store queue 212.
  • the main cache store queue 212 can have STQ[0]-STQ[15] representative of the main cache store queue 212 having 16 banks.
  • each of STQ[0]-STQ[15] can store 64 bits (i.e., 8 bytes).
  • STQ[0]-STQ[15], and/or, more generally, the main cache store queue 212 can store 24,576 bits (i.e., 3072 bytes).
  • each of STQ[0]-STQ[15] may store a different quantity of bits and, thus, the main cache store queue may store a different quantity of bits.
  • a plurality of the banks including the first bank 1002 can be encapsulated to form and/or otherwise generate an encapsulated data cache system 1034.
  • the encapsulated data cache system 1034 can be an example implementation of the encapsulated data cache system 700 of FIG. 7B .
  • each corresponding bank of the main cache store queue 212 and the main storage 214 can be encapsulated together to form and/or otherwise generate example encapsulated data cache banks 1036 for simplification when interacting with external system(s).
  • Each of the encapsulated data cache banks 1036 includes an example encapsulated write port (WRITE PORT (STQ[i])) 1038, an example encapsulated read port (READ PORT (STQ[i])) 1040, and an example encapsulated read/write port (READ/WRITE PORT (MS[i])) 1042.
  • the first address processing logic 1001 and/or the second address processing logic 1003 can obtain example store instructions 1018, 1020, 1022 from one(s) of the interface(s) of FIG. 1 of the CPU 102 of FIG. 1 .
  • the store instructions 1018, 1020, 1022 include example data, such as WDATA, ADDR, BYTEN, SIZE, and R/W.
  • WDATA corresponds to data (e.g., 64 bits of data) to be written and/or otherwise stored in at least one of the main cache store queue 212 or the main storage 214.
  • ADDR corresponds to a data address associated with at least one of the main cache store queue 212 or the main storage 214.
  • BYTEN corresponds to byte enable data.
  • SIZE corresponds to a data size of a data access operation (e.g., a read operation, a write operation, a modify operation, etc., and/or a combination thereof).
  • R/W corresponds to whether the store instruction is a read operation or a write operation.
  • the store instructions 1018, 1020, 1022 include a first example store instruction (SCALAR_DP (DP0)) 1018, a second example store instruction (DMA) 1020, and a third example store instruction (VECTOR_DP (DP1)) 1022.
  • the first store instruction 1018 is transmitted from the scalar interface of FIG. 1 (e.g., the CPU interface 202 of FIG. 2 ) and, thus, corresponds to a scalar data path (SCALAR_DP (DP0)) of the data cache system 1000.
  • the second store instruction 1020 is transmitted from the memory interface of FIG. 1 , such as a direct memory access (DMA) interface and, thus, corresponds to a DMA data path (DMA).
  • the third store instruction 1022 is transmitted from the vector interface of FIG. 1 (e.g., the CPU interface 202 of FIG. 2 ) and, thus, corresponds to a vector data path (VECTOR_DP (DP1)) of the data cache system 1000.
  • the first address processing logic 1001 and/or the second address processing logic 1003 generate transaction data (TRANSACTION_DP0[i], TRANSACTION_DMA[i], TRANSACTION_DP1[i]) that can be used to execute a data access operation associated with at least one of the main cache store queue 212 or the main storage 214.
  • the first address processing logic 1001 can extract, and in some examples rotate, the WDATA from respective one(s) of the store instructions 1018, 1020, 1022 and transmit the extracted and/or rotated WDATA to a respective first input of the multiplexers 1012, 1014, 1016.
  • the first address processing logic 1001 can extract and rotate first WDATA from the first store instruction 1018 and transmit the first extracted and rotated WDATA to the first input of the first multiplexer 1012, the first input of the second multiplexer 1014, and the first input of the third multiplexer 1016.
  • the second address processing logic 1003 can determine an address (MS/STQ_ADDR[i]) for one or more of the 16 banks of at least one of the store queue 212 or the main storage 214. The address can be based on the ADDR data included in the store instructions 1018, 1020, 1022.
  • the second address processing logic 1003 can determine a byte enable value per bank (BYTEN/BANK[i]) based on the BYTEN data included in the store instructions 1018, 1020, 1022.
  • the second address processing logic 1003 can determine a write bank request (WR_BANK_REQ[i]) and/or a read bank request (RD_BANK_REQ[i]) based on the R/W data included in the store instructions 1018, 1020, 1022.
  • the first address processing logic 1001 and/or the second address processing logic 1003 can determine transaction data for respective ones of the store instructions 1018, 1020, 1022.
  • the transaction data can include the rotated WDATA data, MS/STQ_ADDR[i], and BYTEN/BANK[i].
  • the first address processing logic 1001 and/or the second address processing logic 1003 can generate first transaction data (TRANSACTION_DP0[i]) based on the first store instruction 1018, second transaction data (TRANSACTION_DMA[i]) based on the second store instruction 1020, and third transaction data (TRANSACTION_DP1[i]) based on the third store instruction 1022.
  • the first address processing logic 1001 and/or the second address processing logic 1003 can transmit the first transaction data to the first inputs of the multiplexers 1012, 1014, 1016, the second transaction data to the second inputs of the multiplexers 1012, 1014, 1016, and the third transaction data to the third inputs of the multiplexers 1012, 1014, 1016.
  • the first address processing logic 1001 and the second address processing logic 1003 obtain the store instructions 1018, 1020, 1022.
  • the first address processing logic 1001 and the second address processing logic 1003 generate the first through third transaction data based on respective ones of the store instructions 1018, 1020, 1022.
  • the first address processing logic 1001 and the second address processing logic 1003 transmit the first through third transaction data to the multiplexers 1012, 1014, 1016.
  • the second address processing logic 1003 transmits either a read bank request or a write bank request corresponding to each of the store instructions 1018, 1020, 1022.
  • the first arbitration logic 1008 determines whether one(s) of the store instructions 1018, 1020, 1022 are requesting to read one or more banks of the main cache store queue 212 or write to one or more banks of the main storage 214. In example operating conditions, the first arbitration logic 1008 prioritizes read operations over write operations.
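  • As a rough illustration of the read-over-write priority described above, the following C sketch models a per-bank arbiter that grants any pending read bank request before any pending write bank request; the structure and function names are hypothetical and not taken from the design itself.

```c
#include <stdbool.h>

/* Hypothetical per-bank request state for one data path (DP0, DMA, DP1). */
typedef struct {
    bool rd_bank_req;   /* RD_BANK_REQ[i] asserted for this bank */
    bool wr_bank_req;   /* WR_BANK_REQ[i] asserted for this bank */
} bank_req_t;

typedef enum { GRANT_NONE, GRANT_READ, GRANT_WRITE } grant_t;

/* Reads are prioritized over writes: if any data path requests a read of
 * this bank, a read is granted first and write requests are stalled. */
grant_t arbitrate_bank(const bank_req_t reqs[], int num_paths)
{
    for (int i = 0; i < num_paths; i++)
        if (reqs[i].rd_bank_req)
            return GRANT_READ;
    for (int i = 0; i < num_paths; i++)
        if (reqs[i].wr_bank_req)
            return GRANT_WRITE;
    return GRANT_NONE;
}
```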
  • the bank(s) of the store queue 212 can generate an example store queue full signal (FULL_SIG[i]) 1030 in response to the store queue 212 being full.
  • the bank(s) of the store queue 212 can generate an example complete data write signal (COMPLETE_DATA_WR_SIG[i]) 1032.
  • the first store instruction 1018 can correspond to a write operation for Banks 0-4, the second store instruction 1020 can correspond to a read operation of Banks 5-9, and the third store instruction 1022 can correspond to a read operation of Banks 10-14.
  • the second arbitration logic 1010 can assign DP0 to transmit the first transaction data to the write port 1024 of Banks 0-4 (e.g., WRITE PORT[0], WRITE PORT[1], WRITE PORT[2], etc.) because no other data paths are requesting a write operation to be serviced.
  • the second arbitration logic 1010 can assign DP0 by generating a signal (SEL[i]) to instruct the first multiplexer 1012 to select the first transaction data.
  • the first arbitration logic 1008 can assign DMA to transmit the second transaction data to the read port 1026 of Banks 5-9 (e.g., READ PORT[4], READ PORT[5], READ PORT[6], etc.) because no other data paths are requesting a read operation to be serviced in connection with Banks 5-9.
  • the first arbitration logic 1008 can assign DMA by generating a signal (SEL[i]) to instruct the second multiplexer 1014 to select the second transaction data.
  • the first arbitration logic 1008 can delay and/or otherwise stall the write operation. For example, if a first portion of the write operation is associated with writing to the main cache store queue 212 and a second portion of the write operation is associated with reading from the main storage 214, the first arbitration logic 1008 can instruct the second arbitration logic 1010 to not service and/or otherwise not assign the first transaction data to the write port 1024.
  • the main cache store queue 212 can instruct the first arbitration logic 1008 to service the write operation when the complete data has been assembled for writing. For example, if a first portion of the write operation is associated with writing to the main cache store queue 212 and a second portion of the write operation is associated with reading from at least one of the main cache store queue 212 or the main storage 214, the first arbitration logic 1008 can wait to assign the first transaction data to the read/write port 1028. In such examples, in response to locating data associated with the second portion in the main cache store queue 212, the main cache store queue 212 can deliver the located data to the main storage 214.
  • the main cache store queue 212 can generate a signal (e.g., assert a logic high signal) for COMPLETE_DATA_WR_SIG[i] instructing the first arbitration logic 1008 to service the write operation because the complete set of data required for the write operation has been read and/or otherwise assembled for servicing.
  • FIG. 10B is a schematic illustration of an example data cache system 1000b.
  • the data cache system 1000b can be an example implementation of the L1 cache 110 of FIGS. 1, 2 , and/or 3, or portion(s) thereof.
  • the data cache system 1000b includes a first example bank (ENCAPSULATED DATA CACHE SYSTEM BANK[i]) 1002b of the encapsulated data cache system 700 of FIG. 7A .
  • the first bank 1002b can correspond to VICTIM CACHE STORE QUEUE: BANK 1 and VICTIM STORAGE: BANK 1 of FIG. 7A .
  • the first bank 1002b includes a first example victim cache store queue bank 1004b of the victim cache store queue 216 of FIG.
  • the first bank 1002b includes a first example victim storage bank 1006b of the victim storage 218 of FIG. 2 , which can be an example implementation of VICTIM STORAGE: BANK 1 of FIG. 7A .
  • the data cache system 1000b includes first example address processing logic 1001b, second example address processing logic 1003b, example arbitration logic 1008b, 1010b, and example multiplexer logic 1012b, 1014b, 1016b.
  • the arbitration logic 1008b, 1010b includes first example arbitration logic (e.g., a first arbiter) 1008b and second example arbitration logic (e.g., a second arbiter) 1010b.
  • the first arbitration logic 1008b is a victim storage read/write arbiter (VS R/W ARB[i]) and the second arbitration logic 1010b is a victim cache store queue (STQ_V WRITE ARB[i]).
  • the example arbitration logic 1008b, 1010b of the illustrated example of FIG. 10B is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc.
  • the example multiplexer logic 1012b, 1014b, 1016b of the illustrated example of FIG. 10B is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc.
  • the multiplexer logic 1012b, 1014b, 1016b includes a first example multiplexer (MUX1[i]) (e.g., a first multiplexer logic circuit) 1012b, a second example multiplexer (MUX2[i]) (e.g., a second multiplexer logic circuit) 1014b, and a third example multiplexer (MUX3[i]) (e.g., a third multiplexer logic circuit) 1016b.
  • the multiplexers 1012b, 1014b, 1016b have a select input (SEL[i]), data inputs (1-3), and an output.
  • the first data inputs (data input 1) of the multiplexers 1012b, 1014b, 1016b are coupled to the address processing logic 1001b, 1003b.
  • the second data inputs (data input 2) of the multiplexers 1012b, 1014b, 1016b are coupled to the address processing logic 1001b, 1003b.
  • the third data inputs (data input 3) of the multiplexers 1012b, 1014b, 1016b are coupled to the address processing logic 1001b, 1003b.
  • the select input of the first multiplexer 1012b is coupled to an output of the second arbitration logic 1010b.
  • the select input of the second multiplexer 1014b and the select input of the third multiplexer 1016b are coupled to outputs of the first arbitration logic 1008b.
  • the output of the first multiplexer 1012b is coupled to an example write port (WRITE PORT[i]) 1024b of the first victim cache store bank 1004b.
  • the output of the second multiplexer 1014b is coupled to an example read port (READ PORT[i]) 1026b of the first victim cache store bank 1004b.
  • the output of the third multiplexer 1016b is coupled to an example read/write port (READ/WRITE PORT[i]) 1028b of the first victim storage bank 1006b.
  • the first arbitration logic 1008b is coupled to the address processing logic 1001b, 1003b, the second arbitration logic 1010b, and outputs of the first victim cache store queue bank 1004b.
  • STQ_V[i] of FIG. 10B is representative of a single bank of a multi-bank implementation of the victim cache store queue 216.
  • the victim cache store queue 216 can have STQ_V[0]-STQ_V[15] representative of the victim cache store queue 216 having 16 banks.
  • each of STQ_V[0]-STQ_V[15] can store 64 bits (i.e., 8 bytes).
  • STQ_V[0]-STQ_V[15] and/or, more generally, the victim cache store queue 216, can store 24,576 bits (i.e., 3072 bytes).
  • each of STQ_V[0]-STQ_V[15] may store a different quantity of bits and, thus, the victim cache store queue 216 may store a different quantity of bits.
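  • One way to picture the 16-bank organization described above is a simple address split: with 64-bit (8-byte) banks, the low three address bits select a byte within a bank and the next four bits select one of the 16 banks. The decomposition below is only an illustrative sketch consistent with the stated bank count and width, not the exact address mapping of the design.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BANKS  16u   /* STQ_V[0]..STQ_V[15]   */
#define BANK_BYTES  8u   /* 64 bits per bank      */

/* Illustrative split of a byte address into a bank index and a byte offset. */
unsigned bank_index(uint32_t addr)  { return (addr / BANK_BYTES) % NUM_BANKS; }
unsigned bank_offset(uint32_t addr) { return addr % BANK_BYTES; }

int main(void)
{
    uint32_t addr = 0x44;   /* example byte address */
    printf("addr 0x%x -> bank %u, byte %u\n",
           (unsigned)addr, bank_index(addr), bank_offset(addr));
    return 0;
}
```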
  • the example latch 1102c is coupled to the latch 1102b, the priority multiplexer 1118, the arithmetic component 1104, the atomic compare component 1106, and the read-modify-write merge component 1108. This coupling enables the latch 1102c to transmit the value obtained from the read, modify, and/or write instruction (e.g., the byte value, the bit value, etc.) to the arithmetic component 1104, the atomic compare component 1106, and/or the read-modify-write merge component 1108 in response to a subsequent clock cycle of the cache controller 220.
  • while the example of FIG. 11A includes three merging circuits 1103a-c, there may be additional merging circuits to merge write operations from other sections of the victim cache store queue 216 (e.g., a merging circuit coupling the output of the latch 1102d to the output of latch 1102b and/or latch 1102a, etc.).
  • in some examples, the merging circuits 1103a-c are combined into a single circuit that compares the write operations from the different latches 1102b-d and reroutes based on matching memory addresses in any two or more of the different latches 1102b-d.
  • the arithmetic component 1104 is coupled to the latch 1102c, the first multiplexer 1110, and to the ECC logic 312 to perform arithmetic operations on (e.g., increment, decrement, etc.) data from the victim storage 218. Also, the arithmetic component 1104 performs histogram operations on the data stored in the victim storage 218.
  • the example arithmetic component 1104 of the illustrated example of FIG. 11A is implemented by a logic circuit such as, for example, a hardware processor.
  • the atomic compare component 1106 is coupled to the latch 1102c, the first multiplexer 1110, and to the ECC logic 312 to compare data at a memory address to a key and, in the event the data at the memory address matches the key, replace the data.
  • the example atomic compare component 1106 of the illustrated example of FIG. 11A is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc. Operation of the example atomic compare component 1106 is further described below.
  • the read-modify-write merge component 1108 is coupled to the latch 1102c, the first multiplexer 1110, and to the ECC logic 312 to facilitate the read, modify, and/or write instruction(s) sent by the cache controller 220.
  • the read-modify-write merge component 1108 is coupled to the ECC logic 312 to obtain the currently stored word that is to be affected by the read, modify, and/or write instruction(s).
  • the read-modify-write merge component 1108 is configured to update the currently stored word obtained from the ECC logic 312 with the new bit(s), byte(s), etc., obtained from the latch 1102c. Additional description of the read-modify-write merge component 1108 is described below.
  • the example read-modify-write merge component 1108 of the illustrated example of FIG. 11A is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc.
  • the example first multiplexer 1110 of the illustrated example of FIG. 11A is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc.
  • the example ECC generator 1112 of the illustrated example of FIG. 11A is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc.
  • the example arbitration manager 1114 is coupled to the latch 1102a, the latch 1102b, the pending store address datastore 1116, and the victim storage 218 to facilitate the read, modify, and/or write instructions obtained from the cache controller 220.
  • the arbitration manager 1114 is configured to transmit a read instruction of the corresponding currently stored word to the victim storage 218.
  • the arbitration manager 1114 is coupled to the victim storage 218 to arbitrate between conflicting accesses of the victim storage 218. When multiple operations attempt to access the victim storage 218 in the same cycle, the arbitration manager 1114 may select which operation(s) are permitted to access the victim storage 218 according to a priority scheme.
  • the example arbitration manager 1114 of the illustrated example of FIG. 11A is implemented by a logic circuit such as, for example, a hardware processor.
  • any other type of circuitry may additionally or alternatively be used such as, for example, one or more analog or digital circuit(s), logic circuits, programmable processor(s), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)), field programmable logic device(s) (FPLD(s)), digital signal processor(s) (DSP(s)), etc.
  • the pending store address data store 1116 is configured to store the address of the read, modify, and/or write instruction obtained from the cache controller 220. In this manner, the pending store address datastore 1116 maintains a log of the addresses associated with each value stored in any of the latches 1102a, 1102b, 1102c, 1102d, 1102e, 1102f, 1102g, and/or 1102h.
  • the example pending store address datastore 1116 of the illustrated example of FIG. 11A may be implemented by any device for storing data such as, for example, flash memory, magnetic media, optical media, etc.
  • the data stored in the pending store address datastore 1116 may be in any data format such as, for example, binary data, comma delimited data, tab delimited data, structured query language (SQL) structures, etc.
  • the example priority multiplexer 1118 is coupled to the latch 1102b, the latch 1102c, the latch 1102d, and the latch 1102f to facilitate read operations in the event any of the latch 1102b, the latch 1102c, the latch 1102d, or the latch 1102f is storing a value corresponding to a write instruction.
  • the cache controller may initiate the following four write instructions regarding a four-byte word having byte addresses A3, A2, A1, and A0: write address A0 with the byte 0x11, write address A1 with the byte 0x22, write address A3 with the byte 0x23, and write address A0 with the byte 0x44.
  • the priority multiplexer 1118 is configured to obtain the byte value 0x11 stored in the latch 1102f, the byte value 0x22 stored in the latch 1102d, the byte value 0x23 stored in the latch 1102c, and the byte value 0x44 stored in the latch 1102b. Also, the pending store address data store 1116 transmits an instruction to the priority multiplexer 1118 indicating which address value is associated with the byte value stored in the latch 1102b, the latch 1102c, the latch 1102d, and the latch 1102f.
  • the priority multiplexer 1118 is configured to transmit a packet to the latch 1102e indicating that address A0 is 0x44 (e.g., the most recent write instruction associated with the address A0), address A1 is 0x22, and address A3 is 0x23.
  • the MUX circuit 316 is configured to update the value of the currently stored word with the byte values obtained from the priority multiplexer 1118. Such an operation ensures that a read instruction transmitted by the victim cache store queue 216 properly indicates the correct word, even though the write instructions may not have fully propagated through the victim cache store queue 216.
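  • The newest-wins behavior of the priority multiplexer 1118 described above can be sketched in C by applying the pending byte writes to a copy of the word from oldest to newest, so a later write to the same byte address (0x44 to A0) overrides an earlier one (0x11 to A0). The structures below are hypothetical stand-ins for the latch stages and the pending store address data store 1116.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical pending-store entry: one byte write to a byte address. */
typedef struct { unsigned addr; uint8_t data; } pending_store_t;

/* Apply pending stores oldest-first so the newest write to a given byte
 * address wins, mirroring the priority multiplexer selection. */
static void forward_pending(uint8_t word[4], const pending_store_t q[], int n)
{
    for (int i = 0; i < n; i++)
        word[q[i].addr] = q[i].data;
}

int main(void)
{
    uint8_t word[4] = { 0 };                       /* bytes A0..A3 */
    const pending_store_t q[] = {                  /* oldest first */
        { 0, 0x11 }, { 1, 0x22 }, { 3, 0x23 }, { 0, 0x44 },
    };
    forward_pending(word, q, 4);
    printf("A0=0x%02x A1=0x%02x A3=0x%02x\n", word[0], word[1], word[3]);
    /* prints A0=0x44 A1=0x22 A3=0x23 */
    return 0;
}
```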
  • An example read path (e.g., the read input to the tag RAM 210) may run in parallel with the victim cache store queue 216. Because a read operation (e.g., a read instruction) may refer to data in a write operation (e.g., a write instruction) that may not have completed yet, the victim cache store queue 216 may include write forwarding functionality that allows the read path to obtain data from the victim cache store queue 216 that has not yet been written back to the victim storage 218.
  • the victim cache store queue 216 includes pending store address data store 1116 that records the addresses of the operations at each stage of the victim cache store queue 216, a priority multiplexer 1118 to select data from one of the stages (e.g., latches) of the victim cache store queue 216 for forwarding, and a MUX circuit 316 that selects between the output of the victim storage 218 (by way of the error detection and correction circuit 312) and the forwarded victim cache store queue 216 data from the data priority multiplexer 1118.
  • the example write port 1126 is coupled to the write path and the latch 1102a.
  • the write port 1126 may be implemented by an interface that interfaces with the victim cache controller 224 (e.g., the cache controller 220) to obtain a write instruction.
  • the write port 1126 is utilized to receive addresses and values from the cache controller 220 to write.
  • the L1 data cache 110 retrieves a record from the tag RAM 210 that is associated with an address of the read operation to determine whether the data is stored in the victim storage 218.
  • the L1 data cache 110 need not wait for the tag RAM 210 comparison before requesting data from the victim storage 218, and thus, the tag RAM 210 comparison between the address of the read operation and the record of cached addresses may extend into a second or third clock cycle.
  • the L1 data cache 110 may request the data and ECC syndrome bits from the victim storage 218 if the arbitration manager 1114 permits. In this cycle, the L1 data cache 110 may also determine whether newer data is available in the victim cache store queue 216 by comparing the read address to the pending store address data store 1116. If so, the priority multiplexer 1118 is set to forward the appropriate data from the victim cache store queue 216.
  • Data and ECC may be provided by the victim storage 218 in the third cycle. However, this data may or may not correspond to the memory address specified by the read operation because the L1 data cache 110 may allocate multiple extended memory addresses to the same entry in the cache's victim storage 218. Accordingly, in the third cycle, the L1 data cache 110 determines whether the provided data and ECC from the victim storage 218 corresponds to the memory address in the read operation (e.g., a cache hit) based on the comparison of the tag RAM 210 record. In the event of a cache hit, the data and ECC bits are received by the error detection and correction circuit 312, which corrects any errors in the data in a fourth cycle.
  • newer data that has not yet been written to the victim storage 218 may be present in the victim cache store queue 216, and may be forwarded from the victim cache store queue 216 by the priority multiplexer 1118. If so, the MUX circuit 316 selects the forwarded data over the corrected data from the victim storage 218.
  • Either the corrected data from the victim storage 218 or the forwarded data from the victim cache store queue 216 is provided to the L1 data cache 110 in a fifth cycle.
  • the controller 220 may provide data with full ECC checking and correction in the event of a cache hit in about 5 cycles.
  • the victim cache store queue 216 may stall until the data can be retrieved from either the extended memory 106 and/or the victim storage 218, at which point the data may be written to the victim storage and the tag RAM 210 may be updated.
  • FIG. 11B is a schematic illustration of an example unified cache store queue 1124.
  • the unified cache store queue 1124 may implement the example main cache store queue 212 and/or the victim cache store queue 216.
  • the topology illustrates example main cache read and write inputs 1128 and example victim cache read and write inputs 1130.
  • the main cache read and write inputs 1128 may implement the example read and write inputs of the main cache store queue 212 of FIGS. 2 and/or 4
  • the victim cache read and write inputs 1130 may implement the example read and write inputs of the victim cache store queue 216 of FIGS. 2 and/or 11A.
  • in the example of FIG. 11B , the main cache read and write inputs 1128 are pipelined separately from the victim cache read and write inputs 1130. Accordingly, in operation, the main cache read and write inputs 1128 and/or the victim cache read and write inputs 1130 are configured to obtain read and/or write instructions from the CPU 102. In some examples described herein, the main cache read and write inputs 1128 and the victim cache read and write inputs 1130 may be referred to as inputs of the unified cache store queue 1124.
  • the unified cache store queue 1124 includes an example first pipestage (E2) 1132, an example second pipestage (E3) 1134, and an example fourth pipestage (E4) 1136. In this manner, the unified cache store queue 1124 is coupled to the first pipestage (E2) 1132 via example data pipestages 1138 and 1140, and the main storage 214 and the victim storage 218 of FIG. 2 .
  • the CPU 102 transmits a read and/or a write instruction, which enters the unified cache store queue 1124 via the first pipestage (E2) 1132.
  • the unified cache store queue 1124 may obtain a read and/or write instruction from the CPU 102 via the cache controller 220.
  • the example L1 cache 110 compares the address of the incoming read and/or write instruction with the main cache tag ram 208 and the victim cache tag ram 210. Also, the determination of whether the read and/or write instruction is intended for the main storage 214 or the victim storage 218 is not yet known to the unified cache store queue 1124.
  • the read and/or write instruction is transmitted to the third pipestage (E3) 1134.
  • the L1 cache 110 determines, or has determined, if the read and/or write instruction is intended for the main storage 214 or the victim storage 218. Such a determination is transmitted to the unified cache store queue 1124 as a hit and/or miss signal.
  • the physical address of the main storage 214 is a function of the CPU 102 address.
  • the CPU 102 address determines which set of the direct mapped main storage 214 that maps to the CPU 102 address.
  • the size of the main storage 214 is 32 kilobytes (KB), and the cache line size is 128 bytes, totaling 256 sets.
  • the physical address of the CPU 102 may range from Address A0 to Address A255.
  • the physical address of the victim storage 218 is based on the following logic. First, the CPU 102 address is compared with all 16 entries of the victim storage 218. In the event the CPU 102 address corresponds to a hit in the victim storage 218, the location of the entry where the CPU 102 transaction hits is the physical address.
  • in the event of a miss, the replacement policy chooses a location inside the victim storage 218. Since there are 16 physical addresses in the victim storage 218, the physical address may range from A0 to A15.
  • the information corresponding to whether the CPU 102 address is a hit or a miss is sent to the unified cache store queue 1124. Based on this information, the read and/or write instruction obtained from the CPU 102 enters either the main cache store queue 212 of the unified cache store queue 1124 or the victim cache store queue 216 of the unified cache store queue 1124.
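  • The two lookups described above differ in how the physical location is derived: the direct-mapped main storage 214 computes a set index directly from the CPU address (32 KB of 128-byte lines gives 256 sets), whereas the fully associative victim storage compares the address against all 16 entries and, on a miss, falls back to a replacement choice. The following C sketch uses hypothetical names and a placeholder replacement policy.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_BYTES  128u
#define MAIN_SETS   256u   /* 32 KB / 128-byte lines        */
#define VICTIM_WAYS  16u   /* fully associative, 16 entries */

/* Direct-mapped main storage: the CPU address alone selects the set. */
unsigned main_set_index(uint32_t cpu_addr)
{
    return (cpu_addr / LINE_BYTES) % MAIN_SETS;
}

/* Fully associative victim storage: compare against all 16 entries; on a
 * miss, a replacement policy (placeholder here) picks the location. */
unsigned victim_location(uint32_t cpu_addr,
                         const uint32_t victim_tags[VICTIM_WAYS],
                         bool *hit)
{
    uint32_t line_addr = cpu_addr / LINE_BYTES;
    for (unsigned i = 0; i < VICTIM_WAYS; i++) {
        if (victim_tags[i] == line_addr) {
            *hit = true;
            return i;          /* hitting entry is the physical address */
        }
    }
    *hit = false;
    return 0;                  /* placeholder for the replacement policy */
}
```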
  • the victim storage 218 includes multiple memory banks, each bank being 64 bits wide.
  • the victim storage 218 is parallel coupled with the main storage 214.
  • the banks in the victim storage 218 include two 32-bit parity blocks.
  • 7 bits of ECC syndrome are stored for each of the 32-bit parity blocks.
  • the overall bank width is 118 bits. In other examples described herein, any suitable bank width may be utilized.
  • in the event the cache controller 220 transmits a write instruction to the victim storage 218, and in the event the write instruction is not aligned with a parity block on the victim storage 218, the cache controller 220 indicates to the victim cache store queue 216 to perform a read-modify-write operation.
  • the main storage 214 is a direct mapped cache element and the victim cache storage 218 is a fully associative cache storage. Both the direct mapped main storage 214 and the fully associative victim cache storage 218 are protected by an error correcting code (ECC).
  • example applications include reading a line from the main storage 214, in which the ECC logic would correct the cache line and regenerate the ECC syndrome, and then write the line to the victim cache storage 218.
  • Such an application may utilize two additional clock cycles of the CPU (e.g., one clock cycle for error correction by the ECC logic and another clock cycle for ECC syndrome regeneration).
  • examples described herein include utilizing the same parity block size between the main storage 214 and the victim cache storage 218.
  • both the main storage 214 and the victim cache storage 218 calculate and/or otherwise determine parity on a 32-bit boundary.
  • the L1 cache 110 can move a cache line directly from main storage 214 to the victim cache storage 218 with less latency.
  • the L1 data cache 110 supports a number of operations that read data from the cache and make changes to the data before rewriting it.
  • the L1 data cache 110 may support read-modify-write operations.
  • a read-modify-write operation reads existing data and overwrites at least a portion of the data.
  • a read-modify-write operation may be performed when writing less than a full bank width. The read functionality of the read-modify-write is used because the portion of the data in the bank that will not be overwritten still contributes to the ECC syndrome bits.
  • a read-modify-write operation may be split into a write operation and a read operation, and the victim cache store queue 216 may be structured such that the read operation in the read path stays synchronized with the write operation in the victim cache store queue 216.
  • the read operation and the write operation remain synchronized until the read-modify-write merge component 1108 overwrites at least a portion of the read data with the write data to produce merged data.
  • the merged data is provided to the ECC generator 1112 that generates new ECC syndrome bits for the merged data, and then the merged data and ECC syndrome bits may be provided to the arbitration manager 1114 for storing in the victim storage 218.
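  • A minimal model of the merge step described above: the read data supplies the bytes that are not being written, the write data overwrites the enabled byte lanes, and new ECC is then generated over the merged result. The byte-enable representation and the syndrome function below are illustrative stand-ins, not the actual ECC code used by the design.

```c
#include <stdint.h>

/* Merge write data into read data: for each byte lane, keep the write byte
 * where the byte-enable bit is set, otherwise keep the read byte. */
uint64_t rmw_merge(uint64_t read_data, uint64_t write_data, uint8_t byten)
{
    uint64_t merged = read_data;
    for (int lane = 0; lane < 8; lane++) {
        if (byten & (1u << lane)) {
            uint64_t mask = 0xFFull << (8 * lane);
            merged = (merged & ~mask) | (write_data & mask);
        }
    }
    return merged;
}

/* Illustrative stand-in for syndrome generation over a 32-bit parity block;
 * the design stores 7 syndrome bits per 32-bit block. */
uint8_t ecc_syndrome_32(uint32_t block)
{
    uint8_t s = 0;
    for (int b = 0; b < 32; b++)
        if ((block >> b) & 1u)
            s ^= (uint8_t)(b + 1);
    return s & 0x7F;
}
```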
  • the cache controller 220 of FIGS. 2 and/or 3 transmits a write request indicating byte(s) of a word, or an entire word, to be re-written.
  • the write request transmitted by the cache controller 220 includes an address value of the byte and the byte value (e.g., a set of data).
  • the victim storage 218 of FIGS. 2 and/or 3 may include the four-byte word 0x12345678 associated with addresses A3, A2, A1, A0.
  • address A3 corresponds to the byte 0x12
  • address A2 corresponds to the byte 0x34
  • address A1 corresponds to the byte 0x56
  • address A0 corresponds to the byte 0x78 of the stored word.
  • the cache controller 220 may transmit a write request to replace address A3 with the byte 0x33, replace address A1 with the byte 0x22, and replace address A0 with the byte 0x11 of the currently stored word 0x12345678.
  • the first write request to replace address A3 of the stored word with the byte 0x33 would result in the stored word becoming 0x33345678
  • the second write request to replace address A1 of the stored word with the byte 0x22 would result in the stored word becoming 0x33342278
  • the third write request to replace address A0 of the stored word with the byte 0x11 would result in the stored word becoming 0x33342211.
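  • The three partial writes above can be reproduced with simple byte masking; the helper below is hypothetical, but the addresses and values are the ones from the example.

```c
#include <stdint.h>
#include <stdio.h>

/* Replace the byte at address An (A0 = least significant byte) of a 32-bit
 * word, as in the example write requests above. */
static uint32_t write_byte(uint32_t word, unsigned addr, uint8_t value)
{
    uint32_t mask = 0xFFu << (8 * addr);
    return (word & ~mask) | ((uint32_t)value << (8 * addr));
}

int main(void)
{
    uint32_t word = 0x12345678u;
    word = write_byte(word, 3, 0x33);   /* -> 0x33345678 */
    word = write_byte(word, 1, 0x22);   /* -> 0x33342278 */
    word = write_byte(word, 0, 0x11);   /* -> 0x33342211 */
    printf("0x%08X\n", (unsigned)word);
    return 0;
}
```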
  • the cache controller 220 initiates a read request of the currently stored byte (e.g., a read request of a second set of data stored in the victim storage 218) in address A3 of the currently stored word.
  • the byte and address in the first write request (e.g., 0x33 and A3) is stored in the latch 1102b.
  • the cache controller 220 transmits a read request of the entire currently stored word to the victim storage 218.
  • a read request of the entire currently stored word is transmitted to the victim storage 218 and the byte 0x33 is stored in the first latch 1102b.
  • the read-modify-write merge component 1108 obtains the byte stored in the latch 1102c and the entire currently stored word transmitted by the ECC logic 312. In this manner, the read-modify-write merge component 1108 identifies the address of the byte in the currently stored word to be updated.
  • once the read-modify-write merge component 1108 identifies and/or otherwise obtains (a) the value (e.g., byte value, bit value, etc.) of the portion of the currently stored word to be updated from the latch 1102c and (b) the currently stored word from the ECC logic 312, the read-modify-write merge component 1108 writes (e.g., replaces, merges, etc.) the portion of the currently stored word (e.g., the second set of data) with the value of the portion of the currently stored word obtained from the latch 1102c (e.g., the first set of data). For example, the read-modify-write merge component 1108 writes the value of the portion of the word to an address value corresponding to the portion of the word in the word. In some examples described herein, such a merged set of data is provided by the read-modify-write merge component 1108 for writing to the victim storage 218.
  • Example methods, apparatus, systems, and articles of manufacture to facilitate read-modify-write support in a victim cache are described herein. Further examples and combinations thereof include the following:
  • such a write instruction may be transmitted with a corresponding read instruction, regardless of the size of the write instruction, in an attempt to execute a full read-modify-write cycle of such a write instruction.
  • a write instruction may be obtained by a CPU indicating to write 128 bits across two 64-bit memory banks, starting at address A0 of the first memory bank.
  • such an application maintains a read instruction to read the data currently stored in the two example memory banks.
  • such an approach is inefficient as twice the processing power (e.g., a write and a read instruction) is needed.
  • such an approach does not provide any control logic and/or processing circuitry to analyze the write instruction.
  • the main storage 214 and/or the victim storage 218 may be multi-banked storages.
  • the victim storage 218 may include sixteen memory banks (e.g., sixteen sub-RAMs), each 64 bits wide.
  • the cache controller 220 transmits a write instruction to write all 64 bits of a first bank of the victim storage 218 (e.g., write a 64-bit word starting with the first address of the first bank)
  • the write instruction can be executed without initiating a read instruction.
  • the bank processing logic 303 may detect that such a write of an entire bank is to be performed and, thus, indicate to the cache controller 220 to initiate the read-modify-write operation while negating the need to transmit the read instruction.
  • the write instruction can be implemented without initiating a read instruction.
  • the bank processing logic 303 may detect that such a write of the entirety of multiple banks is to be performed and, thus, indicate to the cache controller 220 to initiate the read-modify-write operation while negating the need to transmit the read instruction.
  • the cache controller 220 may transmit a write instruction to write 130 bits of a first bank, a second bank, and a third bank of the victim storage (e.g., a write instruction indicating to write a 130-bit word starting with the first address of the first bank and ending with the second address of the third bank).
  • the bank processing logic 303 detects that all addresses of the first bank and the second bank of the victim storage 218 are to be written entirely and, thus, indicates to the cache controller 220 to initiate the read-modify-write operations for the first bank and the second bank of the victim storage while negating the need to transmit the read instruction.
  • the bank processing logic 303 may detect that the third bank of the victim storage 218 is to be partially written (e.g., two addresses of the 64 addresses are to be written), and, thus, indicate to the cache controller 220 to initiate a full read-modify-write operation of the third bank of the victim storage 218.
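  • The per-bank decision made by the bank processing logic 303 above can be summarized as a coverage check: a bank whose bits are all covered by the write can be written without the preceding read, while a partially covered bank still needs the full read-modify-write. The sketch below assumes 64-bit banks addressed in bits and uses a hypothetical helper name. With a 130-bit write starting at bit 0, the predicate is true for banks 0 and 1 and false for bank 2, matching the behavior described above.

```c
#include <stdbool.h>
#include <stdint.h>

#define BANK_BITS 64u   /* each bank is 64 bits wide */

/* Returns true when a write of `len` bits starting at bit `start` covers
 * every bit of the given bank, so the read half of the read-modify-write
 * can be skipped for that bank; a partially covered bank (e.g., the last
 * 2 bits of a 130-bit write) still needs a full read-modify-write. */
bool bank_fully_written(uint32_t start, uint32_t len, uint32_t bank)
{
    uint32_t bank_lo = bank * BANK_BITS;
    uint32_t bank_hi = bank_lo + BANK_BITS;   /* exclusive */
    return start <= bank_lo && (start + len) >= bank_hi;
}
```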
  • An example read-modify-write operation is described above.
  • the example victim cache store queue 216 stores a number of write operations at different sections of the victim cache store queue 216 (e.g., at the example latches 1102a-e). For example, when the CPU 102 transmits three separate write operations in a row, the first write operation that the CPU 102 provided is stored at the first latch 1102b and moved to the second latch 1102c when the second operation is received at the first latch 1102b.
  • the first latch 1102b will store and/or output the last write operation with respect to time (e.g., which is last to be stored in the victim storage 218)
  • the second latch 1102c will have the second write operation (e.g., which is second to be stored in the example victim storage 218)
  • the third latch 1102d will have the first write operation (e.g., which was the first to be stored in the example victim storage 218).
  • the example arbitration manager 1114 reserves a cycle for the data to be written into the example victim storage 218. Accordingly, during the reserved cycle, the victim storage 218 may not be available to perform read operations.
  • in some examples, the data operations stored in two or more of the latches 1102b, 1102c, 1102d correspond to the same memory address
  • in such examples, the data can be merged in order to write the data into the memory address of the victim storage 218 once, instead of two or three times. For example, if the write operation stored in the latch 1102d corresponds to writing a byte of the memory address and the write operation stored in the latch 1102c corresponds to writing a different byte to the memory address, the second write will overwrite the first write.
  • the victim cache store queue 216 merges the two writes into one write, so that only one cycle is used to write the second transaction (e.g., to avoid reserving a cycle for the first write).
  • Such an aggressive merge reduces the number of cycles reserved for write operations. In this manner, the victim storage 218 will have extra cycles to perform read operations, thereby decreasing the latency of the overall system.
  • the outputs of the example latches 1102b-1102d are coupled to the example merging circuits 1103a-1103c.
  • the output of the third latch 1102d may be coupled to the merging circuit 1103a
  • the output of the second latch 1102c may be coupled to the merging circuit 1103b
  • the output of the first latch 1102b may couple to the merging circuit 1103c.
  • the output of the merging circuit 1103a may additionally be coupled to the output of the second latch 1102c and the merging circuit 1103b
  • the merging circuit 1103b may be coupled to the merging circuit 1103c
  • the merging circuit 1103c may be coupled to the input of the first latch 1102b.
  • the example merging circuits 1103a-c include example comparator(s) 1120 and example switches 1122.
  • the comparator(s) 1120 compare the memory address locations for each write operation that is stored in the respective latches 1102b-1102d to determine whether any of the write operations in the example store queue correspond to the same memory address.
  • the example comparator 1120 may be one comparator to compare all the write operations of the latches 1102b-1102d or may be separate comparators 1120 to compare two of the latches 1102b-d (e.g., a first comparator to compare the memory address of latch 1102b to the memory address of latch 1102c, a second comparator to compare the memory address of latch 1102b to the memory address of latch 1102d, etc.).
  • the comparator(s) 1120 output the results of the comparisons (e.g., with one or more signals corresponding to the one or more comparisons) to the example switch(es) 1122 and/or the arbitration manager 1114. If the example arbitration manager 1114 receives a signal indicative of a match, the arbitration manager 1114 will not reserve the cycle for a first write operation while the first write operation is merged with a second write operation to the same memory location (e.g., to free up cycles for other cache operations).
  • the example switch(es) 1122 reroute the write operations in the example latches 1102b-1102d based on the comparison. For example, if the memory address of the write operation stored in the example latch 1102d is the same as the memory address stored in the latch 1102c, the example switch(es) 1122 enable and/or disable signal paths to reroute the output of the latch 1102d to the latch 1102c, instead of routing it to the example arbitration manager 1114. In this manner, the two write operations are combined and written into the victim storage 218 in a subsequent cycle as a single write operation instead of two write operations.
  • the switch(es) 1122 may be electrical switches, transistors (e.g., MOSFETS), demultiplexers, and/or any other component that can reroute a signal in a circuit.
  • a MUX of one of the merging circuits 1103a-c performs a merging protocol for the one or more rerouted write operations that prioritizes the newest write operation. For example, if the comparator(s) 1120 determines that the write operation stored in the example latch 1102c corresponds to the same memory address as the write operation stored in the example latch 1102d, the switch(es) 1122 reroute the write operation stored in the example latch 1102d to the latch 1102c.
  • the example merging circuit 1103a merges the two write operations to keep the write data stored in latch 1102c (e.g., the write to byte0 and byte2) and include the write data from latch 1102d that doesn't overlap (e.g., byte1).
  • the write data of byte0 from the latch 1102d is discarded because the data to be written at byte0 from the latch 1102d will be overwritten by the write instruction of the latch 1102c.
  • the merged data corresponds to the write data for byte0 from latch 1102c, the write data for byte1 from latch 1102d, and the write data for byte2 from the latch 1102c.
  • the merged write data from the latch 1102c may be manipulated (e.g., via one of the example blocks 1104, 1106, 1108) and/or pushed to the next latch 1102d to be stored in the example victim storage 218 during a subsequent cycle.
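  • The merge described above (byte0 and byte2 taken from the newer write in latch 1102c, byte1 taken from the older write in latch 1102d) amounts to combining two byte-enable masks with newer-wins priority; the struct below is a hypothetical model of a store-queue entry, not the actual latch contents.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical store-queue entry: data plus per-byte write enables. */
typedef struct {
    uint64_t data;
    uint8_t  byten;   /* bit i set => byte lane i is written */
} sq_entry_t;

/* Merge an older write into a newer write to the same address: byte lanes
 * present in the newer entry win; non-overlapping lanes of the older entry
 * are kept, freeing one store-queue slot and one write cycle. */
sq_entry_t merge_writes(sq_entry_t newer, sq_entry_t older)
{
    sq_entry_t out = newer;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t bit = (uint8_t)(1u << lane);
        if ((older.byten & bit) && !(newer.byten & bit)) {
            uint64_t mask = 0xFFull << (8 * lane);
            out.data  = (out.data & ~mask) | (older.data & mask);
            out.byten |= bit;
        }
    }
    return out;
}

int main(void)
{
    sq_entry_t newer = { 0x0000000000AA00BBull, 0x05 };  /* byte0, byte2 */
    sq_entry_t older = { 0x000000000000CC00ull, 0x03 };  /* byte0, byte1 */
    sq_entry_t m = merge_writes(newer, older);
    printf("data=0x%016llx byten=0x%02x\n",
           (unsigned long long)m.data, m.byten);  /* byte1 kept from older */
    return 0;
}
```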
  • An example hardware implementation of the merging protocol is further described above in conjunction with FIG. 4C .
  • Atomic operations are a further example of multi-part memory operations.
  • an atomic compare and swap operation manipulates a value stored in the memory location based on the results of a comparison of the existing value stored at the memory location.
  • the CPU 102 may want to replace the data stored in the L1 cache 110 with a new value if the existing value stored in the L1 cache 110 matches a specific value.
  • when a CPU wanted to perform an atomic operation, the CPU sent a read operation to a memory address, performed the manipulation on the read data, and then executed a write operation to the same memory address to store the manipulated data.
  • the L1 cache may need to pause, reject, block, and/or halt any transactions from other devices (e.g., other cores of the CPU, higher level cache, the extended memory, etc.) until the atomic operation was complete (e.g., to avoid manipulation of the memory address corresponding to the atomic operation during the atomic operation). Accordingly, such example techniques may require significant effort on behalf of the CPU and many reserved cycles that increase latency.
  • the example victim cache store queue 216 handles atomic operations in conjunction with the read modify write structure.
  • the example CPU 102 can send a single atomic operation to the L1 cache 110, and the victim cache store queue 216 handles the atomic data manipulation and writing operation.
  • the CPU 102 utilizes a single cycle to execute an atomic operation and can use the other cycles (e.g., used in atomic protocols) to perform other functions, thereby reducing the latency of the overall computing system 100.
  • when the CPU 102 transmits an atomic operation and/or an atomic compare and swap operation to increment and/or swap the data at a memory address by a value of 1, for example, the atomic instruction is received by the latch 1102a and the tag RAM 210 verifies whether the memory address is stored in the example victim storage 218. If the memory address is stored in the example victim storage 218, the tag RAM 210 instructs the example victim storage 218 to output the data at the memory address while the atomic instructions are passed to the example latch 1102b. While the victim storage 218 outputs the data to the latch 324a, the example latch 1102b outputs the atomic operation to the latch 1102c.
  • the ECC logic 312 performs error detection and/or correction protocol as described above, and the data from the memory address location is forwarded to the example arithmetic component 1104 (e.g., for atomic operations) or the atomic compare component 1106 (e.g., for the atomic compare and swap operations).
  • the arithmetic component 1104 obtains the atomic operation (e.g., including data identifying how to manipulate the data) and/or the atomic compare and swap 1106 obtains the atomic compare and swap operation (e.g., including a key and data to be written if the key matches read data) from the latch 1102c and obtains the data from the corresponding memory address from the output of the ECC logic 312.
  • the arithmetic component 1104 performs the manipulation to the data (e.g., increments the data by 1) and/or the atomic compare component 1106 performs the swap (e.g., replaces the data if the read data matches a key) and outputs the manipulated and/or swapped-in data for the corresponding memory address (e.g., the atomic result) to the example latch 1102d via the example MUX 1110 (e.g., which is enabled via the cache controller 220).
  • the latch 1102d outputs the new data corresponding to the memory address to the ECC generator 1112 to generate the ECC bit, and the arbitration manager 1114 writes the new data (e.g., the atomic result and/or atomic compare and swap result) to the memory address in conjunction with the ECC bit in the example victim storage 218. Additionally or alternatively, the corrected value out of the ECC generator 1112 is returned to the CPU 102. Thus, the atomic operation is performed with only one instruction from the CPU 102.
  • the atomic compare component 1106 and/or the arithmetic component 1104 have several inputs.
  • the atomic compare component 1106 receives (e.g., obtains) the type of atomic operation to perform (e.g., atomic compare and swap, or atomic swap), the new data to swap in, the ECC corrected data read out of the cache 310, and the size of the data to be manipulated during the atomic operation (e.g., 32-bit or 64-bit).
  • the atomic compare component 1106 receives an atomic compare and swap operation and the arithmetic component 1104 receives an atomic operation.
  • the atomic compare component 1106 compares the comparison value (e.g., a key) provided by the CPU 102 against the ECC data 310. On a match, the new data is swapped in place of the old data (e.g. ECC data 310) and output to the MUX 1110. The size of the new data swapped-in is determined by cas_acc_sz input (e.g. 32-bit or 64-bit). In the example circuit implementation 450 of FIG. 4C , the atomic compare component 1106 may also receive an atomic swap operation.
  • for an atomic swap operation, the atomic compare component 1106 will swap in the new data, replacing the ECC data 310 regardless of the comparison result, and output the new value to the MUX 1110, while the old data from the address is read from the main storage 214 and is provided back to the CPU 102.
  • the size of the new data swapped-in is determined by cas_acc_sz input (e.g. 32-bit or 64-bit).
  • the arithmetic component 1104 may also receive an atomic operation. The arithmetic component 1104 will manipulate the ECC data 310 and store the manipulated data in the main storage element 214.
  • the size of the new data swapped-in is determined by cas_acc_sz input (e.g. 32-bit or 64-bit).
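  • A behavioral sketch of the compare/swap path described above: for an atomic compare and swap, the new data replaces the stored data only when the stored data matches the CPU-provided key; for an atomic swap, the new data is written unconditionally; in both cases the old data is returned to the CPU. The access-size parameter mirrors the cas_acc_sz input, and the function and type names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum { ATOMIC_CAS, ATOMIC_SWAP } atomic_op_t;

/* Behavioral model: `stored` points at the ECC-corrected data read from the
 * cache, `key` is the comparison value from the CPU, `new_data` is the data
 * to swap in, and `acc_sz_bits` (32 or 64) mirrors the cas_acc_sz input.
 * Returns the old data (sent back to the CPU) and updates *stored on a swap. */
uint64_t atomic_compare_swap(atomic_op_t op, uint64_t *stored,
                             uint64_t key, uint64_t new_data,
                             unsigned acc_sz_bits)
{
    uint64_t mask = (acc_sz_bits == 32) ? 0xFFFFFFFFull : ~0ull;
    uint64_t old  = *stored & mask;
    bool do_swap  = (op == ATOMIC_SWAP) || (old == (key & mask));
    if (do_swap)
        *stored = (*stored & ~mask) | (new_data & mask);
    return old;   /* old value is provided back to the CPU */
}
```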
  • a histogram operation is where the CPU 102 wants to know how many of each value is present in a section of the victim storage 218 (e.g., an SRAM line from the SRAM portion of the victim storage 218).
  • for example, if an SRAM line has 6 bins with the first bin storing 0, the second bin storing 0, the third bin storing 2, the fourth bin storing 0, the fifth bin storing 0, and the sixth bin storing 3, a histogram of the SRAM line may correspond to [0, 0, 2, 0, 0, 3]. Alternatively, the histogram may be structured in a different manner (e.g., [3, 0, 0, 2, 0, 0]). In some example systems, to perform a histogram function, the CPU has to read each individual value and increment for each value.
  • for example, to determine how many 0s are in a 10-byte SRAM line with 10 bins, the CPU will perform 10 reads. Then, to determine how many 1s are in the same 10-byte SRAM line with 10 bins, the CPU will perform an additional 10 reads.
  • in such examples, the CPU performs (N)(M) reads, where N is the size of the section of memory (e.g., 10 bytes) being read and M is the number of values that could be stored in each byte.
  • the L1 SRAM may have to block, pause, halt, discard, etc. all other read and/or write operations until the histogram operation is complete.
  • the CPU 102 instructs the victim storage 218 to perform the histogram operation, thereby changing the number of cycles that the CPU 102 has to reserve for the operation from (N)(M) to 1. Also, because the atomic operation protocol is already implemented in the store queue, the histogram operation can be performed using the arithmetic component 1104 by performing N reads for the N-sized section of memory and incrementing a count for each value in the example victim cache store queue 216, thereby reducing the number of read operations from (N)(M) operations to N operations.
  • the operation is stored in the example latch 1102a while the tag RAM 210 verifies whether the memory address corresponding to the histogram operation is available in the victim storage 218.
  • the example cache controller 220 facilitates the read operation for each byte of the section identified in the histogram operation (e.g., where histogram bins are accessed in parallel by reading up to 128 Bytes at the same time). If available, the tag RAM 210 instructs the victim storage 218 to output the data at a first byte of the section of the victim storage 218 while the histogram operation is output by the example latch 1102a to the example latch 1102b.
  • when the example victim storage 218 outputs the data that has been read from the memory address to the example latch 324a, the latch 1102b outputs the histogram operation to the example latch 1102c. After the ECC logic 312 performs the error detection and correction functionality, the data read at the byte is sent to the example arithmetic component 1104.
  • after receiving the read value from the ECC logic 312 and the histogram instructions from the latch 1102c, the arithmetic component 1104 initiates data representative of the histogram. For example, the arithmetic component 1104 may initiate a vector (e.g., representing a histogram) with an initial value (e.g., zero) for each possible value that could be stored in the bytes of the victim storage. The arithmetic component 1104 increments the value of the vector based on the output of the ECC logic 312 (e.g., the read byte). For example, if the read value of the byte is 0, the arithmetic component 1104 increments the value corresponding to 0 in the vector.
  • the resulting vector corresponds to a histogram of the values that were read in the corresponding sections of SRAM in parallel. Because a value of the histogram is incremented for each bit, the resulting vector is a histogram of the values stored in the section of memory identified in the histogram operation from the CPU 102. In some examples, the arithmetic component 1104 may increment in parallel by some weighted value (e.g., 1.5).
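  • As a behavioral sketch of the flow described above: each of the N bytes in the identified section is read once and the corresponding bin of a result vector is incremented (by 1 in the basic case, or by a weight in the weighted variant). The array sizes and names below are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Build a histogram of the byte values in one section of storage using N
 * reads: one read per byte, incrementing the bin for that byte's value. */
void histogram_section(const uint8_t *section, size_t n_bytes,
                       uint32_t bins[256], uint32_t weight)
{
    for (size_t v = 0; v < 256; v++)
        bins[v] = 0;                    /* initialize the result vector */
    for (size_t i = 0; i < n_bytes; i++)
        bins[section[i]] += weight;     /* weight is 1 in the basic case */
}
```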
  • the example histogram is input to the example priority multiplexer 1118 (e.g., controlled by the example pending store address data store 1116) to be input to the MUX 316 via the example latch 1102e.
  • the example cache controller 220 controls the MUX 316 to output the final histogram vector to the example CPU interface 202 via the multiplexer circuit 314 and the example latch 322b, thereby ending the histogram operation.
  • the L1 cache 110 supports functionality where a histogram bin can saturate after the histogram bin includes more than a threshold limit of the bin size (e.g., a byte, a halfword, a word, etc.).
  • Table 1 illustrates an example of saturation values. Using this functionality, the histogram bin values will not roll over once they reach the maximum value.
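  • Table 1 is not reproduced here, but the saturation behavior it summarizes can be sketched as a clamped add: once a bin reaches the maximum value representable in its configured size (byte, halfword, or word), further increments leave it at that maximum instead of rolling over. The helper below is an illustrative model only.

```c
#include <stdint.h>

/* Example bin maxima for byte, halfword, and word sized histogram bins. */
#define BIN_MAX_BYTE      0xFFu
#define BIN_MAX_HALFWORD  0xFFFFu
#define BIN_MAX_WORD      0xFFFFFFFFu

/* Saturating increment of a histogram bin: clamp at the maximum value for
 * the configured bin size instead of wrapping around (assumes bin <= bin_max). */
uint32_t saturating_add(uint32_t bin, uint32_t inc, uint32_t bin_max)
{
    return (inc > bin_max - bin) ? bin_max : bin + inc;
}
```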
  • Example 2 includes the system of example 1, wherein the first cache storage is a main storage and the second cache storage is a victim storage.
  • Example 3 includes the system of example 1, wherein the arithmetic component is to obtain (a) the second set of data from the second cache storage via an error detection and correction circuit and (b) the memory operation from a central processing unit via a latch.
  • Example 4 includes the system of example 1, wherein the third set of data is stored in the second cache storage with a single instruction from a central processing unit at a single cycle.
  • Example 5 includes the system of example 1, further including a modified, exclusive, shared, invalid (MESI) component to determine a state of a memory address included in the memory operation, and an interface to, if the memory address included in the memory operation corresponds to a shared state, send miss instructions to another cache, the miss instructions including an exclusive state request.
  • MESI modified, exclusive, shared, invalid
  • Example 6 includes the system of example 5, wherein the arbitration manager is to store the third set of data at the memory address in the second cache storage after the exclusive state request has been granted from the other cache.
  • Example 7 includes the system of example 5, wherein the arithmetic component is to receive the second set of data from the second cache storage after the exclusive state request has been granted from the other cache.
  • Example 8 includes the system of example 5, wherein the second cache storage and the first cache storage are connected in parallel to a central processing unit.
  • Example 9 includes the system of example 5, wherein the memory operation is an atomic operation.
  • Example 10 includes a storage queue comprising an arithmetic component to receive a second set of data from a cache storage in response to a memory operation, and perform an arithmetic operation on the second set of data to produce a third set of data, and an arbitration manager to store the third set of data in the cache storage.
  • Example 11 includes the storage queue of example 10, wherein the cache storage is a victim cache storage, the victim cache storage storing data that has been removed from a main cache storage.
  • Example 13 includes the storage queue of example 10, wherein the arithmetic component is to obtain (a) the second set of data from the cache storage via an error detection and correction circuit and (b) the memory operation from a central processing unit via a latch.
  • Example 15 includes the storage queue of example 10, wherein the arbitration manager is to store the third set of data at a memory address in the cache storage after an exclusive state request has been granted from another cache.
  • Example 16 includes the storage queue of example 15, wherein the arithmetic component is to receive the second set of data from the cache storage from the memory address after the exclusive state request has been granted from the other cache.
  • Example 17 includes the storage queue of example 10, wherein the memory operation is an atomic operation.
  • Example 18 includes a method comprising obtaining a second set of data from a cache storage in response to a memory operation, performing an arithmetic operation on the second set of data to produce a third set of data, and storing the third set of data in the cache storage.
  • Example 19 includes the method of example 18, wherein the cache storage is a victim cache storage, the victim cache storage storing data that has been removed from a main storage.
  • Example 20 includes the method of example 19, further including storing the third set of data at a memory address in the cache storage after an exclusive state request has been granted from another cache.
  • Atomic Compare and Swap Support in L1 in Victim Cache for Coherent System
  • the example MESI RAM 300 tracks the state of the data stored in the victim storage 218 to be able to avoid issues with mismatched data in different caches that correspond to the same memory address.
  • if the CPU 102 transmits a read operation, the example MESI RAM 300 changes the state of the memory address to shared, because the data in the memory address will not be manipulated. If the CPU 102 transmits a write operation, the example MESI RAM 300 changes the state of the memory address to exclusive, because the data in the memory address will be manipulated and the victim storage 218 needs write permission for the address. After the data in the memory address is written to the victim storage 218, the MESI RAM 300 updates the state of the memory address to modified (e.g., indicating that the memory address has been modified).
  • the data from a memory address is read from the victim storage 218 and provided to the victim cache store queue 216 to be updated (e.g., incremented) and written back into the victim storage 218.
  • if the MESI RAM 300 has identified the state of the corresponding memory address as the shared state, the write operation of the atomic protocol may cause problems with other level caches (e.g., because the write will cause a mismatch of data in different caches).
  • the example cache controller 220 marks cache hits that correspond to a shared state as a cache miss. In this manner, the cache controller 220 can instruct the L2 interface 228 to send the cache miss to the higher level cache with an exclusive state request. In this manner, the higher level cache can grant the exclusive state to the L1 cache 110 and the L1 cache 110 can perform the read and write operation as part of the atomic operation in response to receiving the granted exclusive state.
  • the example atomic operation logic 1106 will instruct the MESI RAM 300 to tag the data as modified.
  • the received data from the L2 cache 112 is transmitted into the victim cache store queue 216 to be stored in the victim storage 218. Because the operation was an atomic operation (e.g., a regular atomic operation or an atomic compare and swap) or a histogram protocol, the data from the higher level cache is manipulated by the example arithmetic component 1104 and/or the example atomic compare component 1106 and stored in the example victim storage 218 via the example ECC generator 1112 and the example arbitration manager 1114.
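  • As an illustration of how a shared-state hit is demoted to a miss and sent upstream with an exclusive state request, the following is a minimal C sketch; the MESI state names, the lookup structure, and the function name are assumptions made for illustration and are not taken from the described hardware.
    #include <stdbool.h>

    /* Hypothetical MESI states and tag-lookup result. */
    typedef enum { STATE_INVALID, STATE_SHARED, STATE_EXCLUSIVE, STATE_MODIFIED } mesi_state_t;

    typedef struct {
        bool         hit;   /* tag compare result              */
        mesi_state_t state; /* MESI state tracked for the line */
    } lookup_t;

    /* For an atomic (read-modify-write) access, a hit on a line held in the
     * shared state cannot be written locally without breaking coherence, so
     * it is treated as a miss and sent upstream with an exclusive request. */
    bool requires_exclusive_request(lookup_t l, bool is_atomic_write)
    {
        if (!l.hit)
            return true;                     /* true miss: fetch with ownership */
        if (is_atomic_write && l.state == STATE_SHARED)
            return true;                     /* hit demoted to a miss           */
        return false;                        /* service locally                 */
    }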
  • While an example manner of implementing the L1 data cache 110 of FIG. 1 is illustrated in FIGS. 2-5 and/or 10-11, one or more of the elements, processes and/or devices illustrated in FIGS. 2-5 and/or 10-11 may be combined, divided, re-arranged, omitted, eliminated and/or implemented in any other way.
  • Any of the elements of FIGS. 2-5 and/or 10-11 could be implemented by one or more analog or digital circuit(s), logic circuits, programmable processor(s), programmable controller(s), graphics processing unit(s) (GPU(s)), digital signal processor(s) (DSP(s)), application specific integrated circuit(s) (ASIC(s)), programmable logic device(s) (PLD(s)) and/or field programmable logic device(s) (FPLD(s)).
  • The example L1 data cache 110 of FIG. 1 may include one or more elements, processes and/or devices in addition to, or instead of, those illustrated in FIGS. 2-5 and/or 10-11, and/or may include more than one of any or all of the illustrated elements, processes and devices.
  • the phrase "in communication," including variations thereof, encompasses direct communication and/or indirect communication through one or more intermediary components, and does not require direct physical (e.g., wired) communication and/or constant communication, but rather additionally includes selective communication at periodic intervals, scheduled intervals, aperiodic intervals, and/or one-time events.
  • A flowchart representative of example hardware logic, machine readable instructions, hardware implemented state machines, and/or any combination thereof for implementing the L1 data cache of FIGS. 1-5 and/or 10-11 is shown in FIGS. 12-33.
  • the machine readable instructions may be one or more executable programs or portion(s) of an executable program for execution by a computer processor such as the processor 3412 shown in the example processor platform 3400 described below in connection with FIG. 34 .
  • the program may be embodied in software stored on a non-transitory computer readable storage medium such as a CD-ROM, a floppy disk, a hard drive, a DVD, a Blu-ray disk, or a memory associated with the processor 3412, but the entire program and/or parts thereof could alternatively be executed by a device other than the processor 3412 and/or embodied in firmware or dedicated hardware.
  • Although the example program is described with reference to the flowchart illustrated in FIG. 34, many other methods of implementing the example L1 cache 110 may alternatively be used. For example, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, or combined.
  • the machine readable instructions described herein may be stored in one or more of a compressed format, an encrypted format, a fragmented format, a compiled format, an executable format, a packaged format, etc.
  • Machine readable instructions as described herein may be stored as data (e.g., portions of instructions, code, representations of code, etc.) that may be utilized to create, manufacture, and/or produce machine executable instructions.
  • the machine readable instructions may be fragmented and stored on one or more storage devices and/or computing devices (e.g., servers).
  • the machine readable instructions may require one or more of installation, modification, adaptation, updating, combining, supplementing, configuring, decryption, decompression, unpacking, distribution, reassignment, compilation, etc.
  • the machine readable instructions may be stored in multiple parts, which are individually compressed, encrypted, and stored on separate computing devices, wherein the parts when decrypted, decompressed, and combined form a set of executable instructions that implement a program such as that described herein.
  • the machine readable instructions may be stored in a state in which they may be read by a computer, but require addition of a library (e.g., a dynamic link library (DLL)), a software development kit (SDK), an application programming interface (API), etc. in order to execute the instructions on a particular computing device or other device.
  • a library e.g., a dynamic link library (DLL)
  • SDK software development kit
  • API application programming interface
  • the machine readable instructions may need to be configured (e.g., settings stored, data input, network addresses recorded, etc.) before the machine readable instructions and/or the corresponding program(s) can be executed in whole or in part.
  • the described machine readable instructions and/or corresponding program(s) encompass such machine readable instructions and/or program(s) regardless of the particular format or state of the machine readable instructions and/or program(s) when stored or otherwise at rest or in transit.
  • the machine readable instructions described herein can be represented by any past, present, or future instruction language, scripting language, programming language, etc.
  • the machine readable instructions may be represented using any of the following languages: C, C++, Java, C#, Perl, Python, JavaScript, HyperText Markup Language (HTML), Structured Query Language (SQL), Swift, etc.
  • the example processes of FIGS. 12-33 may be implemented using executable instructions (e.g., computer and/or machine readable instructions) stored on a non-transitory computer and/or machine readable medium such as a hard disk drive, a flash memory, a read-only memory, a compact disk, a digital versatile disk, a cache, a random-access memory and/or any other storage device or storage disk in which information is stored for any duration (e.g., for extended time periods, permanently, for brief instances, for temporarily buffering, and/or for caching of the information).
  • a non-transitory computer readable medium is expressly defined to include any type of computer readable storage device and/or storage disk and to exclude propagating signals and to exclude transmission media.
  • the phrase "A, B, and/or C" refers to any combination or subset of A, B, C such as (1) A alone, (2) B alone, (3) C alone, (4) A with B, (5) A with C, (6) B with C, and (7) A with B and with C.
  • the phrase "at least one of A and B" refers to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.
  • the phrase "at least one of A or B" refers to implementations including any of (1) at least one A, (2) at least one B, and (3) at least one A and at least one B.
  • FIG. 12 is an example flowchart representative of example machine readable instructions 1200 that may be executed by the example L1 cache 110 of FIGS. 1-5 to perform write miss caching in the example victim storage 218 as described above. Although the instructions of FIG. 12 are described in conjunction with the L1 cache 110 of FIGS. 1-5 , the instructions may be described in conjunction with any type of storage in any type of cache.
  • the example cache controller 220 allocates a section of the victim storage 218 for write miss information (e.g., a write miss cache).
  • the write miss information corresponds to when the CPU 102 transmits write instructions to the example L1 cache 110 to a memory address that is not stored in the storages 214, 218 of the L1 cache 110 (e.g., so the write instructions are sent to higher level cache for execution).
  • the cache controller 220 accesses the output of the example hit/miss logic 304, 306 to determine if a current write operation from the CPU 102 (e.g., received by the cache controller 220) resulted in a write miss (e.g., the memory address from the write operation is not stored in the example storages 214, 218).
  • the hit miss comparison logic 304 may transmit a write miss to the example victim storage 218. In such an example, the victim storage 218 discards the write miss information because the operation hit the victim storage 218.
  • the cache controller 220 determines that a current write operation from the CPU 102 did not result in a write miss (block 1204: NO), control returns to block 1204 until a write operation results in a write miss. If the cache controller 220 determines that a current write operation from the CPU 102 results in a write miss (block 1204: YES), the example cache controller 220 determines if the write miss information corresponds to the same memory address as any write miss information already stored in the allocated section (block 1206).
  • the cache controller 220 determines that the write miss information corresponds to the same memory address as any write miss information already stored in the allocated section (block 1206: YES), the cache controller 220 instructs the example victim storage 218 to merge the write miss information with the stored write miss information corresponding to the same memory address (block 1208).
  • the example victim storage 218 merges the two write miss information by overwriting the older write miss information with the most recent write miss information when the most recent write miss information overlaps (e.g., corresponds to the same bytes as) the older write miss information (e.g., discarding the older write miss information that overlaps the more recent write miss information) and maintaining the older write miss information that does not overlap the more recent write miss information.
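  • The byte-level merge just described can be modeled in software as follows; this is a minimal C sketch in which the entry layout, the 8-byte width, and the byte-enable mask are assumptions chosen only for illustration.
    #include <stdint.h>

    /* Hypothetical write miss entry: an address, 8 data bytes, and a
     * byte-enable mask marking which bytes the CPU actually wrote. */
    typedef struct {
        uint64_t addr;
        uint8_t  data[8];
        uint8_t  byte_en;   /* bit i set => data[i] is valid */
    } wmiss_entry_t;

    /* Merge a newer write miss into an older one for the same address:
     * newer bytes overwrite overlapping older bytes, and older bytes that
     * do not overlap are kept. */
    void merge_write_miss(wmiss_entry_t *older, const wmiss_entry_t *newer)
    {
        for (int i = 0; i < 8; i++) {
            if (newer->byte_en & (1u << i))
                older->data[i] = newer->data[i];   /* newest data wins     */
        }
        older->byte_en |= newer->byte_en;          /* union of valid bytes */
    }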
  • the cache controller 220 determines that the write miss information does not correspond to the same memory address as any write miss information already stored in the allocated section (block 1206: NO), the cache controller 220 stores, in the example victim storage 218, the write miss information in the allocated section (block 1210).
  • the example cache controller 220 determines if more than a threshold amount of write miss information has been stored in the allocated section.
  • the threshold amount may be set to the size of the victim cache (e.g., the threshold is satisfied when the allocated section is full), the size of the L2 interface 228 (e.g., if the L2 interface has a 64 byte bandwidth, then the threshold is set to 64 bytes), and/or any other amount set by a user and/or manufacturer. If the example cache controller 220 determines that more than the threshold (e.g., a first threshold) amount of write miss information has not been stored in the allocated section (block 1212: NO), control returns to block 1204.
  • the cache controller 220 selects a threshold (e.g., a second threshold) amount of write miss information (e.g., the N oldest write miss information stored in the allocated section where N corresponds to the threshold) from the allocated section of the victim storage 218 (block 1214).
  • the second threshold may correspond to (e.g., be the same as) the first threshold and/or may correspond to the bandwidth of the L2 interface 228 (e.g., if the bandwidth of the L2 interface 228 is 64 bytes, then no more than 64 bytes of write miss data is selected).
  • the cache controller 220 may proceed to block 1210 when a threshold amount of time has occurred.
  • the cache controller 220 causes the example victim storage 218 to remove the selected write miss information from the allocated section.
  • the example L2 cache interface 228 transmits the selected write miss information to the higher level cache (e.g., the L2 cache 112). As described above, sending multiple pieces of write miss information together to utilize more of the bandwidth of the L2 interface 228 results in a more efficient system.
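  • The threshold-based draining of the allocated write miss section can be sketched as below; the entry size, the 64-byte interface width, and the structure names are assumptions made for illustration and do not come from the described circuit.
    #include <stddef.h>

    #define L2_IF_WIDTH_BYTES 64   /* assumed interface width per transfer */
    #define ENTRY_BYTES        8   /* assumed bytes of data per miss entry */

    typedef struct {
        size_t count;   /* entries held in the allocated section, oldest first */
    } wmiss_buffer_t;

    /* Decide how many of the oldest write miss entries to send in one burst:
     * nothing until a full interface width of data has accumulated, then as
     * many entries as fit in a single transfer. */
    size_t entries_to_drain(const wmiss_buffer_t *b)
    {
        size_t queued = b->count * ENTRY_BYTES;
        if (queued < L2_IF_WIDTH_BYTES)
            return 0;                              /* threshold not reached */
        return L2_IF_WIDTH_BYTES / ENTRY_BYTES;    /* oldest entries first  */
    }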
  • FIG. 13 is an example flowchart representative of example machine readable instructions 1300 that may be executed by the example L1 cache 110 of FIGS. 1-5 to facilitate a read-modify-write operation, in conjunction with the above Section 2.
  • the main cache store queue 212 obtains a write instruction transmitted by the CPU 102 (e.g., transmitted through the cache controller 220) indicating byte(s) of a word, or an entire word, to be re-written. (Block 1302).
  • the write port 426 may obtain the write instruction transmitted by the CPU 102.
  • the main cache store queue 212 transmits the value of the portion of the word to be rewritten to the latch 402b. (Block 1304).
  • the latch 402b transmits the value of the portion of the word to be rewritten to the latch 402c.
  • the main cache store queue 212 stores the address value associated with the location of the portion of the word to be rewritten in the pending store address data store 416. (Block 1306). Also, the tag ram 208 transmits a read instruction (e.g., a read request) of the entire currently stored word to the main storage 214. (Block 1308).
  • the main cache store queue 212 determines whether there has been a subsequent clock cycle of the CPU 102, or the cache controller 220. (Block 1310). In some examples described herein, the latch 402c determines whether there has been a subsequent clock cycle of the CPU 102, or the cache controller 220. In response to determining that there has not been a subsequent clock cycle of the CPU 102, or the cache controller 220, (e.g., the control of block 1310 returns a result of NO), the process waits.
  • the read-modify-write merge component 408 obtains the value of the portion of the word (e.g., the byte) stored in the latch 402c. (Block 1312). Also, the read-modify-write merge component 408 obtains the entire currently stored word transmitted by the ECC logic 310. (Block 1314). In this manner, the read-modify-write merge 408 identifies the address of the byte in the currently stored word to be updated.
  • After the read-modify-write merge component 408 identifies and/or otherwise obtains (a) the value (e.g., byte value, bit value, etc.) of the portion of the currently stored word to be updated from the latch 402c and (b) the currently stored word from the ECC logic 310, the read-modify-write merge component 408 writes (e.g., replaces) the portion of the currently stored word with the value of the portion of the currently stored word obtained from the latch 402c. (Block 1316).
  • the main cache store queue 212 generates error detection code based on the word, the error detection code to be stored with the word. (Block 1318).
  • In some examples described herein, the ECC generator 412 generates the error detection code based on the word, the error detection code to be stored with the word.
  • the control of block 1318 may be performed in response to an additional subsequent clock cycle of the CPU 102, or the cache controller 220.
  • the main cache store queue 212 determines whether an additional write instruction is obtained. (Block 1322). In the event the main cache store queue 212 determines another write instruction is obtained (e.g., the control of block 1322 returns a result of YES), the process returns to block 1302. Alternatively, in the event the main cache store queue 212 determines another write instruction is not obtained (e.g., the control of block 1322 returns a result of NO), the process 1300 may wait until a threshold timeout period occurs, thus ending the process 1300.
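  • The read-modify-write merge of FIG. 13 can be summarized by the following C sketch; the parity-style check bits stand in for the real error detection code, and the 64-bit word width and function names are illustrative assumptions rather than details of the described circuit.
    #include <stdint.h>

    /* Parity-style check bits stand in for the real error detection code. */
    static uint8_t check_bits(uint64_t word)
    {
        uint8_t p = 0;
        for (int i = 0; i < 64; i++)
            p ^= (uint8_t)((word >> i) & 1u);
        return p;
    }

    /* Read-modify-write of one byte inside a stored word: splice the new
     * byte into the word read from storage, regenerate the check bits, and
     * return the merged word to be written back with its new code. */
    uint64_t rmw_byte(uint64_t stored_word, unsigned byte_idx, uint8_t new_byte,
                      uint8_t *new_check)
    {
        uint64_t mask   = 0xFFull << (8u * byte_idx);
        uint64_t merged = (stored_word & ~mask) |
                          ((uint64_t)new_byte << (8u * byte_idx));
        *new_check = check_bits(merged);   /* code covers the merged word */
        return merged;
    }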
  • FIG. 14 is an example flowchart representative of example machine readable instructions 1400 that may be executed by the example L1 cache 110 of FIGS. 1-5 to facilitate a read-modify-write operation for non-aligned writes, in conjunction with the above-Sections 3 and/or 14.
  • the bank processing logic 303 of FIGS. 3A-3D analyzes the write instructions obtained from the CPU 102 (e.g., the write instructions obtained via the CPU interface 202). For example, the bank processing logic 303 may operate as initial processing circuitry to determine the nature of the write instruction.
  • the bank processing logic 303 determines the number of memory banks to be written to when executing the write instruction. (Block 1404). For example, the bank processing logic 303 determines the address locations of the write instruction and, as such, determines the banks of either the main storage 214 or the victim storage 218 that include the corresponding address locations. In response, the bank processing logic 303 determines whether all addresses of a memory bank (e.g., a memory bank included in either the main storage 214 or the victim storage 218) are to be rewritten. (Block 1406).
  • the bank processing logic 303 determines all addresses of a memory bank (e.g., a memory bank included in either the main storage 214 or the victim storage 218) are to be rewritten (e.g., the control of block 1406 returns a result of YES), the bank processing logic 303 indicates to the CPU 102, or the cache controller 220, to execute the write instruction without reading the currently stored values in the memory bank. (Block 1408).
  • the bank processing logic 303 may identify that addresses A0 to A70 are to be rewritten and, thus, determine that the first memory bank (e.g., a memory bank having addresses A0 to A63) is to be rewritten. Thus, such a first memory bank can be rewritten without reading the currently stored values.
  • the bank processing logic 303 determines whether there are additional memory banks to analyze. (Block 1410).
  • the bank processing logic 303 determines whether all memory banks affected by the write instruction have been analyzed. In following the example above, the bank processing logic 303 determines that the memory bank including addresses A64 to A70 have not been analyzed. Thus, in the event the bank processing logic 303 determines that there is an additional memory bank to analyze (e.g., the control of block 1410 returns a result of YES), the process 1400 returns to block 1406. Alternatively, in the event the bank processing logic 303 determines that there are no additional memory banks to analyze (e.g., the control of block 1410 returns a result of NO), the bank processing logic 303 determines whether another write instruction is obtained. (Block 1412).
  • In the event the bank processing logic 303 determines there is another write instruction (e.g., the control of block 1412 returns a result of YES), the process 1400 returns to block 1402. Alternatively, in the event the bank processing logic 303 determines that there is not another write instruction (e.g., the control of block 1412 returns a result of NO), the process 1400 may wait until a threshold timeout period occurs, thus ending the process 1400.
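  • The bank analysis of FIG. 14 reduces to the coverage test sketched below; the 64-byte bank width and the function name are assumptions used only to illustrate the decision (e.g., a write of A0 to A70 fully covers the first bank, so only the second bank needs a read before being written).
    #include <stdbool.h>
    #include <stdint.h>

    #define BANK_BYTES 64u   /* assumed bank width (e.g., addresses A0-A63) */

    /* A write that covers every byte of a bank can be performed without
     * first reading the bank; only a partially covered bank needs a
     * read-modify-write. Addresses are byte addresses, end exclusive. */
    bool bank_needs_read(uint64_t wr_start, uint64_t wr_end, uint64_t bank_base)
    {
        uint64_t bank_end      = bank_base + BANK_BYTES;
        bool     fully_covered = (wr_start <= bank_base) && (wr_end >= bank_end);
        return !fully_covered;
    }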
  • FIG. 15 is an example flowchart representative of example machine readable instructions 1500 that may be executed by the example L1 cache 110 of FIGS. 1-5 to perform an aggressive write merge in the example main cache store queue 212 and/or the victim cache store queue 216, in conjunction with the above sections 4 and/or 15.
  • the instructions of FIG. 15 are described in conjunction with the L1 cache 110 of FIGS. 1-5 , the instructions may be described in conjunction with any type of storage in any type of cache.
  • the instructions of FIG. 15 are described in conjunction with the main cache store queue 212 and the main storage 214. However, the instruction of FIG. 15 can likewise be used in conjunction with the victim cache store queue 216 and the victim storage 218.
  • the example comparator(s) 420 of the example merging circuits 403a-c obtains write instructions from corresponding latches 402b-d. As described above, each of the latches 402b-d includes different write instructions from the CPU 102.
  • the example comparator(s) 420 of the example merging circuits 403a-c compare the memory addresses for the write instructions from the latches 402b-d. For example, the comparator 420 of the merging circuit 403a compares the memory address for the write instructions output by the latch 402d with the write instructions output by the latch 402c.
  • the example comparator(s) 420 determine if any two or more write instructions output by the latches 402b-d correspond to the same memory address. If the comparator(s) 420 determine that any two or more write instructions output by the latches 402b-d do not correspond to the same memory address (block 1506: NO), control continues to block 1520, as further described below.
  • If the comparator(s) 420 determine that any two or more write instructions output by the latches 402b-d correspond to the same memory address (block 1506: YES), control continues to block 1508.
  • the one or more of the example merging circuit 1103a-c that receive(s) write instructions for the same memory address maintain(s) the write data for the byte(s) for the newest write instructions (e.g., the write instructions that were more recently received from the CPU 102) that overlap the write data for the same byte(s) from older write instructions (block 1510).
  • the one or more of the example merging circuit 1103a that receive(s) write instructions for the same memory address update(s) the write data for byte(s) from older write instructions that do not overlap with write data from the newest write instructions. For example, if the merging circuit 1103a, storing a write instruction to write byte0 of a memory address, receives rerouted data from latch 402d, the rerouted data corresponding to a write instruction to write byte0 and byte 1 of the memory address, then the merging circuit 1103a maintains the write instruction to write to byte0 (e.g., discarding the write instruction to write byte0 from latch 402d, because the write instruction is older) and updates the write instruction to write to byte1 corresponding to the instruction from latch 402b (e.g., because the instruction does not overlap with the newest write instructions).
  • the example switch(es) 422 reroute the merged write instructions to the latch with the newest write instructions that also corresponds to the same memory address.
  • the one or more of the merging circuits 403a-c that rerouted write instructions to be merged flag the data that was rerouted.
  • the one or more merging circuits 403a-c may transmit a signal to the example arbitration manager 414 and/or the cache controller 220. In this manner, the arbitration manager 414 and/or the cache controller 220 can avoid reserving a cycle to write the data that has been rerouted into a prior latch for merging.
  • the latch 402b determines if an additional write instruction has been received. If the latch 402d determines that an additional instruction has not been received (block 1520: NO), control returns to block 1520 until additional write instructions are received. If the latch 402d determines that an additional instruction has been received (block 1520: YES), control returns to block 1504.
  • FIG. 16 is an example flowchart representative of example machine readable instructions 1600 that may be executed by the example L1 cache 110 of FIGS. 1-5 to perform an atomic operation, as described above in conjunction with the above Sections 5 and 16.
  • the instructions of FIG. 16 are described in conjunction with the L1 cache 110 of FIGS. 1-5 , the instructions may be described in conjunction with any type of storage in any type of cache.
  • the instructions of FIG. 16 are described in conjunction with the main half of the L1 cache 110 (e.g., the main cache store queue 212, the main storage 214, etc.). However, the instruction of FIG. 16 can likewise be used in conjunction with the victim side of the L1 cache 110 (e.g., the victim cache store queue 216, the victim storage 218, etc.).
  • the cache controller 220 and/or the example latch 402a obtains an atomic operation from the CPU 102.
  • the cache controller 220 and/or the latch 402a sends the memory address for the atomic operation to the example tag RAM 208 to determine whether the data corresponding to the atomic operation is stored in the example main storage 214.
  • the cache controller 220 interfaces with the example hit/miss logic 304 to determine if the memory address corresponding to the atomic operations is stored in the main storage 214.
  • If the cache controller 220 determines that the memory address corresponding to the atomic operation is not stored in the main storage 214 (block 1604: NO), the cache controller 220 interfaces with the example L2 cache interface 228 to submit the atomic miss information to higher level cache (e.g., the L2 cache 112 of FIG. 1) (block 1606).
  • the example L2 cache 112 can return the corresponding data from the memory address corresponding to be stored in the L1 cache 110 to execute the atomic operation.
  • the L2 cache 112 may have the data corresponding to the memory address stored locally or may obtain the data from the L3 cache 114 and/or the extended memory 110 (e.g., via the L3 cache 114).
  • the example arithmetic component 404 of the main cache store queue 212 obtains the data corresponding to the memory address from the L2 cache 112 via the L2 interface 228.
  • the data may be stored in the example main storage 214, read, and input to the example arithmetic component 404.
  • If the cache controller 220 determines that the memory address corresponding to the atomic operation is stored in the main storage 214 (block 1604: YES), the cache controller 220 causes the example arithmetic component 404 to obtain the data corresponding to the memory address of the atomic operation from the main storage 214 (block 1610).
  • cache controller 220 causes the example arithmetic component 404 to perform the atomic operation from the CPU 102 in conjunction with the data from the storage and/or higher level cache that corresponds to the atomic operation. For example, while blocks 1602-1610 occur, the atomic operation is sent to the main cache store queue 212 via the latch 402a.
  • the atomic operation includes the specifics of the operation (e.g., increment, decrement, etc.).
  • the arithmetic component 404 obtains the atomic operation and the data corresponding to the memory address of the atomic operation.
  • the arithmetic component 404 can perform the atomic operation (e.g., increment, decrement, etc.) using the obtained data (e.g., that corresponds to the memory address of the atomic operation).
  • the example cache controller 220 controls the MUX 410 (e.g., via the select line) to ensure that the output of the arithmetic component 404 is output to the example latch 402d.
  • the manipulated data (e.g., incremented data, decremented data, etc.) can be passed to the example ECC generation 412 to generate an ECC code for the manipulated data (block 1616).
  • the example ECC generation 412 outputs an ECC code for the manipulated data, the manipulated data, and the memory address location to the example arbitration manager 414.
  • the cache controller 220 causes the example arbitration manager 414 to store the atomic output (e.g., the manipulated data) in the main storage 214 at the memory address of the atomic operation.
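  • The atomic read-operate-write path of FIG. 16 can be modeled behaviourally as follows; the storage array, the operation set, and the function name are illustrative assumptions, and the ECC and arbitration stages are collapsed into the final write.
    #include <stdint.h>

    typedef enum { ATOMIC_INC, ATOMIC_DEC } atomic_op_t;

    /* A plain array stands in for the cache storage behind the store queue. */
    static uint64_t storage[256];

    /* Behavioural model of the atomic path: the stored value is read, the
     * arithmetic unit applies the operation, and the result is written back
     * to the same location without handing the intermediate value to the CPU. */
    uint64_t do_atomic(unsigned index, atomic_op_t op)
    {
        uint64_t v = storage[index];               /* read current data       */
        v = (op == ATOMIC_INC) ? v + 1u : v - 1u;  /* arithmetic component    */
        storage[index] = v;                        /* write back (ECC elided) */
        return v;
    }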
  • FIG. 17 is an example flowchart representative of example machine readable instructions 1700 that may be executed by the example L1 cache 110 of FIGS. 1-5 to perform a histogram operation, in conjunction with the above Sections 5 and/or 16.
  • the instructions of FIG. 17 are described in conjunction with the L1 cache 110 of FIGS. 1-5 , the instructions may be described in conjunction with any type of storage in any type of cache.
  • the instructions of FIG. 17 are described in conjunction with the main half of the L1 cache 110 (e.g., the main cache store queue 212, the main storage 214, etc.). However, the instruction of FIG. 17 can likewise be used in conjunction with the victim side of the L1 cache 110 (e.g., the victim cache store queue 216, the victim storage 218, etc.).
  • the cache controller 220 and/or the example latch 402b of the main cache store queue 212 and/or the example tag RAM 208 obtains a histogram operation from the CPU 102.
  • the histogram operation includes determining a total number of each value stored in a section of memory (e.g., a SRAM line).
  • the cache controller 220 interfaces with the example hit/miss logic 304 to determine if the memory address corresponding to the histogram operation is stored in the SRAM of the main storage 214. If the cache controller 220 determines that the memory address corresponding to the histogram operation is stored in the SRAM of the main storage 214 (block 1704: YES), control continues to block 1710.
  • If the cache controller 220 determines that the memory address corresponding to the histogram operation is not stored in the SRAM of the main storage 214 (block 1704: NO), the cache controller 220 interfaces with the example L2 interface 228 to transmit the read miss information to higher level cache (e.g., the example L2 cache 112) (block 1706).
  • the cache controller 220 utilizes the example arbitration manager 414 to obtain the read data from the higher level cache via the L2 interface 228 and stores the data corresponding to the memory address of the histogram operation in the SRAM of the main storage 214.
  • the cache controller 220 causes the example arithmetic component 404 to initiate a histogram vector with values representative of counts for values stored in the section of the SRAM of the main storage 214.
  • the cache controller 220 causes the example SRAM of the main storage 214 to output the read values of the bins corresponding to the section of SRAM in parallel.
  • the read values are output to the example arithmetic component 404 via the ECC logic 310.
  • the cache controller 220 utilizes the example arithmetic component 404 to increment one of the elements of the histogram vector based on the read values of the bins. For example, if a read value is '01,' the arithmetic component 404 increments the element that corresponds to the '01' count.
  • the histogram vector is provided to the example MUX 314 via the example MUX 418 and the example latch 402e, and the example MUX 314 outputs the histogram vector to the CPU 102 via the latch 322b and the CPU interface 202 (block 1722).
  • the histogram vector is additionally or alternatively, stored in the example main storage 214 via the ECC generator 412 and the arbitration manager 414.
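  • The histogram protocol of FIG. 17 amounts to the counting loop sketched below; the bin count, the 8-bit bin values, and the function name are assumptions made for illustration only.
    #include <stdint.h>
    #include <stddef.h>

    #define NUM_BINS   64    /* assumed number of bin values read in parallel */
    #define NUM_COUNTS 256   /* one count per possible 8-bit bin value        */

    /* Count how many times each value appears in one storage section: the
     * histogram vector is initialized, then one element is incremented per
     * bin read (e.g., a read value of 0x01 bumps counts[1]). */
    void histogram(const uint8_t bins[NUM_BINS], uint32_t counts[NUM_COUNTS])
    {
        for (size_t i = 0; i < NUM_COUNTS; i++)
            counts[i] = 0;
        for (size_t i = 0; i < NUM_BINS; i++)
            counts[bins[i]]++;
    }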
  • FIGS. 18A and 18B illustrate an example flowchart representative of example machine readable instructions 1800 that may be executed by the example L1 cache 110 of FIGS. 1-5 to perform an atomic compare and swap operation, in conjunction with the above Sections 6 and/or 17.
  • the flowchart may be described in conjunction with any atomic operation or a histogram operation.
  • the instructions of FIGS. 18A and 18B are described in conjunction with the L1 cache 110 of FIGS. 1-5 , the instructions may be described in conjunction with any type of storage in any type of cache.
  • the instructions of FIGS. 18A and 18B are described in conjunction with the main half of the L1 cache 110 (e.g., the main cache store queue 212, the main storage 214, etc.).
  • the instruction of FIGS. 18A and 18B can likewise be used in conjunction with the victim side of the L1 cache 110 (e.g., the victim cache store queue 216, the victim storage 218, etc.).
  • the cache controller 220 and/or the example latch 402b of the example main cache store queue 212 obtains an atomic compare and swap operation with a key from the example CPU 102.
  • the atomic compare and swap compares the data at a memory address to a key and performs a write to the memory address with swap data if the previously stored data at the memory address matches the key.
  • the cache controller 220 interfaces with the example MESI RAM 300 to determine the state of the memory address corresponding to the atomic compare and swap operation.
  • the MESI RAM 300 tracks the states of the memory addresses (e.g., shared, modified, inactive, or exclusive).
  • the cache controller 220 interfaces with the example MESI RAM 300 to determine if the state of the memory address is inactive (e.g., the memory address is not stored in the L1 cache 110) or is shared (e.g., stored in the L1 cache 110 and stored in another higher level cache).
  • If the cache controller 220 determines that the state of the memory address corresponding to the atomic compare and swap is inactive or shared (block 1806: YES), the cache controller 220 causes the example L2 interface 228 to submit the atomic miss information to higher level cache with an exclusive state request (block 1808).
  • the example L2 interface 228 transmits the exclusive state request to let the higher level cache know that the L1 cache 110 will perform an operation on the cache for more than one cycle so that different writes do not occur in different caches to the same memory address.
  • the example cache controller 220 causes the example MESI RAM 300 to change the state of the corresponding memory address to exclusive.
  • the MESI RAM 300 may change the state after submitting the request to higher level cache or after receiving a response from the higher level cache. If the state of the memory address was inactive, the higher level cache will return the data at the memory address, which may be stored in the example main storage 214 and/or input to the example atomic compare component 406.
  • the example cache controller 220 causes the example atomic compare component 406 to obtain the data corresponding to the memory address (e.g., from the example main storage 214 and/or from the higher level cache).
  • If the example cache controller 220 determines that the state of the memory address corresponding to the atomic compare and swap is not inactive or shared (block 1806: NO), the example atomic compare component 406 obtains the data corresponding to the memory address from the main storage 214 (block 1814).
  • the cache controller 220 causes the example atomic compare component 406 to determine whether the obtained data matches the key (e.g., from the atomic compare and swap operation).
  • If the obtained data does not match the key, the cache controller 220 causes the atomic compare component 406 to discard the swap data to be written from the atomic compare and swap (e.g., the data that was to be stored if the obtained data matched the key) (block 1818).
  • Also, the atomic compare component 406 outputs the obtained data to rewrite the obtained data back into the main storage 214.
  • If the obtained data matches the key, the cache controller 220 causes the example atomic compare component 406 to output the swap data to be written to the memory address (e.g., from the atomic compare and swap operation) to the example MUX 410 (Block 1820).
  • the example cache controller 220 controls the MUX 410 (e.g., via the select line) to ensure that the output of the atomic compare component 406 is output to the example latch 402d. Accordingly, the swapped data can be passed to the example ECC generation 412 to generate an ECC code for the swapped data (block 1824).
  • the example ECC generation 412 outputs an ECC code for the swapped data, the swapped data, and the memory address location to the example arbitration manager 414.
  • the cache controller 220 causes the example arbitration manager 414 to store the atomic output (e.g., the manipulated data) in the main storage 214 at the memory address of the atomic operation.
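  • The compare-and-swap behaviour of FIGS. 18A and 18B can be modeled as follows; the storage array and function signature are illustrative assumptions, and the ECC generation and arbitration stages are again collapsed into the write-back.
    #include <stdint.h>
    #include <stdbool.h>

    /* A plain array again stands in for the cache storage. */
    static uint64_t storage[256];

    /* Behavioural model of the compare-and-swap path: the stored value is
     * compared against the key; only on a match is the swap data written,
     * otherwise the original value is rewritten unchanged and the swap data
     * is discarded. */
    bool atomic_compare_swap(unsigned index, uint64_t key, uint64_t swap_data,
                             uint64_t *old_value)
    {
        uint64_t current = storage[index];   /* data read from the storage */
        *old_value = current;
        if (current == key) {
            storage[index] = swap_data;      /* match: write the swap data */
            return true;
        }
        storage[index] = current;            /* no match: rewrite old data */
        return false;
    }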
  • FIG. 19 is an example flowchart representative of example machine readable instructions 1900 that may be executed by the example L1 cache 110 of FIGS. 1-5 to perform in-flight data forwarding and invalidation of write instructions from the CPU 102, in conjunction with the above-Section 7. Although the instructions of FIG. 19 are described in conjunction with the L1 cache 110 of FIGS. 1-5 , the instructions may be described in conjunction with any type of storage in any type of cache.
  • the cache controller 220 issues a read-invalidate operation to the store queue 212 (block 1906). For example, the cache controller 220 sends an operation to the read port 424 of the store queue in response to receiving the memory address of the victim.
  • the read port 424 obtains an address corresponding to victim (block 1908). For example, the cache controller 220 sends the victim address to the store queue 212 when issuing a read-invalidate operation.
  • the data store 416 compares the address of the victim to addresses stored in the data store 416 (block 1910). For example, the data store 416 maintains a log of the addresses associated with each value stored in any of the latches 402a, 402b, 402c, 402d, 402e, and/or any of the merging circuits 403a, 403b, and/or 403g. Also, the data store 416 stores the victim address corresponding to the read-invalidate operation. The data store 416 determines if any of the addresses in the data store 416 match the address of the victim (block 1912). For example, the data store 416 determines if any of the latches 402a-d include values and/or data corresponding to the victim address.
  • the latches 402a-d store outstanding write addresses.
  • the outstanding write corresponds to a write operation that has not been completed (e.g., the data of the write operation has not been fully written into the main storage element 214).
  • the store queue 212 writes data to a location (e.g., a cache line) in the main storage element 214 that an allocation policy selected to be the victim.
  • the priority multiplexer 418 forwards the data corresponding to the matching addresses to the MUX circuit 314 (block 1914). For example, the data store 416 sends the matching address to the priority multiplexer 418. The priority multiplexer 418 selects the data and/or the values stored in the latches 402a-d that store the victim address. The priority multiplexer 418 sends the selected data to the latch 402e to be forwarded to the MUX circuit 314. The MUX circuit 314 sends the data to the victim storage element 218 and/or the L2 cache 112.
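  • The in-flight forwarding and invalidation of FIG. 19 can be sketched as a scan of the pending-store entries; the queue depth, entry layout, and function name are assumptions used only for illustration.
    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define SQ_DEPTH 4   /* assumed number of in-flight store queue entries */

    typedef struct {
        bool     valid;
        uint64_t addr;   /* outstanding write address */
        uint64_t data;   /* outstanding write data    */
    } sq_entry_t;

    /* On a read-invalidate for a victim address, scan the pending stores:
     * any entry matching the victim address is forwarded (so the latest data
     * leaves with the evicted line) and invalidated so it is not written to
     * the storage later. Returns the number of forwarded entries. */
    size_t forward_and_invalidate(sq_entry_t q[SQ_DEPTH], uint64_t victim_addr,
                                  uint64_t forwarded[SQ_DEPTH])
    {
        size_t n = 0;
        for (size_t i = 0; i < SQ_DEPTH; i++) {
            if (q[i].valid && q[i].addr == victim_addr) {
                forwarded[n++] = q[i].data;   /* send with the evicted line */
                q[i].valid = false;           /* drop the outstanding write */
            }
        }
        return n;
    }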
  • the machine readable instructions 2000 begin at block 2002, at which the L1 cache 110 receives read address(es) from interface(s).
  • the L1 cache 110 can receive ADP_ADDR_E2_DP0 from the scalar interface 502 of the CPU 102 of FIG. 1, SNP_ADDR_E2_DP0 from the snoop interface of FIGS. 3 and/or 5, and/or ADP_ADDR_E2_DP1 from the vector interface 504 of the CPU 102 as depicted in FIG. 8B.
  • the L1 cache 110 compares read address(es) to sets of a multi-bank victim cache tag (VCT) random access memory (RAM).
  • the first comparators 850 of FIG. 8B can compare a first read address of ADP_ADDR_E2_DP0 to respective addresses stored in the sets 846 of FIG. 8B .
  • the second comparators 852 of FIG. 8B can compare a second read address of ADP_ADDR_E2_DP1 to respective addresses stored in the sets 846.
  • the L1 cache 110 determines whether at least one of the read address(es) is mapped to one of the sets. For example, one of the first comparators 850 can assert a logic one in response to the read address matching the set 846 that the one of the first comparators 850 is associated with. In other examples, the second comparators 852 can generate HIT_DP1 based on the comparisons. In other examples, the one of the first comparators 850 can generate a logic low in response to the read address not matching the set 846 that the one of the first comparators 850 corresponds to.
  • the L1 cache 110 executes cache hit-miss conversion logic.
  • the first address encoder logic circuit 854 can invoke at least one of the first AND gate 864A, the third comparator 870A, or the fourth comparator 872A of FIG. 8B to convert a cache hit to a cache miss or vice versa in response to example operating conditions.
  • the second address encoder logic circuit 856 can invoke at least one of the second AND gate 864B, the fifth comparator 870B, or the sixth comparator 872B of FIG. 8B to convert a cache hit to a cache miss or vice versa in response to example operating conditions.
  • An example process that may be used to implement block 2010 is described below in connection with FIG. 21 .
  • the L1 cache 110 outputs cache hit address(es) based on the cache hit-miss conversion logic.
  • the first address encoder logic circuit 854 can output HIT_ADDR0 in response to executing cache hit-miss conversion logic.
  • the second address encoder logic circuit 856 can output HIT_ADDR1 in response to executing cache hit-miss conversion logic.
  • the L1 cache 110 determines whether additional read address(es) have been received. If, at block 2014, the L1 cache 110 determines additional read address(es) have been received, control returns to block 2002 to receive the read address(es) from the interface(s). If, at block 2014, the L1 cache 110 determines no additional read address(es) have been received, the example machine readable instructions 2000 of FIG. 20 conclude.
  • FIG. 21 is an example flowchart representative of example machine readable instructions 2100 that may be executed by the example L1 cache 110 of FIGS. 1-5 to execute cache hit-miss conversion logic as described above.
  • the example machine readable instructions 2100 of FIG. 21 can be executed to implement block 2010 of FIG. 20 .
  • the instructions of FIG. 21 are described in conjunction with the L1 cache 110 of FIGS. 1-5 , the instructions may be described in conjunction with any type of storage in any type of cache.
  • the instructions of FIG. 21 are described in conjunction with the victim side of the L1 cache 110 (e.g., the victim cache store queue 216, the victim storage 218, etc.). However, the instructions of FIG. 21 can likewise be used in conjunction with the main half of the L1 cache 110 (e.g., the main cache store queue 212, the main storage 214, etc.).
  • the machine readable instructions 2100 of FIG. 21 begin at block 2102, at which the L1 cache 110 determines whether a new address from a first interface has been written to victim cache in a later pipeline stage.
  • the first decoder 860A can receive VTA_WR_SET0 at the E2 pipeline stage, which can be representative of the scalar interface 502 of FIG. 5 writing an address to the victim storage 218 at the E3 pipeline stage.
  • If, at block 2102, the L1 cache 110 determines that a new address from the first interface is being written to the victim cache in a later pipeline stage, control proceeds to block 2106 to compare the new address to the address of the cache hit. If, at block 2102, the L1 cache 110 determines that a new address from the first interface is not being written to the victim cache in a later pipeline stage, then, at block 2104, the L1 cache 110 determines whether a new address from a second interface has been written to the victim cache in a later pipeline stage.
  • the second decoder 860B can receive VTA_WR_SET1 at the E2 pipeline stage, which can be representative of the scalar interface 502 of FIG. 5 writing an address to the victim storage 218 at the E3 pipeline stage.
  • If not, control returns to block 2012 of the example machine readable instructions 2000 of FIG. 20 to output the cache hit address(es) based on the cache hit-miss conversion logic.
  • If, at block 2104, the L1 cache 110 determines that a new address from the second interface is being written to the victim cache in a later pipeline stage, then, at block 2106, the L1 cache 110 compares the new address to the address of the cache hit.
  • the first AND gate 864A can assert a logic one in response to an address of VTAG_WR_SET0 not matching an address of HIT_DP0.
  • the third comparator 870A can compare an address of HIT_DP0 to an address being written to the victim storage 218 by the scalar interface 502.
  • the fourth comparator 872A can compare an address of HIT_DP0 to an address being written to the victim storage 218 by the vector interface 504.
  • the L1 cache 110 determines whether a cache hit or a cache miss is identified. For example, the first AND gate 864A, the third comparator 870A, and/or the fourth comparator 872A can determine that there is a cache hit of the address of ADP_ADDR_E2_DP0 in the victim storage 218 based on HIT_DP0 including at least one bit value of 1.
  • the second AND gate 864B, the fifth comparator 870B, and/or the sixth comparator 872B can determine that there is a cache hit of the address of ADP_ADDR_E2_DP1 in the victim storage 218 based on HIT_DP1 including at least one bit value of 1.
  • If the L1 cache 110 determines that a cache hit is identified, then, at block 2112, the L1 cache 110 converts the cache hit to a cache miss.
  • the first AND gate 864A can output a logic low to convert a cache hit to a cache miss.
  • the second AND gate 864B can output a logic low to convert a cache hit to a cache miss.
  • control returns to block 2012 of the example machine readable instructions 2000 of FIG. 20 to output the cache hit address(es) based on the cache hit-miss conversion logic.
  • the third comparator 870A and/or the fourth comparator 872A can assert a logic one to convert a cache miss to a cache hit in response to ADP_ADDR_E2_DP0 matching an address of a write operation from either DP0 or DP1.
  • the fifth comparator 870B and/or the sixth comparator 872B can assert a logic one to convert a cache miss to a cache hit in response to ADP_ADDR_E2_DP1 matching an address of a write operation from either DP0 or DP1.
  • control returns to block 2012 of the example machine readable instructions 2000 of FIG. 20 to output the cache hit address(es) based on the cache hit-miss conversion logic.
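  • The two conversion cases of FIG. 21 can be summarized in the following C sketch; the structure describing the in-flight write and the function name are illustrative assumptions and do not correspond to the named comparators and gates.
    #include <stdint.h>
    #include <stdbool.h>

    /* A simplified view of a write landing in a later pipeline stage. */
    typedef struct {
        bool     wr_valid;   /* a write is in flight in the later stage */
        uint64_t wr_set;     /* set being overwritten                   */
        uint64_t wr_addr;    /* address being written                   */
    } inflight_write_t;

    bool resolve_hit(bool raw_hit, uint64_t rd_set, uint64_t rd_addr,
                     inflight_write_t w)
    {
        if (!w.wr_valid)
            return raw_hit;
        /* Hit-to-miss: the set that produced the hit is about to be
         * replaced by a different address, so the hit cannot be trusted. */
        if (raw_hit && w.wr_set == rd_set && w.wr_addr != rd_addr)
            return false;
        /* Miss-to-hit: the read address matches the data being written, so
         * the line will be present by the time the read completes. */
        if (!raw_hit && w.wr_addr == rd_addr)
            return true;
        return raw_hit;
    }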
  • FIG. 22 is an example flowchart representative of example machine readable instructions 2200 that may be executed by the example L1 cache 110 of FIGS. 1-5 to perform data allocation in the main storage 214, in conjunction with the above description. Although the instructions of FIG. 22 are described in conjunction with the L1 cache 110 of FIGS. 1-5 , the instructions may be described in conjunction with any type of storage in any type of cache.
  • the example main cache controller 222 obtains an instruction from the CPU interface 202 ( FIG. 2 ).
  • For example, the CPU interface 202 provides an instruction to the cache controller 220, and the cache controller 220 propagates the instruction to the main cache controller 222.
  • If the main cache controller 222 determines the instruction is a read instruction (e.g., block 2204 returns a value YES), the main cache controller 222 determines the address of the read instruction (block 2206). For example, the main cache controller 222 determines where the data is to be read from in the main storage 214. In some examples, the main tag RAM access 204 determines the address of the read instruction.
  • the main cache controller 222 determines if the address of the read instruction matches an address in the tag RAMs 208, 210. For example, the cache controller 220 may obtain hit/miss results from the tag RAM access(es) 204, 206 and determine if the address is available in the main storage 214 and/or victim storage 218. If the main cache controller 222 determines the read instruction is a miss (e.g., block 2208 returns a value NO), the main cache controller 222 identifies the cache line associated with the address (block 2210). For example, the main cache controller 222 is a direct mapped cache, and the address of the read instruction can only be stored in one location (e.g., at one cache line) of the main storage 214.
  • the main cache controller 222 allocates data of the cache line to the victim storage 218 (block 2212). For example, the main cache controller 222 allocates data from the direct mapped cache line to the victim storage 218. The main cache controller 222 allocates data regardless of the MESI state of that data. Such an allocation reduces latency of the main cache controller 222 and the overall L1 cache 110 by allocating any line in the main storage 214 to the victim storage 218.
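  • The direct-mapped victim selection of FIG. 22 can be modeled as below; the line count, line size, and structure layout are assumptions chosen only for illustration.
    #include <stdint.h>
    #include <stdbool.h>

    #define MAIN_LINES 256u   /* assumed number of direct-mapped lines */
    #define LINE_SHIFT 6u     /* assumed 64-byte cache lines           */

    typedef struct {
        bool     valid;
        uint64_t tag;
        /* data, MESI state, ... */
    } line_t;

    /* In a direct-mapped main cache an address can live in exactly one line,
     * so a read miss always displaces whatever that line currently holds;
     * the displaced line is pushed to the victim cache regardless of its
     * MESI state, and the caller refills the slot with the new data. */
    line_t *select_victim_line(line_t main_cache[MAIN_LINES], uint64_t miss_addr)
    {
        unsigned idx = (unsigned)((miss_addr >> LINE_SHIFT) % MAIN_LINES);
        return &main_cache[idx];
    }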
  • FIG. 23 is an example flowchart representative of example machine readable instructions 2300 that may be executed by the example L1 cache 110 of FIGS. 1-5 to facilitate a snoop request, in conjunction with the above Section 10.
  • the snoop address 502 (e.g., the snoop interface) obtains the snoop request from a higher-level data cache (e.g., the L2 data cache 112).
  • the snoop address 502 issues a read instruction to the tag RAM 210.
  • the read instruction is issued to the tag RAM 210 to identify whether the victim storage 218 includes the data requested via the snoop address 502.
  • the comparison logic 306c determines whether the read issued to the tag RAM 210 was a hit. (Block 2306). In the event the comparison logic 306c determines the read issued to the tag RAM 210 is not a hit (e.g., the control of block 2306 returns a result of NO), the victim storage 218 generates a snoop response indicating a miss occurred. (Block 2308). Also, the victim storage 218 transmits the snoop response back to the higher-level data cache (e.g., the L2 data cache 112). (Block 2310).
  • In the event the comparison logic 306c determines the read issued to the tag RAM 210 is a hit (e.g., the control of block 2306 returns a result of YES), the comparison logic 306c determines the state of the address associated with the read instruction in the MESI RAM 300. (Block 2312).
  • the comparison logic 306 may also store the state of the address as identified responsive to the read instruction in the MESI RAM 300.
  • the example address encoder 326c generates an address value for use by the victim storage 218 in obtaining the data.
  • the address encoder 326c encodes an address of the tag RAM 210 to a form that is interpretable by the victim storage 218.
  • the tag RAM 210 may store 16-bit memory addresses while the victim storage 218 stores 4-bit memory addresses corresponding to the 16-bit memory addresses.
  • the address encoder 326 may transform the 16-bit memory address into a 4-bit memory address to locate and/or enter the corresponding memory address in the victim storage 218.
  • the example response multiplexer 508 determines whether a data input is obtained from the victim cache store queue 216. (Block 2316). In the event the response multiplexer 508 determines no data has been input from the victim cache store queue 216 (e.g., the control of block 2316 returns a result of NO), the response multiplexer 508 outputs the data identified based on the address provided by the address encoder 326c as the snoop response to the higher-level data cache (e.g., the L2 data cache 112). (Block 2320).
  • Alternatively, in the event the response multiplexer 508 determines data has been input from the victim cache store queue 216 (e.g., the control of block 2316 returns a result of YES), the response multiplexer 508 identifies the updated version of the data as the data to be sent in the snoop response. (Block 2318).
  • the response multiplexer 508 outputs the data identified based on the address provided by the address encoder 326c as the snoop response to the higher-level data cache (e.g., the L2 data cache 112). (Block 2320).
  • the snoop address component 506 determines whether an additional snoop request is available. (Block 2322). In the event the snoop address component 506 (e.g., the snoop interface) determines an additional snoop request is available (e.g., the control of block 2322 returns a result of YES), the process 2300 returns to block 2302. Alternatively, in the event the snoop address component 506 (e.g., the snoop interface) determines an additional snoop request is not available (e.g., the control of block 2322 returns a result of NO), the process 2300 stops.
  • FIGS. 24 , 25 , 26 , 27 , 28 , and 29A , 29B-1, and 29B-2 are example flowcharts representative of example machine readable instructions that may be executed by the example L1 cache 110 of FIGS. 1-5 to perform eviction of data in the victim storage 218, in conjunction with the above Section 11.
  • the instructions of FIG. 24 , 25 , 26 , 27 , 28 , and 29A , 29B-1, and 29B-2 are described in conjunction with the L1 cache 110 of FIGS. 1-5 , the instructions may be described in conjunction with any type of storage in any type of cache.
  • FIG. 24 illustrates an example first operation 2400 of the replacement policy component 308 ( FIGS. 3 and 5 ) when the first and second data paths (DP0 and DP1) include valid transactions.
  • FIG. 25 illustrates an example second operation 2500 of the replacement policy component 308 when the first and second data paths (DP0 and DP1) include valid transactions.
  • FIG. 26 illustrates an example third operation 2600 of the replacement policy component 308 when the first and second data paths (DP0 and DP1) include valid transactions.
  • FIG. 27 illustrates an example valid-invalid operation 2700 of the replacement policy component 308 when the first data path is a valid transaction and the second data path is an invalid transaction.
  • the example replacement policy component 308 determines if the results indicate that both of the transactions of the first data path and the second data path are hits (block 2406). When the replacement policy component 308 determines DP0 and DP1 are both hits (e.g., block 2406 returns a value YES), the least recently used value Y remains constant (block 2408). For example, since neither the first data path nor the second data path needs to evict data, the LRU value does not need to change.
  • the example replacement policy component 308 determines if a transaction on a new clock cycle has been received (block 2410). For example, if the replacement policy component 308 obtains hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2410 returns a value YES) then control returns to block 2402. If the replacement policy component 308 does not obtain hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2410 returns a value NO) then the first operation 2400 ends.
  • the replacement policy component 308 determines if the results indicate that both of the transactions of the first data path and the second data path are misses (block 2412). For example, the replacement policy component 308 determines if both results from the first hit-miss comparison logic 306a and the second hit-miss comparison logic 306b indicate neither of the accesses matched the addresses in the tag RAM 210.
  • the replacement policy component 308 points the first data path to the LRU way (Y) (block 2414).
  • the replacement policy component 308 points the second data path to the next LRU way (Y+1) (block 2416).
  • the victim storage 218 includes n number of ways, each way has a location (e.g., slot 1, slot 2, ..., slot n), each way is mapped to an address, and each way includes data.
  • the replacement policy component 308 initializes a value Y to be equal to the least recently used way in the victim cache. For example, the LRU way is slot 2, thus Y is equal to slot 2.
  • the replacement policy component 308 When the replacement policy component 308 points the first data path to the LRU way (block 2414), the replacement policy component 308 is assigning the location of Y in the victim storage 218 to DP0 for eviction. Similarly, when the replacement policy component 308 points the second data path to the next LRU way (block 2416), the replacement policy component 308 is assigning the location of Y+1 in the victim storage 218 to DP1 for eviction.
  • the example replacement policy component 308 provides the pointer values to the example multiplexers 330a, 330b (block 2418). For example, the replacement policy component 308 provides a location (Y) of the way that is to be evicted by DP0 from the victim storage 218 to the multiplexer 330a and a location (Y+1) of the way that is to be evicted by DP1 to the multiplexer 330b.
  • the selecting input of the multiplexer 330a and 330b selects the replacement policy component input
  • the address read 332a and 332b reads the input of the replacement policy component 308 and evicts the ways indicated by location Y and location Y+1.
  • the example replacement policy component 308 determines if a transaction on a new clock cycle has been received (block 2422). For example, if the replacement policy component 308 obtains hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2422 returns a value YES) then control returns to block 2402. If the replacement policy component 308 does not obtain hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2422 returns a value NO) then the first operation 2400 ends.
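  • The dual-miss path of FIG. 24 (blocks 2414-2418) reduces to a small piece of pointer bookkeeping: DP0 takes the LRU way Y, DP1 takes the next LRU way Y+1, and both pointers are driven to the multiplexers 330a, 330b. The C sketch below is illustrative only; the names NUM_WAYS and assign_dual_miss are hypothetical, and the increment of Y by two mirrors the behavior described for the dual-allocate case of FIG. 29B-2, not a stated detail of this figure.

    #include <stdint.h>

    #define NUM_WAYS 16u   /* hypothetical number of victim-storage ways */

    /* Both data paths miss: DP0 evicts the LRU way Y, DP1 evicts way Y+1. */
    static void assign_dual_miss(uint32_t *lru, uint32_t *dp0_way, uint32_t *dp1_way)
    {
        *dp0_way = *lru % NUM_WAYS;           /* selection input of mux 330a */
        *dp1_way = (*lru + 1u) % NUM_WAYS;    /* selection input of mux 330b */
        *lru     = (*lru + 2u) % NUM_WAYS;    /* both slots consumed this cycle */
    }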
  • the example replacement policy component 308 determines if the results indicate that the first data path is a hit and the second data path is a miss (block 2502).
  • when the replacement policy component 308 determines that the first data path is a hit and the second data path is a miss (e.g., block 2502 returns a value YES), the replacement policy component 308 determines the location in the victim storage 218 of the hit way (DP0 Way) (block 2504). For example, the replacement policy component 308 analyzes the address of DP0 and identifies the location in the victim storage 218 that includes that address. In some examples, the replacement policy component 308 may include an updated list of the elements in the victim storage 218. In other examples, the replacement policy component 308 retrieves and/or obtains information from the tag RAM 210 regarding the locations of the addresses stored in the victim storage 218.
  • the example replacement policy component 308 determines if the hit way (DP0 Hit Way) matches the location of the next LRU value (Y+1) (block 2506). For example, the replacement policy component 308 may compare the location of the hit way containing the address of DP0 to the location value assigned to Y+1. If the replacement policy component 308 determines that the locations match (e.g., block 2506 returns a value YES), then the replacement policy component 308 switches the assignment of the next LRU value and the LRU value (block 2508). For example, the second data path DP1 pointer is to be assigned to the LRU value (e.g., location Y) instead of the next LRU value (e.g., location Y+1).
  • the replacement policy component 308 switches the assignment to avoid the second data path DP1 evicting the DP0 Hit Way. In some examples, the replacement policy component 308 decrements an indicator to indicate the LRU way of the victim storage 218 to be evicted by the second data path DP1.
  • the replacement policy component 308 points the second data path to the LRU way (Y) (block 2510). For example, the replacement policy component 308 assigns the value of Y (e.g., the location of the LRU way) to the second data path DP1 for eviction.
  • the example replacement policy component 308 provides the pointer values to the multiplexer(s) 330a, 330b (block 2512). For example, the replacement policy component 308 provides a location (Y) of the way that is to be evicted by DP1 from the victim storage 218 to the multiplexer 330b. In some examples, when the hit way does not match the location of the next LRU value (e.g., block 2506 returns a value NO), the replacement policy component 308 provides the pointer value Y+1 and the location of the hit way to the multiplexer(s) 330a, 330b. For example, the original assignment of the next LRU value to the second data path DP1 remains the same.
  • the example replacement policy component 308 increments Y based on eviction (block 2514). For example, if the assignments of LRU values to data paths were switched (e.g., DP1 pointer points to the LRU value Y), then the replacement policy component 308 increments Y by one. Otherwise, the replacement policy component 308 increments Y by two. In this manner, during the next clock cycle, the replacement policy component 308 is provided with an updated Y value and Y+1 value. Alternatively and/or additionally, the replacement policy component 308 increments indicators for the first and second data paths based on eviction.
  • the example replacement policy component 308 determines if a transaction on a new clock cycle has been received (block 2516). For example, if the replacement policy component 308 obtains hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2516 returns a value YES) then control returns to block 2402 of FIG. 24 . If the replacement policy component 308 does not obtain hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2516 returns a value NO) then the second operation 2500 ends.
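  • Blocks 2504-2514 amount to checking whether the DP0 hit way collides with the way already reserved for DP1 and, if it does, swapping the two pointers before advancing Y by one (swapped) or two (not swapped). A minimal C sketch follows; the names are hypothetical and the fragment is an illustration of the described pointer handling, not the hardware itself.

    #include <stdint.h>

    #define NUM_WAYS 16u   /* hypothetical way count */

    /* DP0 hits, DP1 misses (FIG. 25): returns the amount Y advances. */
    static uint32_t dp0_hit_dp1_miss(uint32_t *lru, uint32_t dp0_hit_way,
                                     uint32_t *dp1_evict_way)
    {
        uint32_t y      = *lru % NUM_WAYS;
        uint32_t y_next = (*lru + 1u) % NUM_WAYS;

        if (dp0_hit_way == y_next) {          /* blocks 2506/2508: collision */
            *dp1_evict_way = y;               /* DP1 redirected to the LRU way */
            *lru = (*lru + 1u) % NUM_WAYS;    /* block 2514: increment by one */
            return 1u;
        }
        *dp1_evict_way = y_next;              /* block 2512: keep original Y+1 */
        *lru = (*lru + 2u) % NUM_WAYS;        /* block 2514: increment by two */
        return 2u;
    }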
  • the example replacement policy component 308 determines that the results indicate that the first data path is a miss and the second data path is a hit (block 2602).
  • the example replacement policy component 308 determines the location in the victim storage 218 of the hit way (DP1 Way) (block 2604). For example, the replacement policy component 308 analyzes the address of DP1 and identifies the location in the victim storage 218 that includes that address.
  • the example replacement policy component 308 determines if the hit way (DP1 Way) matches the location of the LRU value (Y) (block 2606). For example, the replacement policy component 308 may compare the location of the hit way containing the address of DP1 to the location value assigned to Y. If the replacement policy component 308 determines that the locations match (e.g., block 2606 returns a value YES), then the replacement policy component 308 switches the assignment of the LRU value and the next LRU value (block 2608). For example, the first data path DP0 pointer is to be assigned to the next LRU value (e.g., location Y+1) instead of the LRU value (e.g., location Y).
  • the replacement policy component 308 switches the assignment to avoid the first data path DP0 evicting the DP1 Hit Way. In some examples, the replacement policy component 308 increments an indicator to indicate the next LRU way in the victim storage 218 to be evicted by the first data path DP0.
  • the replacement policy component 308 points the first data path to the next LRU value (Y+1) (block 2610). For example, the replacement policy component 308 assigns the value of Y+1 (e.g., the location of the next LRU way) to the first data path DP0 for eviction.
  • the example replacement policy component 308 provides the pointer values to the multiplexer(s) 330a, 330b (block 2612). For example, the replacement policy component 308 provides a location (Y+1) of the way that is to be evicted, by DP0, from the victim storage 218 to the multiplexer 330a. In some examples, when the hit way does not match the location of the LRU value (e.g., block 2606 returns a value NO), the replacement policy component 308 provides the pointer value Y and the location of the hit way to the multiplexer(s) 330a, 330b. For example, the original assignment of the LRU value to the first data path DP0 remains the same.
  • the example replacement policy component 308 increments Y based on eviction (block 2614). For example, if the assignments of LRU values to data paths were switched (e.g., DP0 pointer points to the next LRU value Y+1), then the replacement policy component 308 increments Y by two. Otherwise, the replacement policy component 308 increments Y by one. In this manner, during the next clock cycle, the replacement policy component 308 is provided with an updated Y value. Alternatively and/or additionally, the replacement policy component 308 increments indicators for the first and second data paths based on eviction.
  • the example replacement policy component 308 determines if a transaction on a new clock cycle has been received (block 2616). For example, if the replacement policy component 308 obtains hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2616 returns a value YES) then control returns to block 2402 of FIG. 24 . If the replacement policy component 308 does not obtain hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2616 returns a value NO) then the third operation 2600 ends.
  • when the example scalar interface 502 and the example vector interface 504 determine the first and second data paths are not valid transactions (e.g., block 2402 returns a value NO), the example scalar interface 502 and the example vector interface 504 determine if the first data path is valid and the second data path is invalid (block 2702). For example, the scalar interface 502 determines if the first data path DP0 is accessing (e.g., requesting a read or write operation) the victim storage 218 and the vector interface 504 determines if the second data path DP1 is not attempting to access the victim storage 218.
  • the replacement policy component 308 obtains results from the hit-miss comparison logic 306a (block 2704). For example, the replacement policy component 308 obtains a result indicating whether the first data path access has a matching address in the tag RAM 210 or does not have a matching address in the tag RAM 210. The example replacement policy component 308 determines if the results indicate that the first data path is a hit (block 2706).
  • the replacement policy component 308 determines the address of the first data path DP0 hits an address in the tag RAM 210 (e.g., block 2706 returns a value YES), the least recently used value Y remains constant (block 2708). For example, since the first data path does not need to evict data, the LRU value does not need to change.
  • the example replacement policy component 308 determines if a transaction on a new clock cycle has been received (block 2710). For example, if the replacement policy component 308 obtains hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2710 returns a value YES) then control returns to block 2402 of FIG. 24. If the replacement policy component 308 does not obtain hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2710 returns a value NO) then the valid-invalid operation 2700 ends.
  • when the example replacement policy component 308 determines that the results do not indicate that the first data path is a hit (e.g., block 2706 returns a value NO), the first data path is a miss (block 2712).
  • the example replacement policy component 308 points the first data path to the LRU Way (Y) (block 2714). For example, the replacement policy component 308 assigns the location of Y in the victim storage 218 to DP0 for eviction.
  • the example replacement policy component 308 provides the pointer value to the first multiplexer 330a (block 2716). For example, the replacement policy component 308 provides the location of the LRU way to the first multiplexer 330a for eviction of that way.
  • the example replacement policy component 308 increments Y (block 2718). For example, the replacement policy component 308 updates the LRU way to the next location (e.g., Y+1) in the victim storage 218. Alternatively and/or additionally, the replacement policy component 308 increments indicators for the first and second data paths.
  • the example replacement policy component 308 determines if a transaction on a new clock cycle has been received (block 2720). For example, if the replacement policy component 308 obtains hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2720 returns a value YES) then control returns to block 2402 of FIG. 24. If the replacement policy component 308 does not obtain hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2720 returns a value NO) then the valid-invalid operation 2700 ends.
  • when the example scalar interface 502 and the example vector interface 504 determine the first data path is not a valid transaction and the second data path is a valid transaction (e.g., block 2702 returns a value NO), the example scalar interface 502 and the example vector interface 504 determine the first data path is invalid and the second data path is valid (block 2802). For example, the scalar interface 502 determines that the first data path DP0 is not accessing (e.g., requesting a read or write operation) the victim storage 218 and the vector interface 504 determines that the second data path DP1 is accessing the victim storage 218.
  • the replacement policy component 308 obtains results from the hit-miss comparison logic 306b (block 2804). For example, the replacement policy component 308 obtains a result indicating whether the second data path access has a matching address in the tag RAM 210 or does not have a matching address in the tag RAM 210.
  • the example replacement policy component 308 determines if the results indicate that the second data path is a hit (block 2806). If the replacement policy component 308 determines the address of the second data path DP1 hits an address in the tag RAM 210 (e.g., block 2806 returns a value YES), the least recently used value Y remains constant (block 2808). For example, since the second data path does not need to evict data, the LRU value does not need to change.
  • the example replacement policy component 308 determines if a transaction on a new clock cycle has been received (block 2810). For example, if the replacement policy component 308 obtains hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2810 returns a value YES) then control returns to block 2402 of FIG. 24. If the replacement policy component 308 does not obtain hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2810 returns a value NO) then the operation 2800 ends.
  • when the example replacement policy component 308 determines that the results do not indicate that the second data path is a hit (e.g., block 2806 returns a value NO), the second data path is a miss (block 2812).
  • the example replacement policy component 308 points the second data path to the LRU Way (Y) (block 2814). For example, the replacement policy component 308 assigns the location of Y in the victim storage 218 to DP1 for eviction.
  • the example replacement policy component 308 provides the pointer value to the second multiplexer 330b (block 2816). For example, the replacement policy component 308 provides the location of the LRU way to the second multiplexer 330b for eviction of that way.
  • the example replacement policy component 308 increments Y (block 2818). For example, the replacement policy component 308 updates the LRU way to the next location (e.g., Y+1) in the victim storage 218. Alternatively and/or additionally, the replacement policy component 308 increments indicators for the first and second data paths.
  • the example replacement policy component 308 determines if a transaction on a new clock cycle has been received (block 2820). For example, if the replacement policy component 308 obtains hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2820 returns a value YES) then control returns to block 2402 of FIG. 24. If the replacement policy component 308 does not obtain hit-miss results corresponding to different transactions (e.g., accesses) than the previous accesses (e.g., block 2820 returns a value NO) then the operation 2800 ends.
  • the machine readable instructions 2400, 2500, 2600, 2700, and 2800 correspond to the first table 602 of FIG. 6 .
  • FIG. 29A, FIG. 29B-1, and FIG. 29B-2 are example flowcharts representative of example machine readable instructions 2900 that may be executed by the L1 cache 110 of FIGS. 1-5 to perform LRU incrementing in the victim storage 218 based on the allocation status of a data path, in conjunction with the above description.
  • the machine readable instructions 2900 begin at block 2902, at which the replacement policy component 308 initializes the first data path allocate pointer to equal location Y. For example, the replacement policy component 308 assigns a portion of the victim storage 218 not recently used by the CPU 102, having location Y, to the LRU value. In such an example, when the first data path DP0 is to allocate, the victim storage 218 evicts data from the LRU value (e.g., location Y). The replacement policy component 308 initializes the second data path allocate pointer to equal location Y+1 (block 2904). For example, the replacement policy component 308 assigns a portion of the victim storage 218 not recently used by the CPU 102, having location Y+1, to the next LRU value. In such an example, when the second data path DP1 is to allocate, the victim storage 218 evicts data from the next LRU value (e.g., location Y+1).
  • the replacement policy component 308 determines the first and second data paths are valid transactions (block 2906). For example, the CPU 102 provided instructions on both data paths.
  • the replacement policy component 308 determines if the hit location is equal to the location of the second data path allocate pointer (Y+1) (block 2912). For example, the replacement policy component 308 determines if the address of DP0 matches the location of Y+1. If the locations do match (e.g., block 2912 returns a value YES), the replacement policy component 308 updates the second data path allocate pointer to equal location Y (block 2914). For example, the replacement policy component 308 switches the assignment of the second data path allocate pointer from Y+1 to Y to avoid evicting data requested on the DP0 instruction.
  • the cache controller 220 performs the first transaction and the second transaction (block 2916). For example, the cache controller 220 reads/writes data of DP0 at location Y+1 and evicts data from location Y in the victim storage 218.
  • the replacement policy component 308 increments the first data path allocate pointer by one (block 2918). For example, since the cache controller 220 evicted data from location Y and not Y+1, the replacement policy component 308 only needs to update the LRU value to the next LRU value (Y+1).
  • the cache controller 220 performs the first transaction and the second transaction (block 2920). For example, the replacement policy component 308 determines that Y+1 includes data that is available to evict and thus, the second data path allocate pointer can evict data from that location while the first data path DP0 reads/writes data from the hit location.
  • when the replacement policy component 308 determines the condition of block 2908 is not true (e.g., block 2908 returns a value NO), the replacement policy component 308 determines if the first data path is to allocate and the second data path hits (block 2924). For example, the replacement policy component 308 determines if the second data path hits a location in the victim storage 218 and if the main storage 214 is allocating data on the first data path DP0.
  • when the replacement policy component 308 determines the first data path is to allocate and the second data path is a hit (e.g., block 2924 returns a value YES), the replacement policy component 308 determines the location in the victim storage 218 of the hit location (DP1 Way) (block 2926). For example, the replacement policy component 308 determines where the second data path is reading/writing data from in the victim storage 218.
  • the replacement policy component 308 determines if the hit location is equal to the location Y (block 2928). For example, the replacement policy component 308 determines if the first data path allocate pointer points to the same location storing the hit data.
  • when the replacement policy component 308 determines the locations match (e.g., block 2928 returns a value YES), the replacement policy component 308 updates the first data path allocate pointer to equal location Y+1 (block 2930). For example, the replacement policy component 308 switches the assignments of the LRU value and the next LRU value to avoid the first data path evicting the hit data from the victim storage 218.
  • the cache controller 220 performs the first transaction and the second transaction (block 2932). For example, the cache controller 220 reads/writes data from the location Y and evicts data of the location Y+1.
  • the replacement policy component 308 increments the pointer Y by two locations (block 2934). For example, the replacement policy component 308 updates the LRU location to a location after the most recently evicted location (e.g., in this example, the most recently evicted location is Y+1, therefore the LRU value is incremented by two to equal Y+2).
  • the replacement policy component 308 determines the hit location does not match the location of the originally assigned first data path pointer (Y) (e.g., block 2928 returns a value NO), the cache controller 220 performs the first transaction and the second transaction (block 2936). For example, the replacement policy component 308 determines that Y includes data that is available to evict and thus, the first data path allocate pointer can evict data from that location while the second data path DP1 reads/writes data from the hit location.
  • the replacement policy component 308 increments the first data path allocate pointer by one location (block 2938). For example, since the cache controller 220 evicts data from the location Y, the replacement policy component 308 updates the LRU value Y to a location after the evicted location. In this manner, the replacement policy component 308 includes an updated LRU value and an updated next LRU value during the next clock cycle.
  • if the condition of block 2924 of FIG. 29A is not true (e.g., block 2924 returns a value NO because it is not the case that the first data path is to allocate while the second data path hits), then control moves to FIG. 29B-2 where the replacement policy component 308 determines both data paths are to allocate (block 2940). For example, if two read-misses occur, the main storage 214 allocates two lines from the main storage 214 to the victim storage 218.
  • the cache controller 220 performs the first transaction and the second transaction (block 2942). For example, the cache controller 220 evicts data from the LRU location (Y) utilizing the first data path DP0 and evicts data from the next LRU location (Y+1) utilizing the second data path DP1.
  • the replacement policy component 308 increments the first data path allocate pointer by two (block 2944). For example, the replacement policy component 308 increments the location Y by two locations since data was evicted from Y+1. In some examples, when the LRU value is incremented by a value, the next LRU value is incremented simultaneously by the same value. Therefore, the first data path allocate pointer and the second data path allocate pointer always point to an updated and accurate eviction location.
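  • For the allocate cases that FIGS. 29B-1 and 29B-2 spell out, the amount the LRU pointer Y advances per cycle can be summarized as below. This C fragment is a sketch with hypothetical names; it covers only the cases described above (DP0 allocates while DP1 hits, and both data paths allocate), not every combination the hardware handles.

    #include <stdint.h>

    /* Amount the LRU pointer Y advances after one cycle of the allocate flow. */
    static uint32_t lru_increment(int both_allocate, int hit_collided_with_y)
    {
        if (both_allocate)
            return 2u;                   /* blocks 2942/2944: Y and Y+1 evicted */
        /* DP0 allocates while DP1 hits (blocks 2928-2938): */
        return hit_collided_with_y ? 2u  /* DP0 redirected to Y+1 (block 2934)     */
                                   : 1u; /* DP0 evicted Y as assigned (block 2938) */
    }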
  • the machine readable instructions 2900 correspond to the second table 604 of FIG. 6 .
  • FIG. 30 is an example flowchart representative of example machine readable instructions 3000 that may be executed by the example L1 cache 110 of FIGS. 1-5 to execute arbitration logic to perform a read, modify, or write operation as described above.
  • although the instructions of FIG. 30 are described in conjunction with the L1 cache 110 of FIGS. 1-5, the instructions may be used in conjunction with any type of storage in any type of cache.
  • the instructions of FIG. 30 are described in conjunction with the main half of the L1 cache 110 (e.g., the main cache store queue 212, the main storage 214, etc.). However, the instructions of FIG. 30 can likewise be used in conjunction with the victim side of the L1 cache 110 (e.g., the victim cache store queue 216, the victim storage 218, etc.).
  • the machine readable instructions 3000 of FIG. 30 begin at block 3002, at which the L1 cache 110 obtains store instruction(s) from interface(s) coupled to hardware.
  • the address processing components 302a-c of FIGS. 3A-3D can obtain the first store instruction 1018 from the scalar interface, the second store instruction 1020 from the memory interface, and/or the third store instruction 1022 from the vector interface.
  • the address processing components 302a-c of FIGS. 3A-3D can obtain the first store instruction 1018b from the scalar interface, the second store instruction 1020b from the memory interface, and/or the third store instruction 1022b from the vector interface of FIG. 10B .
  • the L1 cache 110 generates transaction data based on the store instruction(s).
  • the address processing components 302a-c and/or the bank processing logic 303 of FIGS. 3A-3D can generate the first transaction data, the second transaction data, and/or the third transaction data of FIG. 10A .
  • the address processing components 302a-c and/or the bank processing logic 303 of FIGS. 3A-3D can generate the first transaction data, the second transaction data, and/or the third transaction data of FIG. 10B .
  • An example process that may be used to implement block 3004 is described below in connection with FIG. 31 .
  • the L1 cache 110 determines whether read operation(s) is/are identified based on the transaction data.
  • the address processing components 302a-c can determine that at least one of the first store instruction 1018, the second store instruction 1020, or the third store instruction 1022 includes a request to have a read operation serviced (e.g., a value of RD_BANK_REQ[i] is indicative of a read request, a logic high signal for RD_BANK_REQ[i], etc.).
  • the read operation request can be determined based on the R/W data included in the store instructions 1018, 1020, 1022.
  • the address processing components 302a-c can determine that at least one of the first store instruction 1018b, the second store instruction 1020b, or the third store instruction 1022b includes a request to have a read operation serviced (e.g., a value of RD_BANK_REQ[i] is indicative of a read request, a logic high signal for RD_BANK_REQ[i], etc.).
  • the read operation request can be determined based on the R/W data included in the store instructions 1018b, 1020b, 1022b of FIG. 10B .
  • the L1 cache 110 determines that there are no read operations identified based on the transaction data, control proceeds to block 3014 to invoke second arbitration logic to write the data to the store queue. If, at block 3006, the L1 cache 110 determines that there is at least one read operation identified based on the transaction data, then, at block 3008, the L1 cache 110 invokes first arbitration logic to locate data for the read operation(s) in at least one of a store queue or storage. For example, the address processing components 302a-c can invoke the first arbitration logic 1008 to locate data for the read operation(s) in at least one of the main cache store queue 212 or the main storage 214.
  • the address processing components 302a-c can invoke the first arbitration logic 1008b to locate data for the read operation(s) in at least one of the victim cache store queue 216 or the victim storage 218.
  • An example process that may be executed to implement block 3008 is described below in connection with FIG. 32.
  • the L1 cache 110 identifies the most recent version of the located data. For example, the L1 cache 110 can compare a first version of the requested data from the main cache store queue 212 to a second version of the requested data from the main storage 214 and determine that the first version is more recent than the second version based on the comparison. Alternatively, the L1 cache 110 can compare a first version of the requested data from the victim cache store queue 216 to a second version of the requested data from the victim storage 218 and determine that the first version is more recent than the second version based on the comparison.
  • the L1 cache 110 delivers the most recent version of the located data to the store queue to execute a modify operation on the read and write data.
  • the main cache store queue 212 can deliver and/or otherwise transmit the first version of the requested data to the main cache store queue 212 to execute a modify operation on the requested data and the data to be written.
  • the victim cache store queue 216 can deliver and/or otherwise transmit the first version of the requested data to the victim cache store queue 216 to execute a modify operation on the requested data and the data to be written.
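  • One plausible way to picture the selection at blocks 3010 and 3012 is store-queue forwarding: when the store queue holds an in-flight write for the requested address, that copy is treated as the more recent version and is the one forwarded for the modify step. The C sketch below is an assumption-laden illustration (the struct and function names are invented), not the literal comparison performed by the hardware.

    #include <stdint.h>

    typedef struct {
        int      stq_pending;   /* store queue has an in-flight write for the address */
        uint64_t stq_data;      /* copy held in the store queue (212 or 216)          */
        uint64_t storage_data;  /* copy held in storage (214 or 218)                  */
    } read_candidates_t;

    /* Return the most recent copy of the requested data. */
    static uint64_t most_recent(const read_candidates_t *c)
    {
        return c->stq_pending ? c->stq_data : c->storage_data;
    }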
  • the L1 cache 110 invokes the second arbitration logic to write the data to the store queue or the storage.
  • the first arbitration logic 1008 can transmit an instruction to the second arbitration logic 1010 to write the WDATA or portion(s) thereof to at least one of the main cache store queue 212 or the main storage 214.
  • the first arbitration logic 1008b can transmit an instruction to the second arbitration logic 1010b to write the WDATA or portion(s) thereof to at least one of the victim cache store queue 216 or the victim storage 218 of FIG. 10B .
  • the example machine readable instructions 3000 of FIG. 30 conclude.
  • the machine readable instructions 3100 begin at block 3102, at which the L1 cache 110 extracts write data from the store instruction(s) based on a number of data storage banks.
  • the address processing components 302a-c and/or the bank processing logic 303 of FIGS. 3A-3D can extract the WDATA from the store instructions 1018, 1020, 1022 based on a quantity of data banks that the main cache store queue 212 and/or the main storage 214 are broken up into.
  • in response to the main cache store queue 212 having 16 data banks, the cache line can be 64 bits and, thus, WDATA can be extracted in 64-bit chunks.
  • the address processing components 302a-c and/or the bank processing logic 303 of FIGS. 3A-3D can extract the WDATA from the store instructions 1018b, 1020b, 1022b of FIG. 10B based on a quantity of data banks that the victim cache store queue 216 and/or the victim storage 218 are broken up into.
  • in response to the victim cache store queue 216 having 16 data banks, the cache line can be 64 bits and, thus, WDATA can be extracted in 64-bit chunks.
  • the L1 cache 110 determines byte enable data based on the store instruction(s).
  • the address processing components 302a-c and/or the bank processing logic 303 can determine the BYTEN/BANK[i] data of FIG. 10A based on the BYTEN data included in the store instructions 1018, 1020, 1022.
  • the address processing components 302a-c and/or the bank processing logic 303 can determine the BYTEN/BANK[i] data of FIG. 10B based on the BYTEN data included in the store instructions 1018b, 1020b, 1022b.
  • the L1 cache 110 determines a data access operation data size based on the store instruction(s).
  • the address processing components 302a-c and/or the bank processing logic 303 can determine the data size of data to be read, written, and/or modified based on the SIZE data included in the store instructions 1018, 1020, 1022.
  • the address processing components 302a-c and/or the bank processing logic 303 can determine the data size of data to be read, written, and/or modified based on the SIZE data included in the store instructions 1018b, 1020b, 1022b.
  • the L1 cache 110 determines a data storage address based on the store instruction(s). For example, the address processing components 302a-c and/or the bank processing logic 303 can determine the MS_ADDR[i] of a corresponding bank of the main cache store queue 212 and/or the STQ_ADDR[i] address of a corresponding bank of the main storage 214 based on the ADDR data included in the store instructions 1018, 1020, 1022.
  • the address processing components 302a-c and/or the bank processing logic 303 can determine the VS_ADDR[i] of a corresponding bank of the victim cache store queue 216 and/or the STQ_V_ADDR[i] address of a corresponding bank of the victim storage 218 based on the ADDR data included in the store instructions 1018b, 1020b, 1022b.
  • the L1 cache 110 maps the data access operation data size and the data storage address to a first quantity of data banks to read from.
  • the address processing components 302a-c and/or the bank processing logic 303 can map the data access operation size and the data storage address to zero or more banks of the main cache store queue 212, zero or more banks of the main storage 214, etc., to generate RD_BANK_REQ[i] of FIG. 10A .
  • the address processing components 302a-c and/or the bank processing logic 303 can map the data access operation size and the data storage address to zero or more banks of the victim cache store queue 216, zero or more banks of the victim storage 218, etc., to generate RD_BANK_REQ[i] of FIG. 10B .
  • the L1 cache 110 maps the data access operation data size and the data storage address to a second quantity of data banks to write to.
  • the address processing components 302a-c and/or the bank processing logic 303 can map the data access operation size and the data storage address to zero or more banks of the main cache store queue 212, zero or more banks of the main storage 214, etc., to generate WR_BANK_REQ[i] of FIG. 10A .
  • the address processing components 302a-c and/or the bank processing logic 303 can map the data access operation size and the data storage address to zero or more banks of the victim cache store queue 216, zero or more banks of the victim storage 218, etc., to generate WR_BANK_REQ[i] of FIG. 10B .
  • the L1 cache 110 generates transaction data based on at least one of the first quantity, the second quantity, the byte enable data, or the write data.
  • the address processing components 302a-c and/or the bank processing logic 303 can generate the first transaction data (TRANSACTION_DP0[i]), the second transaction data (TRANSACTION_DMA[i]), and the third transaction data (TRANSACTION_DP1[i]) of FIG. 10A .
  • the address processing components 302a-c and/or the bank processing logic 303 can generate the first transaction data (TRANSACTION_DP0[i]), the second transaction data (TRANSACTION_DMA[i]), and the third transaction data (TRANSACTION_DP1[i]) of FIG. 10B .
  • in response to generating the transaction data based on at least one of the first quantity, the second quantity, the byte enable data, or the write data at block 3116, control returns to block 3006 of the machine readable instructions 3000 of FIG. 30 to determine whether read operation(s) is/are identified based on the transaction data.
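  • The address/size-to-bank mapping of blocks 3110 and 3112 can be sketched as below. The bank count and bank width are hypothetical values chosen to match the 16-bank, 64-bit example mentioned above, the wrap-around behavior is an assumption, and the access is assumed to span at most one pass over the banks; the sketch only illustrates how one request can fan out into per-bank request bits such as RD_BANK_REQ[i] or WR_BANK_REQ[i].

    #include <stdint.h>

    #define NUM_BANKS        16u
    #define BANK_WIDTH_BYTES  8u   /* 64-bit banks */

    /* Set bit i of *bank_req for every bank touched by [addr, addr + size). */
    static void map_banks(uint64_t addr, uint32_t size_bytes, uint16_t *bank_req)
    {
        uint32_t first = (uint32_t)((addr / BANK_WIDTH_BYTES) % NUM_BANKS);
        uint32_t last  = (uint32_t)(((addr + size_bytes - 1u) / BANK_WIDTH_BYTES) % NUM_BANKS);

        *bank_req = 0u;
        for (uint32_t b = first; ; b = (b + 1u) % NUM_BANKS) {
            *bank_req |= (uint16_t)(1u << b);   /* bank b is part of this access */
            if (b == last)
                break;
        }
    }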
  • FIG. 32 is an example flowchart representative of example machine readable instructions 3200 that may be executed by the example L1 cache 110 of FIGS. 1-5 to invoke first arbitration logic to locate data for read operation(s) in at least one of a store queue or storage as described above.
  • the flowchart of FIG. 32 can be an example implementation of the machine readable instructions 3008 of FIG. 30 .
  • although the instructions of FIG. 32 are described in conjunction with the L1 cache 110 of FIGS. 1-5, the instructions may be used in conjunction with any type of storage in any type of cache.
  • the instructions of FIG. 32 are described in conjunction with the main half of the L1 cache 110 (e.g., the main cache store queue 212, the main storage 214, etc.). However, the instructions of FIG. 32 can likewise be used in conjunction with the victim side of the L1 cache 110 (e.g., the victim cache store queue 216, the victim storage 218, etc.).
  • the machine readable instructions 3200 begin at block 3202, at which the L1 cache 110 selects a data storage bank of interest to process.
  • the address processing components 302a-c and/or the bank processing logic 303 of FIGS. 3A-3D can select the first bank 1002 of FIG. 10A to process.
  • the address processing components 302a-c and/or the bank processing logic 303 of FIGS. 3A-3D can select the first bank 1002b of FIG. 10B to process.
  • the L1 cache 110 compares selected data storage bank to data storage banks included in read bank request(s) from interface(s).
  • the first arbitration logic 1008 can compare the bank(s) identified in respective one(s) of RD_BANK_REQ[i] from the scalar interface, the memory interface, and the vector interface to the first bank 1002 (e.g., STQ[0], MS[i], etc.).
  • the first arbitration logic 1008b can compare the bank(s) identified in respective one(s) of RD_BANK_REQ[i] from the scalar interface, the memory interface, and the vector interface to the first bank 1002b (e.g., STQ_V[0], VS[i], etc.).
  • the L1 cache 110 determines whether at least one interface requests access to the selected data storage bank. If, at block 3206, the L1 cache 110 determines that none of the interfaces request access to the selected data storage bank, control proceeds to block 3208 to determine that the selected data storage bank is not used for read operation(s). In response to determining that the selected data storage bank is not used for read operation(s) at block 3208, control returns to block 3202 to select another data storage bank of interest to process.
  • if, at block 3206, the L1 cache 110 determines that at least one interface requests access to the selected data storage bank, control proceeds to block 3210 to determine whether more than one interface requests access to the selected data storage bank. If, at block 3210, the L1 cache 110 determines that only one interface requests access to the selected data storage bank, control proceeds to block 3212 to invoke first arbitration logic to assign the selected data storage bank to the requesting interface. In response to invoking the first arbitration logic to assign the selected data storage bank to the requesting interface at block 3212, control returns to block 3202 to select another data storage bank of interest to process.
  • the first arbitration logic 1008 can assign the first bank 1002 to the one of the interfaces requiring a read operation as read operations are prioritized over write operations.
  • the first arbitration logic 1008b can assign the first bank 1002b to the one of the interfaces requiring a read operation as read operations are prioritized over write operations.
  • the L1 cache 110 invokes the first arbitration logic to inform second arbitration logic that the requesting interface requiring a write operation is not assigned the selected data storage bank. For example, if the scalar data path is requesting a write operation and a read operation and the scalar data path is not assigned the first data bank 1002 for the read operation, the first arbitration logic 1008 can instruct the second arbitration logic 1010 to not assign the scalar data path the first data bank 1002 and, thus, stall and/or otherwise prevent execution of the write operation since the corresponding read operation is not to be completed during the clock cycle.
  • the L1 cache 110 determines whether to select another data storage bank of interest to process. For example, the address processing components 302a-c and/or the bank processing logic 303 can determine to select a second bank of the main cache store queue 212 and the main storage 214 to process. Alternatively, the address processing components 302a-c and/or the bank processing logic 303 can determine to select a second bank of the victim cache store queue 216 and the victim storage 218 to process. If, at block 3218, the L1 cache 110 determines to select another data storage bank of interest to process, control returns to block 3202 to select another data storage bank of interest to process. If, at block 3218, the L1 cache 110 determines not to select another data storage bank of interest to process, control returns to block 3010 of the machine readable instructions 3000 of FIG. 30 to identify the most recent version of the located data.
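  • A per-bank view of the first arbitration logic described above might look like the C sketch below. The fixed requester ordering is purely an assumption made for illustration; the text only establishes that read requests win the bank and that a losing requester's paired write is reported to the second arbitration logic so the write can be stalled.

    /* Requesters on a bank: scalar (DP0), memory (DMA), vector (DP1). */
    enum { IF_SCALAR = 0, IF_DMA = 1, IF_VECTOR = 2, IF_NONE = -1 };

    /* rd_req[i] is nonzero if interface i wants to read this bank; on return,
     * wr_stall[i] is set for every losing interface so its paired write is held. */
    static int arbitrate_read_bank(const int rd_req[3], int wr_stall[3])
    {
        int winner = IF_NONE;
        for (int i = 0; i < 3; i++) {
            if (!rd_req[i])
                continue;
            if (winner == IF_NONE)
                winner = i;        /* bank assigned to this interface (block 3212) */
            else
                wr_stall[i] = 1;   /* inform second arbitration logic (block 3216) */
        }
        return winner;
    }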
  • FIG. 33 is an example flowchart representative of example machine readable instructions 3300 that may be executed by the example L1 cache 110 of FIGS. 1-5 to facilitate a read-modify-write operation in the victim cache store queue 216, in conjunction with the above Section 13.
  • the victim cache store queue 216 obtains a write instruction transmitted by the CPU 102 (e.g., transmitted through the cache controller 220) indicating byte(s) of a word, or an entire word, to be re-written. (Block 3302). For example, the write port 1126 may obtain the write instruction transmitted by the CPU 102.
  • the victim cache store queue 216 transmits the value of the portion of the word to be rewritten to the latch 1102b.
  • the latch 1102b transmits the value of the portion of the word to be rewritten to the latch 1102c.
  • the victim cache store queue 216 stores the address value associated with the location of the portion of the word to be rewritten in the pending store address data store 1116. (Block 3306). Also, the tag ram 210 transmits a read instruction (e.g., a read request) of the entire currently stored word to the victim storage 218. (Block 3308).
  • the victim cache store queue 216 determines whether there has been a subsequent clock cycle of the CPU 102, or the cache controller 220. (Block 3310). In some examples described herein, the latch 1102c determines whether there has been a subsequent clock cycle of the CPU 102, or the cache controller 220. In response to determining that there has not been a subsequent clock cycle of the CPU 102, or the cache controller 220, (e.g., the control of block 3310 returns a result of NO), the process waits.
  • the read-modify-write merge component 1108 obtains the value of the portion of the word (e.g., the byte) stored in the latch 1102c. (Block 3312). Also, the read-modify-write merge component 1108 obtains the entire currently stored word transmitted by the ECC logic 312. (Block 3314). In this manner, the read-modify-write merge component 1108 identifies the address of the byte in the currently stored word to be updated.
  • in response to the read-modify-write merge component 1108 identifying and/or otherwise obtaining (a) the value (e.g., byte value, bit value, etc.) of the portion of the currently stored word to be updated from the latch 1102c and (b) the currently stored word from the ECC logic 312, the read-modify-write merge component 1108 writes (e.g., replaces) the portion of the currently stored word with the value of the portion of the currently stored word obtained from the latch 1102c. (Block 3316). For example, the read-modify-write merge component 1108 writes the value of the portion of the word to an address value corresponding to the portion of the word in the word.
  • the victim cache store queue 216 generates error detection code based on the word, the error detection code to be stored with the word. (Block 3318). For example, the ECC generator 1112 generates the error detection code based on the word.
  • the control of block 3318 may be performed in response to an additional subsequent clock cycle of the CPU 102, or the cache controller 220.
  • the victim cache store queue 216 determines whether an additional write instruction is obtained. (Block 3322). In the event the victim cache store queue 216 determines another write instruction is obtained (e.g., the control of block 3322 returns a result of YES), the process returns to block 3302. Alternatively, in the event the victim cache store queue 216 determines another write instruction is not obtained (e.g., the control of block 3322 returns a result of NO), the process 3300 may wait until a threshold timeout period occurs, thus ending the process 3300.
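  • The merge performed at blocks 3312-3316, followed by the ECC generation at block 3318, can be pictured as a byte-wise replace-and-recompute step. In the C sketch below the names are hypothetical and parity8() merely stands in for whatever code the ECC generator 1112 actually produces; only the byte-enable merge pattern is the point.

    #include <stdint.h>

    static uint8_t parity8(uint64_t w)           /* placeholder for the ECC code */
    {
        uint8_t p = 0;
        for (int i = 0; i < 64; i++)
            p ^= (uint8_t)((w >> i) & 1u);
        return p;
    }

    /* Replace only the enabled bytes of the stored word, then regenerate ECC. */
    static uint64_t rmw_merge(uint64_t stored_word, uint64_t write_data,
                              uint8_t byten, uint8_t *ecc_out)
    {
        uint64_t merged = stored_word;
        for (int b = 0; b < 8; b++) {
            if (byten & (1u << b)) {             /* byte b enabled by the store */
                uint64_t mask = 0xFFull << (8 * b);
                merged = (merged & ~mask) | (write_data & mask);
            }
        }
        *ecc_out = parity8(merged);              /* block 3318 */
        return merged;
    }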
  • FIG. 34 is a block diagram of an example processor platform 3400 structured to execute the instructions of FIGS. 12-33 to implement the L1 cache 110 of FIGS. 1-5 and 10-11 .
  • the processor platform 3400 can be, for example, a server, a personal computer, a workstation, a self-learning machine (e.g., a neural network), a mobile device (e.g., a cell phone, a smart phone, a tablet such as an iPad™), a personal digital assistant (PDA), an Internet appliance, a gaming console, or any other type of computing device.
  • the processor platform 3400 of the illustrated example includes a processor 3412.
  • the processor 3412 of the illustrated example is hardware.
  • the processor 3412 can be implemented by one or more integrated circuits, logic circuits, microprocessors, GPUs, DSPs, or controllers from any desired family or manufacturer.
  • the hardware processor may be a semiconductor based (e.g., silicon based) device.
  • the processor implements any element of the example L1 cache 110 as shown in FIGS. 1-5 and 10-11 .
  • the processor 3412 of the illustrated example includes a local memory 3413 (e.g., a cache).
  • the processor 3412 of the illustrated example is in communication with a main memory including a volatile memory 3414 and a non-volatile memory 3416 via a bus 3418.
  • the volatile memory 3414 may be implemented by Synchronous Dynamic Random Access Memory (SDRAM), Dynamic Random Access Memory (DRAM), RAMBUS® Dynamic Random Access Memory (RDRAM®) and/or any other type of random access memory device.
  • the non-volatile memory 3416 may be implemented by flash memory and/or any other desired type of memory device. Access to the main memory 3414, 3416 is controlled by a memory controller.
  • the processor platform 3400 of the illustrated example also includes an interface circuit 3420.
  • the interface circuit 3420 may be implemented by any type of interface standard, such as an Ethernet interface, a universal serial bus (USB), a Bluetooth® interface, a near field communication (NFC) interface, and/or a PCI express interface.
  • one or more input devices 3422 are connected to the interface circuit 3420.
  • the input device(s) 3422 permit(s) a user to enter data and/or commands into the processor 3412.
  • the input device(s) can be implemented by, for example, an audio sensor, a microphone, a camera (still or video), a keyboard, a button, a mouse, a touchscreen, a track-pad, a trackball, isopoint and/or a voice recognition system.
  • One or more output devices 3424 are also connected to the interface circuit 3420 of the illustrated example.
  • the output devices 3424 can be implemented, for example, by display devices (e.g., a light emitting diode (LED), an organic light emitting diode (OLED), a liquid crystal display (LCD), a cathode ray tube display (CRT), an in-place switching (IPS) display, a touchscreen, etc.), a tactile output device, a printer and/or speaker.
  • the interface circuit 3420 of the illustrated example thus includes a graphics driver card, a graphics driver chip and/or a graphics driver processor.
  • the machine executable instructions 3432 of FIGS. 12-33 may be stored in the mass storage device 3428, in the volatile memory 3414, in the non-volatile memory 3416, and/or on a removable non-transitory computer readable storage medium such as a CD or DVD.
  • example methods, apparatus, and articles of manufacture have been described to facilitate write miss caching in a cache system.
  • the described methods, apparatus, and articles of manufacture improve the efficiency of using a computing device by reducing data cache latency in a computing system and reducing stress on a computer core.
  • the described methods, apparatus and articles of manufacture are accordingly directed to one or more improvement(s) in the functioning of a computer.


Claims (15)

  1. Vorrichtung, die Folgendes umfasst:
    einen ersten Zwischenspeicher (214);
    einen zweiten Zwischenspeicher (218),
    wobei der zweite Zwischenspeicher einen ersten Teil, der betreibbar ist, um einen ersten Satz von Daten zu speichern, die aus dem ersten Zwischenspeicher entfernt werden, und einen zweiten Teil beinhaltet;
    eine Zwischenspeichersteuerung (220), die mit dem ersten Zwischenspeicher und dem zweiten Zwischenspeicher gekoppelt ist und betreibbar ist, um:
    eine Schreiboperation zu empfangen;
    zu bestimmen, dass die Schreiboperation einen Fehlschlag in der ersten Zwischenspeicherung erzeugt;
    dadurch gekennzeichnet, dass die Zwischenspeichersteuerung ferner betreibbar ist, um:
    als Reaktion auf den Fehlschlag in der ersten Zwischenspeicherung, mit der Schreiboperation assoziierte Schreibfehlschlaginformationen an die zweite Zwischenspeicherung zum Speichern in dem zweiten Teil bereitzustellen.
  2. Vorrichtung nach Anspruch 1, wobei die Zwischenspeichersteuerung betreibbar ist, um:
    den zweiten Teil der zweiten Zwischenspeicherung mit einem Schwellenwert zu vergleichen; und
    basierend darauf, dass der zweite Teil den Schwellenwert überschreitet, die Schreibfehlschlaginformationen veranlasst, an den zweiten Zwischenspeicher übertragen zu werden.
  3. Vorrichtung nach Anspruch 1, wobei die Schreibfehlschlaginformationen erste Schreibfehlschlaginformationen sind, wobei die Zwischenspeichersteuerung betreibbar ist, um die ersten Schreibfehlschlaginformationen von der ersten Zwischenspeicherung an die zweite Zwischenspeicherung bereitzustellen, falls die zweite Zwischenspeicherung zweite Schreibfehlschlaginformationen für eine Speicheradresse beinhaltet, die den ersten Schreibfehlschlaginformationen entspricht.
  4. Vorrichtung nach Anspruch 1, wobei der erste Zwischenspeicher und der zweite Zwischenspeicher parallel mit einer zentralen Verarbeitungseinheit verbunden sind.
  5. Vorrichtung nach Anspruch 1, wobei die Schreibfehlschlaginformationen erste Schreibfehlschlaginformationen sind, wobei die Zwischenspeichersteuerung betreibbar ist, um, wenn eine erste Speicheradresse der ersten Schreibfehlschlaginformationen von der ersten Zwischenspeicherung mit einer zweiten Speicheradresse von zweiten Schreibfehlschlaginformationen übereinstimmt, die in dem zweiten Teil gespeichert sind, um die ersten Schreibfehlschlaginformationen mit den zweiten Schreibfehlschlaginformationen zusammenzuführen.
  6. Vorrichtung nach Anspruch 1, wobei der zweite Teil ein Bytefreigaberegister beinhaltet, wobei die Zwischenspeichersteuerung Werte in dem Bytefreigaberegister basierend auf den Schreibfehlschlaginformationen speichert; wobei optional die Werte Elementen der Schreibfehlschlaginformationen entsprechen, die geschrieben werden.
  7. System, das Folgendes umfasst:
    eine zentrale Verarbeitungseinheit, um einen Schreibbefehl auszugeben, der einer Speicheradresse entspricht; und
    die Vorrichtung nach Anspruch 1, wobei der zweite Zwischenspeicher angepasst ist, um die Schreibfehlschlaginformationen in einem dedizierten Abschnitt des zweiten Speichers zu speichern, wobei der dedizierte Abschnitt den Schreibfehlschlaginformationen gewidmet ist, wobei der dedizierte Abschnitt der zweite Teil ist.
  8. System nach Anspruch 7, wobei der zweite Zwischenspeicher die Schreibfehlschlaginformationen an den zweiten Zwischenspeicher ausgeben soll, wenn der dedizierte Abschnitt mehr als eine Schwellenwertmenge an Schreibfehlschlaginformationen aufweist.
  9. Vorrichtung nach Anspruch 2 oder System nach Anspruch 8, wobei der Schwellenwert a) einer Bandbreite einer Schnittstelle zu der zweiten Zwischenspeicherung; oder b) einer Größe des zweiten Teils entspricht.
  10. System nach Anspruch 7, wobei der zweite Zwischenspeicher die Schreibfehlschlaginformationen aus dem ersten Speicher nicht in dem zweiten Teil speichern soll, falls der zweite Zwischenspeicher zweite Schreibanweisungen beinhaltet, die einer gleichen Speicheradresse wie die Schreibfehlschlaginformationen von der zentralen Verarbeitungseinheit entsprechen.
  11. System nach Anspruch 7, wobei der erste Zwischenspeicher und der zweite Zwischenspeicher parallel mit der zentralen Verarbeitungseinheit verbunden sind.
  12. System nach Anspruch 7, wobei die Schreibfehlschlaginformationen erste Schreibfehlschlaginformationen sind, die ferner eine Steuerung beinhalten, um die ersten Schreibfehlschlaginformationen mit den zweiten Schreibfehlschlaginformationen zusammenzuführen, wenn eine erste Speicheradresse der ersten Schreibfehlschlaginformationen aus der ersten Zwischenspeicherung mit einer zweiten Speicheradresse von zweiten Schreibfehlschlaginformationen übereinstimmt, die in dem zweiten Teil gespeichert sind.
  13. Vorrichtung nach Anspruch 5 oder System nach Anspruch 12, wobei die Zwischenspeichersteuerung die ersten Schreibfehlschlaginformationen mit den zweiten Schreibfehlschlaginformationen zusammenführen soll, durch (a) Aufrechterhalten der ersten Schreibinformationen der ersten Schreibfehlschlaginformationen und/oder (b) Verwerfen der zweiten Schreibinformationen der zweiten Schreibfehlschlaginformationen, wenn die zweiten Schreibinformationen demselben einen oder mehreren Bytes wie die ersten Schreibfehlschlaginformationen entsprechen.
  14. A method comprising:
    receiving a write operation;
    determining that the write operation produces a miss in a first cache; and
    in response to the miss in the first cache, providing write miss information associated with the write operation to a second cache element having a first portion and a second portion for storage in the second portion (1202), wherein the first portion stores a first set of data evicted from the first cache.
  15. The method of claim 14, further comprising:
    comparing the second portion of the second cache to a threshold; and
    based on the second portion exceeding the threshold, outputting the write miss information from the second cache.
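
Claims 5, 6, 12 and 13 describe merging newer write miss information into write miss information already held in the second portion, with byte enables deciding which bytes survive. The C sketch below shows one way such a merge could behave; it is a minimal illustration under stated assumptions, not the claimed implementation, and the identifiers (wm_entry_t, wm_merge, LINE_BYTES) and the 16-byte entry width are illustrative only.

/* Hypothetical sketch of merging two write-miss entries that target the
 * same memory address. Newer bytes are kept; overlapping older bytes are
 * effectively discarded; non-overlapping older bytes are preserved. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_BYTES 16u            /* assumed width of one write-miss entry */

typedef struct {
    uint64_t address;             /* memory address of the missed write    */
    uint8_t  data[LINE_BYTES];    /* write data captured on the miss       */
    uint16_t byte_enable;         /* one bit per byte that carries data    */
} wm_entry_t;

/* Merge a newer entry (first write miss information) into an older entry
 * (second write miss information) stored in the second portion. */
static void wm_merge(wm_entry_t *older, const wm_entry_t *newer)
{
    if (older->address != newer->address)
        return;                   /* addresses must match before merging   */

    for (unsigned b = 0; b < LINE_BYTES; b++) {
        if (newer->byte_enable & (1u << b))
            older->data[b] = newer->data[b];   /* keep the newer write     */
    }
    older->byte_enable |= newer->byte_enable;  /* union of valid bytes     */
}

int main(void)
{
    wm_entry_t old_entry = { .address = 0x1000, .byte_enable = 0x000Fu };
    wm_entry_t new_entry = { .address = 0x1000, .byte_enable = 0x0003u };
    memset(old_entry.data, 0xAA, LINE_BYTES);
    memset(new_entry.data, 0x55, LINE_BYTES);

    wm_merge(&old_entry, &new_entry);
    printf("merged byte enables: 0x%04x, data[0]=0x%02x, data[2]=0x%02x\n",
           old_entry.byte_enable, old_entry.data[0], old_entry.data[2]);
    return 0;
}

Keeping the union of the byte enables mirrors the optional feature of claim 6, in which the stored values indicate which elements of the write miss information are to be written.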
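
Claims 8, 9 and 15 tie draining of the dedicated write-miss section to a threshold. The C sketch below illustrates that threshold check under stated assumptions; the names (wm_section_t, wm_record, wm_drain), the 8-entry section and the half-full threshold are invented for illustration and do not come from the patent.

/* Hypothetical sketch: once the dedicated write-miss section of the second
 * cache holds more than a threshold amount of entries, the accumulated
 * write-miss information is drained out of the second cache. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define WM_SECTION_ENTRIES 8u                 /* assumed size of the section */
#define WM_DRAIN_THRESHOLD (WM_SECTION_ENTRIES / 2u)

typedef struct {
    unsigned long long address[WM_SECTION_ENTRIES];
    size_t             used;                  /* occupied write-miss slots   */
} wm_section_t;

/* Record one write miss; report whether the section should be drained. */
static bool wm_record(wm_section_t *s, unsigned long long addr)
{
    if (s->used < WM_SECTION_ENTRIES)
        s->address[s->used++] = addr;
    return s->used > WM_DRAIN_THRESHOLD;      /* exceeds threshold -> drain  */
}

/* Stand-in for forwarding the recorded entries out of the second cache. */
static void wm_drain(wm_section_t *s)
{
    for (size_t i = 0; i < s->used; i++)
        printf("draining write miss for address 0x%llx\n", s->address[i]);
    s->used = 0;
}

int main(void)
{
    wm_section_t section = { .used = 0 };
    for (unsigned long long a = 0x2000; a < 0x2000 + 6 * 0x40; a += 0x40) {
        if (wm_record(&section, a))
            wm_drain(&section);
    }
    return 0;
}

Tying the threshold to the capacity of the dedicated section reflects option b) of claim 9; a controller could equally derive it from the bandwidth of the interface named in option a).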
EP20815004.5A 2019-05-24 2020-05-26 Verfahren und vorrichtung zur erleichterung der schreibfehlerzwischenspeicherung in einem zwischenspeichersystem Active EP3977296B1 (de)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP25193762.9A EP4636601A2 (de) 2019-05-24 2020-05-26 Verfahren und vorrichtung zur erleichterung von schreibfehlgriffzwischenspeicherung in einem cachesystem

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201962852494P 2019-05-24 2019-05-24
US16/882,258 US11693790B2 (en) 2019-05-24 2020-05-22 Methods and apparatus to facilitate write miss caching in cache system
PCT/US2020/034560 WO2020243098A1 (en) 2019-05-24 2020-05-26 Methods and apparatus to facilitate write miss caching in cache system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
EP25193762.9A Division EP4636601A2 (de) 2019-05-24 2020-05-26 Verfahren und vorrichtung zur erleichterung von schreibfehlgriffzwischenspeicherung in einem cachesystem

Publications (3)

Publication Number Publication Date
EP3977296A1 EP3977296A1 (de) 2022-04-06
EP3977296A4 EP3977296A4 (de) 2022-07-20
EP3977296B1 true EP3977296B1 (de) 2025-08-13

Family

ID=73456693

Family Applications (6)

Application Number Title Priority Date Filing Date
EP20815004.5A Active EP3977296B1 (de) 2019-05-24 2020-05-26 Verfahren und vorrichtung zur erleichterung der schreibfehlerzwischenspeicherung in einem zwischenspeichersystem
EP24183618.8A Pending EP4432288A3 (de) 2019-05-24 2020-05-26 Verfahren und vorrichtung zur erleichterung von vollpipeline-lese-änderungs-schreibunterstützung in einem level-1-datencache mit speicherwarteschlange und datenweiterleitung
EP20812543.5A Active EP3977295B1 (de) 2019-05-24 2020-05-26 Opfer-cache mit unterstützung des entleerens von write-miss-einträgen
EP24190664.3A Pending EP4443435A3 (de) 2019-05-24 2020-05-26 Opfer-cache mit unterstützung des entleerens von write-miss-einträgen
EP25193762.9A Pending EP4636601A2 (de) 2019-05-24 2020-05-26 Verfahren und vorrichtung zur erleichterung von schreibfehlgriffzwischenspeicherung in einem cachesystem
EP20813951.9A Active EP3977299B1 (de) 2019-05-24 2020-05-26 Verfahren und vorrichtung für lese-änderungs-schreibunterstützung im pipelineformat in einem cache-speicher

Family Applications After (5)

Application Number Title Priority Date Filing Date
EP24183618.8A Pending EP4432288A3 (de) 2019-05-24 2020-05-26 Verfahren und vorrichtung zur erleichterung von vollpipeline-lese-änderungs-schreibunterstützung in einem level-1-datencache mit speicherwarteschlange und datenweiterleitung
EP20812543.5A Active EP3977295B1 (de) 2019-05-24 2020-05-26 Opfer-cache mit unterstützung des entleerens von write-miss-einträgen
EP24190664.3A Pending EP4443435A3 (de) 2019-05-24 2020-05-26 Opfer-cache mit unterstützung des entleerens von write-miss-einträgen
EP25193762.9A Pending EP4636601A2 (de) 2019-05-24 2020-05-26 Verfahren und vorrichtung zur erleichterung von schreibfehlgriffzwischenspeicherung in einem cachesystem
EP20813951.9A Active EP3977299B1 (de) 2019-05-24 2020-05-26 Verfahren und vorrichtung für lese-änderungs-schreibunterstützung im pipelineformat in einem cache-speicher

Country Status (5)

Country Link
US (66) US11403229B2 (de)
EP (6) EP3977296B1 (de)
JP (6) JP7553477B2 (de)
CN (3) CN113853593B (de)
WO (3) WO2020243098A1 (de)

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11151042B2 (en) * 2016-09-27 2021-10-19 Integrated Silicon Solution, (Cayman) Inc. Error cache segmentation for power reduction
CN109392181B (zh) * 2017-08-11 2022-07-19 华为技术有限公司 发送和接收随机接入前导码的方法和装置
US10585819B2 (en) 2018-03-05 2020-03-10 Samsung Electronics Co., Ltd. SSD architecture for FPGA based acceleration
EP3893119B1 (de) * 2019-02-21 2023-07-26 Huawei Technologies Co., Ltd. System auf chip, routing-verfahren für zugangsbefehl und endgerät
US11403229B2 (en) 2019-05-24 2022-08-02 Texas Instruments Incorporated Methods and apparatus to facilitate atomic operations in victim cache
US11223575B2 (en) * 2019-12-23 2022-01-11 Advanced Micro Devices, Inc. Re-purposing byte enables as clock enables for power savings
KR102875444B1 (ko) * 2020-03-13 2025-10-24 에스케이하이닉스 주식회사 호스트 및 메모리 시스템을 포함하는 전자 시스템
KR20210157830A (ko) * 2020-06-22 2021-12-29 에스케이하이닉스 주식회사 메모리 및 메모리의 동작 방법
US11467962B2 (en) * 2020-09-02 2022-10-11 SiFive, Inc. Method for executing atomic memory operations when contested
US12093258B2 (en) * 2020-12-14 2024-09-17 Samsung Electronics Co., Ltd. Storage device adapter to accelerate database temporary table processing
WO2022139637A1 (en) * 2020-12-22 2022-06-30 Telefonaktiebolaget Lm Ericsson (Publ) Snapshotting pending memory writes using non-volatile memory
US12271318B2 (en) * 2020-12-28 2025-04-08 Advanced Micro Devices, Inc. Method and apparatus for managing a cache directory
CN112765057B (zh) * 2020-12-30 2024-04-30 京信网络系统股份有限公司 数据传输方法、pcie系统、设备及存储介质
US11144822B1 (en) * 2021-01-04 2021-10-12 Edgecortix Pte. Ltd. Neural network accelerator run-time reconfigurability
US11735285B1 (en) * 2021-03-12 2023-08-22 Kioxia Corporation Detection of address bus corruption for data storage devices
US11599269B2 (en) * 2021-03-17 2023-03-07 Vmware, Inc. Reducing file write latency
US11803311B2 (en) 2021-03-31 2023-10-31 Advanced Micro Devices, Inc. System and method for coalesced multicast data transfers over memory interfaces
US12175116B2 (en) * 2021-04-27 2024-12-24 Microchip Technology Inc. Method and apparatus for gather/scatter operations in a vector processor
CN113553292B (zh) * 2021-06-28 2022-04-19 睿思芯科(深圳)技术有限公司 一种向量处理器及相关数据访存方法
US11768599B2 (en) 2021-07-13 2023-09-26 Saudi Arabian Oil Company Managing an enterprise data storage system
KR102850055B1 (ko) 2021-07-23 2025-08-25 삼성전자주식회사 Dma를 이용한 데이터 처리 장치 및 방법
CN113778906B (zh) * 2021-07-30 2023-11-21 成都佰维存储科技有限公司 请求读取方法、装置、可读存储介质及电子设备
US11829643B2 (en) * 2021-10-25 2023-11-28 Skyechip Sdn Bhd Memory controller system and a method of pre-scheduling memory transaction for a storage device
US20220107897A1 (en) * 2021-12-15 2022-04-07 Intel Corporation Cache probe transaction filtering
US11847062B2 (en) * 2021-12-16 2023-12-19 Advanced Micro Devices, Inc. Re-fetching data for L3 cache data evictions into a last-level cache
US12050538B2 (en) * 2022-03-30 2024-07-30 International Business Machines Corporation Castout handling in a distributed cache topology
US12131058B2 (en) * 2022-04-22 2024-10-29 SanDisk Technologies, Inc. Configurable arithmetic HW accelerator
US20230359556A1 (en) * 2022-05-03 2023-11-09 Advanced Micro Devices, Inc. Performing Operations for Handling Data using Processor in Memory Circuitry in a High Bandwidth Memory
US12242388B2 (en) * 2022-06-02 2025-03-04 Micron Technology, Inc. Row hammer mitigation using a victim cache
DE112023002072T5 (de) * 2022-06-28 2025-02-27 Apple Inc. Pc-basierte computerberechtigungen
CN115331718A (zh) * 2022-08-02 2022-11-11 长江存储科技有限责任公司 一种数据传输装置、方法、存储器及存储系统
US12013788B2 (en) * 2022-08-30 2024-06-18 Micron Technology, Inc. Evicting a cache line with pending control request
WO2024058801A1 (en) * 2022-09-12 2024-03-21 Google Llc Time-efficient implementation of cache replacement policy
CN115299888A (zh) * 2022-09-14 2022-11-08 曹毓琳 基于现场可编程门阵列的多微器官系统数据处理系统
US12462331B2 (en) * 2023-02-28 2025-11-04 Qualcomm Incorporated Adaptive caches for power optimization of graphics processing
US12298915B1 (en) 2023-07-26 2025-05-13 Apple Inc. Hierarchical store queue circuit
US20250147893A1 (en) * 2023-11-06 2025-05-08 Akeana, Inc. Cache evict duplication management
US20250225081A1 (en) * 2024-01-09 2025-07-10 Qualcomm Incorporated Priority-based cache eviction policy governed by latency critical central processing unit (cpu) cores
US20250252027A1 (en) * 2024-02-01 2025-08-07 Micron Technology, Inc. Compressing histograms in a memory device
WO2025175090A1 (en) * 2024-02-14 2025-08-21 Texas Instruments Incorporated Read-modify-write manager with arithmetic circuit

Family Cites Families (256)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4577293A (en) * 1984-06-01 1986-03-18 International Business Machines Corporation Distributed, on-chip cache
US4941201A (en) * 1985-01-13 1990-07-10 Abbott Laboratories Electronic data storage and retrieval apparatus and method
US5003459A (en) 1988-04-01 1991-03-26 Digital Equipment Corporation Cache memory system
KR920009192B1 (ko) * 1989-12-26 1992-10-14 재단법인한국전자 통신연구소 다중처리 시스템의 스눕 인터페이스 방법 및 그 장치
US5412799A (en) * 1990-02-27 1995-05-02 Massachusetts Institute Of Technology Efficient data processor instrumentation for systematic program debugging and development
US5256293A (en) 1991-09-20 1993-10-26 Research Corporation Technologies, Inc. Separation of enantiomers of non-steroidal anti-inflammatory drugs and chiral selector therefor
US5325503A (en) * 1992-02-21 1994-06-28 Compaq Computer Corporation Cache memory system which snoops an operation to a first location in a cache line and does not snoop further operations to locations in the same line
JPH05342107A (ja) * 1992-06-11 1993-12-24 Fujitsu Ltd キャッシュメモリ制御装置
US5418973A (en) * 1992-06-22 1995-05-23 Digital Equipment Corporation Digital computer system with cache controller coordinating both vector and scalar operations
US6219773B1 (en) * 1993-10-18 2001-04-17 Via-Cyrix, Inc. System and method of retiring misaligned write operands from a write buffer
US5687338A (en) * 1994-03-01 1997-11-11 Intel Corporation Method and apparatus for maintaining a macro instruction for refetching in a pipelined processor
US5644752A (en) * 1994-06-29 1997-07-01 Exponential Technology, Inc. Combined store queue for a master-slave cache system
US5561782A (en) * 1994-06-30 1996-10-01 Intel Corporation Pipelined cache system having low effective latency for nonsequential accesses
US5577227A (en) * 1994-08-04 1996-11-19 Finnell; James S. Method for decreasing penalty resulting from a cache miss in multi-level cache system
DE69616402T2 (de) * 1995-03-31 2002-07-18 Sun Microsystems, Inc. Schnelle Zweitor-Cachesteuerungsschaltung für Datenprozessoren in einem paketvermittelten cachekohärenten Multiprozessorsystem
US5651136A (en) * 1995-06-06 1997-07-22 International Business Machines Corporation System and method for increasing cache efficiency through optimized data allocation
JPH09114734A (ja) * 1995-10-16 1997-05-02 Hitachi Ltd ストアバッファ装置
US5809228A (en) * 1995-12-27 1998-09-15 Intel Corporaiton Method and apparatus for combining multiple writes to a memory resource utilizing a write buffer
US5822755A (en) * 1996-01-25 1998-10-13 International Business Machines Corporation Dual usage memory selectively behaving as a victim cache for L1 cache or as a tag array for L2 cache
US5758056A (en) * 1996-02-08 1998-05-26 Barr; Robert C. Memory system having defective address identification and replacement
JP3429948B2 (ja) * 1996-04-10 2003-07-28 株式会社日立製作所 組込み型cpu用制御装置
US20010034808A1 (en) * 1996-07-19 2001-10-25 Atsushi Nakajima Cache memory device and information processing system
US6038645A (en) * 1996-08-28 2000-03-14 Texas Instruments Incorporated Microprocessor circuits, systems, and methods using a combined writeback queue and victim cache
US5860107A (en) * 1996-10-07 1999-01-12 International Business Machines Corporation Processor and method for store gathering through merged store operations
KR100190379B1 (ko) 1996-11-06 1999-06-01 김영환 쓰기 사이클의 성능 향상을 위한 프로세서
US5894569A (en) * 1997-04-14 1999-04-13 International Business Machines Corporation Method and system for back-end gathering of store instructions within a data-processing system
US6173371B1 (en) * 1997-04-14 2001-01-09 International Business Machines Corporation Demand-based issuance of cache operations to a processor bus
US5978888A (en) * 1997-04-14 1999-11-02 International Business Machines Corporation Hardware-managed programmable associativity caching mechanism monitoring cache misses to selectively implement multiple associativity levels
US5935233A (en) * 1997-05-21 1999-08-10 Micron Electronics, Inc. Computer system with a switch interconnector for computer devices
US6085294A (en) * 1997-10-24 2000-07-04 Compaq Computer Corporation Distributed data dependency stall mechanism
US6078992A (en) 1997-12-05 2000-06-20 Intel Corporation Dirty line cache
US6226713B1 (en) 1998-01-21 2001-05-01 Sun Microsystems, Inc. Apparatus and method for queueing structures in a multi-level non-blocking cache subsystem
US6195729B1 (en) * 1998-02-17 2001-02-27 International Business Machines Corporation Deallocation with cache update protocol (L2 evictions)
US6289438B1 (en) * 1998-07-29 2001-09-11 Kabushiki Kaisha Toshiba Microprocessor cache redundancy scheme using store buffer
US6215497B1 (en) * 1998-08-12 2001-04-10 Monolithic System Technology, Inc. Method and apparatus for maximizing the random access bandwidth of a multi-bank DRAM in a computer graphics system
US6243791B1 (en) * 1998-08-13 2001-06-05 Hewlett-Packard Company Method and architecture for data coherency in set-associative caches including heterogeneous cache sets having different characteristics
DE69924939T2 (de) * 1998-09-01 2006-03-09 Texas Instruments Inc., Dallas Verbesserte Speicherhierarchie für Prozessoren und Koherenzprotokoll hierfür
US6397296B1 (en) * 1999-02-19 2002-05-28 Hitachi Ltd. Two-level instruction cache for embedded processors
US6366984B1 (en) * 1999-05-11 2002-04-02 Intel Corporation Write combining buffer that supports snoop request
FR2795196B1 (fr) * 1999-06-21 2001-08-10 Bull Sa Processus de liberation de pages physiques pour mecanisme d'adressage virtuel
US6446166B1 (en) * 1999-06-25 2002-09-03 International Business Machines Corporation Method for upper level cache victim selection management by a lower level cache
US6484237B1 (en) * 1999-07-15 2002-11-19 Texas Instruments Incorporated Unified multilevel memory system architecture which supports both cache and addressable SRAM
US6609171B1 (en) * 1999-12-29 2003-08-19 Intel Corporation Quad pumped bus architecture and protocol
US6633299B1 (en) * 2000-01-10 2003-10-14 Intel Corporation Method and apparatus for implementing smart allocation policies for a small frame buffer cache serving 3D and 2D streams
JP2001249846A (ja) * 2000-03-03 2001-09-14 Hitachi Ltd キャッシュメモリ装置及びデータ処理システム
US20020087821A1 (en) * 2000-03-08 2002-07-04 Ashley Saulsbury VLIW computer processing architecture with on-chip DRAM usable as physical memory or cache memory
US6513104B1 (en) * 2000-03-29 2003-01-28 I.P-First, Llc Byte-wise write allocate with retry tracking
JP3498673B2 (ja) 2000-04-05 2004-02-16 日本電気株式会社 記憶装置
US6751720B2 (en) * 2000-06-10 2004-06-15 Hewlett-Packard Development Company, L.P. Method and system for detecting and resolving virtual address synonyms in a two-level cache hierarchy
JP2002007209A (ja) 2000-06-26 2002-01-11 Fujitsu Ltd データ処理装置
EP1182570A3 (de) * 2000-08-21 2004-08-04 Texas Instruments Incorporated TLB mit Ressource-Kennzeichnungsfeld
JP2002108702A (ja) 2000-10-03 2002-04-12 Hitachi Ltd マイクロコンピュータ及びデータ処理装置
US6546461B1 (en) * 2000-11-22 2003-04-08 Integrated Device Technology, Inc. Multi-port cache memory devices and FIFO memory devices having multi-port cache memory devices therein
JP2002163150A (ja) 2000-11-28 2002-06-07 Toshiba Corp プロセッサ
US6766389B2 (en) 2001-05-18 2004-07-20 Broadcom Corporation System on a chip for networking
US6775750B2 (en) 2001-06-29 2004-08-10 Texas Instruments Incorporated System protection map
US6839808B2 (en) * 2001-07-06 2005-01-04 Juniper Networks, Inc. Processing cluster having multiple compute engines and shared tier one caches
US6587929B2 (en) * 2001-07-31 2003-07-01 Ip-First, L.L.C. Apparatus and method for performing write-combining in a pipelined microprocessor using tags
US7085955B2 (en) * 2001-09-14 2006-08-01 Hewlett-Packard Development Company, L.P. Checkpointing with a write back controller
US6810465B2 (en) 2001-10-31 2004-10-26 Hewlett-Packard Development Company, L.P. Limiting the number of dirty entries in a computer cache
JP2003196084A (ja) 2001-12-25 2003-07-11 Toshiba Corp リードモディファイライトユニットを有するシステム
US20030182539A1 (en) * 2002-03-20 2003-09-25 International Business Machines Corporation Storing execution results of mispredicted paths in a superscalar computer processor
US6912628B2 (en) 2002-04-22 2005-06-28 Sun Microsystems Inc. N-way set-associative external cache with standard DDR memory devices
US7937559B1 (en) * 2002-05-13 2011-05-03 Tensilica, Inc. System and method for generating a configurable processor supporting a user-defined plurality of instruction sizes
JP2005527030A (ja) * 2002-05-24 2005-09-08 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ ストール機能を有する疑似マルチポートデータメモリ
EP1552396B1 (de) 2002-10-04 2013-04-10 Callahan Cellular L.L.C. Datenverarbeitungssystem mit einer hierarchischen speicherorganisation und betriebsverfahren dafür
US7290093B2 (en) * 2003-01-07 2007-10-30 Intel Corporation Cache memory to support a processor's power mode of operation
US6976125B2 (en) * 2003-01-29 2005-12-13 Sun Microsystems, Inc. Method and apparatus for predicting hot spots in cache memories
US7571287B2 (en) * 2003-03-13 2009-08-04 Marvell World Trade Ltd. Multiport memory architecture, devices and systems including the same, and methods of using the same
US6976132B2 (en) * 2003-03-28 2005-12-13 International Business Machines Corporation Reducing latency of a snoop tenure
US6880047B2 (en) * 2003-03-28 2005-04-12 Emulex Design & Manufacturing Corporation Local emulation of data RAM utilizing write-through cache hardware within a CPU module
US20040199722A1 (en) * 2003-04-03 2004-10-07 International Business Machines Corp. Method and apparatus for performing bus tracing in a data processing system having a distributed memory
TWI246658B (en) * 2003-04-25 2006-01-01 Ip First Llc Microprocessor, apparatus and method for selectively associating store buffer cache line status with response buffer cache line status
ITRM20030354A1 (it) * 2003-07-17 2005-01-18 Micron Technology Inc Unita' di controllo per dispositivo di memoria.
US7240277B2 (en) * 2003-09-26 2007-07-03 Texas Instruments Incorporated Memory error detection reporting
WO2005066830A1 (en) * 2004-01-08 2005-07-21 Agency For Science, Technology & Research A shared storage network system and a method for operating a shared storage network system
US7308526B2 (en) * 2004-06-02 2007-12-11 Intel Corporation Memory controller module having independent memory controllers for different memory types
US7277982B2 (en) * 2004-07-27 2007-10-02 International Business Machines Corporation DRAM access command queuing structure
US20060036817A1 (en) * 2004-08-10 2006-02-16 Oza Alpesh B Method and system for supporting memory unaligned writes in a memory controller
US8775740B2 (en) * 2004-08-30 2014-07-08 Texas Instruments Incorporated System and method for high performance, power efficient store buffer forwarding
US7360035B2 (en) * 2004-09-01 2008-04-15 International Business Machines Corporation Atomic read/write support in a multi-module memory configuration
US7606998B2 (en) * 2004-09-10 2009-10-20 Cavium Networks, Inc. Store instruction ordering for multi-core processor
JP2006091995A (ja) 2004-09-21 2006-04-06 Toshiba Microelectronics Corp キャッシュメモリのライトバック装置
US7716424B2 (en) * 2004-11-16 2010-05-11 International Business Machines Corporation Victim prefetching in a cache hierarchy
US20060143396A1 (en) * 2004-12-29 2006-06-29 Mason Cabot Method for programmer-controlled cache line eviction policy
US8103749B2 (en) * 2005-01-05 2012-01-24 Yissum Research Development Company Of The Hebrew University Of Jerusalem Method and apparatus for managing communications
US20060179231A1 (en) * 2005-02-07 2006-08-10 Advanced Micron Devices, Inc. System having cache memory and method of accessing
US7490200B2 (en) * 2005-02-10 2009-02-10 International Business Machines Corporation L2 cache controller with slice directory and unified cache structure
JP2006260378A (ja) 2005-03-18 2006-09-28 Seiko Epson Corp 半導体集積回路
US7386685B2 (en) 2005-03-29 2008-06-10 International Busniess Machines Corporation Method and apparatus for filtering snoop requests using multiple snoop caches
CN101156139A (zh) * 2005-04-08 2008-04-02 松下电器产业株式会社 高速缓冲存储器
US20070005842A1 (en) * 2005-05-16 2007-01-04 Texas Instruments Incorporated Systems and methods for stall monitoring
US20060277352A1 (en) * 2005-06-07 2006-12-07 Fong Pong Method and system for supporting large caches with split and canonicalization tags
US7711988B2 (en) * 2005-06-15 2010-05-04 The Board Of Trustees Of The University Of Illinois Architecture support system and method for memory monitoring
US7779307B1 (en) * 2005-09-28 2010-08-17 Oracle America, Inc. Memory ordering queue tightly coupled with a versioning cache circuit
US8019944B1 (en) * 2005-09-28 2011-09-13 Oracle America, Inc. Checking for a memory ordering violation after a speculative cache write
US7437510B2 (en) * 2005-09-30 2008-10-14 Intel Corporation Instruction-assisted cache management for efficient use of cache and memory
US20070094450A1 (en) * 2005-10-26 2007-04-26 International Business Machines Corporation Multi-level cache architecture having a selective victim cache
JP4832862B2 (ja) * 2005-11-18 2011-12-07 株式会社日立製作所 ディスクアレイシステム及びセキュリティ方法
US8327075B2 (en) * 2005-12-08 2012-12-04 International Business Machines Corporation Methods and apparatus for handling a cache miss
US7613941B2 (en) * 2005-12-29 2009-11-03 Intel Corporation Mechanism for self refresh during advanced configuration and power interface (ACPI) standard C0 power state
US7461210B1 (en) * 2006-04-14 2008-12-02 Tilera Corporation Managing set associative cache memory according to entry type
US8407395B2 (en) * 2006-08-22 2013-03-26 Mosaid Technologies Incorporated Scalable memory system
KR101086417B1 (ko) 2006-11-27 2011-11-25 삼성전자주식회사 다이내믹 랜덤 액세스 메모리의 부분 액세스 장치 및 방법
CN101558390B (zh) * 2006-12-15 2014-06-18 密克罗奇普技术公司 用于微处理器的可配置高速缓冲存储器
US7539062B2 (en) * 2006-12-20 2009-05-26 Micron Technology, Inc. Interleaved memory program and verify method, device and system
US7996632B1 (en) * 2006-12-22 2011-08-09 Oracle America, Inc. Device for misaligned atomics for a highly-threaded x86 processor
JP5010271B2 (ja) 2006-12-27 2012-08-29 富士通株式会社 エラー訂正コード生成方法、およびメモリ制御装置
US7620780B1 (en) * 2007-01-23 2009-11-17 Xilinx, Inc. Multiprocessor system with cache controlled scatter-gather operations
US7660967B2 (en) * 2007-02-01 2010-02-09 Efficient Memory Technology Result data forwarding in parallel vector data processor based on scalar operation issue order
JP2008234806A (ja) * 2007-03-23 2008-10-02 Toshiba Corp 半導体記憶装置およびそのリダンダンシ方法
US7836262B2 (en) * 2007-06-05 2010-11-16 Apple Inc. Converting victim writeback to a fill
JP4613247B2 (ja) 2007-06-20 2011-01-12 富士通株式会社 演算処理装置、情報処理装置及び演算処理装置の制御方法
US8166246B2 (en) * 2008-01-31 2012-04-24 International Business Machines Corporation Chaining multiple smaller store queue entries for more efficient store queue usage
KR100947178B1 (ko) * 2008-02-01 2010-03-12 엠텍비젼 주식회사 카메라 시리얼 통신 인터페이스와 디스플레이 시리얼 통신인터페이스의 통합구조를 가지는 인터페이스 통합 장치
EP2271989A1 (de) * 2008-04-22 2011-01-12 Nxp B.V. Mehrfachverarbeitungsschaltung mit cache-schaltung zum schreiben auf vorher nicht geladene cache-zeilen
US7814300B2 (en) * 2008-04-30 2010-10-12 Freescale Semiconductor, Inc. Configurable pipeline to process an operation at alternate pipeline stages depending on ECC/parity protection mode of memory access
US9146744B2 (en) * 2008-05-06 2015-09-29 Oracle America, Inc. Store queue having restricted and unrestricted entries
US7934080B2 (en) * 2008-05-28 2011-04-26 Oracle America, Inc. Aggressive store merging in a processor that supports checkpointing
US8117395B1 (en) * 2008-06-25 2012-02-14 Marvell Israel (Misl) Ltd. Multi-stage pipeline for cache access
US8327072B2 (en) 2008-07-23 2012-12-04 International Business Machines Corporation Victim cache replacement
US8943273B1 (en) * 2008-08-14 2015-01-27 Marvell International Ltd. Method and apparatus for improving cache efficiency
US8181005B2 (en) 2008-09-05 2012-05-15 Advanced Micro Devices, Inc. Hybrid branch prediction device with sparse and dense prediction caches
US8782348B2 (en) * 2008-09-09 2014-07-15 Via Technologies, Inc. Microprocessor cache line evict array
TW201017421A (en) * 2008-09-24 2010-05-01 Panasonic Corp Cache memory, memory system and control method therefor
US8225045B2 (en) * 2008-12-16 2012-07-17 International Business Machines Corporation Lateral cache-to-cache cast-in
US9053030B2 (en) * 2009-01-28 2015-06-09 Nec Corporation Cache memory and control method thereof with cache hit rate
US8259520B2 (en) * 2009-03-13 2012-09-04 Unity Semiconductor Corporation Columnar replacement of defective memory cells
US8117390B2 (en) * 2009-04-15 2012-02-14 International Business Machines Corporation Updating partial cache lines in a data processing system
US9183145B2 (en) * 2009-04-27 2015-11-10 Intel Corporation Data caching in a network communications processor architecture
US8095734B2 (en) 2009-04-30 2012-01-10 Lsi Corporation Managing cache line allocations for multiple issue processors
US8533388B2 (en) * 2009-06-15 2013-09-10 Broadcom Corporation Scalable multi-bank memory architecture
JP5413001B2 (ja) * 2009-07-09 2014-02-12 富士通株式会社 キャッシュメモリ
US8244981B2 (en) * 2009-07-10 2012-08-14 Apple Inc. Combined transparent/non-transparent cache
US8392661B1 (en) * 2009-09-21 2013-03-05 Tilera Corporation Managing cache coherence
US8595425B2 (en) * 2009-09-25 2013-11-26 Nvidia Corporation Configurable cache for multiple clients
US8621478B2 (en) * 2010-01-15 2013-12-31 International Business Machines Corporation Multiprocessor system with multiple concurrent modes of execution
US8688901B2 (en) * 2009-12-08 2014-04-01 Intel Corporation Reconfigurable load-reduced memory buffer
US20110149661A1 (en) * 2009-12-18 2011-06-23 Rajwani Iqbal R Memory array having extended write operation
US9081501B2 (en) * 2010-01-08 2015-07-14 International Business Machines Corporation Multi-petascale highly efficient parallel supercomputer
US8341353B2 (en) 2010-01-14 2012-12-25 Qualcomm Incorporated System and method to access a portion of a level two memory and a level one memory
US8370582B2 (en) * 2010-01-26 2013-02-05 Hewlett-Packard Development Company, L.P. Merging subsequent updates to a memory location
US8429374B2 (en) * 2010-01-28 2013-04-23 Sony Corporation System and method for read-while-write with NAND memory device
US8621145B1 (en) * 2010-01-29 2013-12-31 Netapp, Inc. Concurrent content management and wear optimization for a non-volatile solid-state cache
US20130117838A1 (en) * 2010-02-11 2013-05-09 Timothy Evert LEVIN Superpositional Control of Integrated Circuit Processing
US8514235B2 (en) * 2010-04-21 2013-08-20 Via Technologies, Inc. System and method for managing the computation of graphics shading operations
JP5650441B2 (ja) * 2010-06-07 2015-01-07 キヤノン株式会社 演算装置、キャッシュ装置、その制御方法及びコンピュータプログラム
US8751745B2 (en) 2010-08-11 2014-06-10 Advanced Micro Devices, Inc. Method for concurrent flush of L1 and L2 caches
US8352685B2 (en) 2010-08-20 2013-01-08 Apple Inc. Combining write buffer with dynamically adjustable flush metrics
US20120079245A1 (en) * 2010-09-25 2012-03-29 Cheng Wang Dynamic optimization for conditional commit
US8904115B2 (en) * 2010-09-28 2014-12-02 Texas Instruments Incorporated Cache with multiple access pipelines
US8756374B2 (en) * 2010-11-05 2014-06-17 Oracle International Corporation Store queue supporting ordered and unordered stores
WO2012116369A2 (en) 2011-02-25 2012-08-30 Fusion-Io, Inc. Apparatus, system, and method for managing contents of a cache
US9547593B2 (en) 2011-02-28 2017-01-17 Nxp Usa, Inc. Systems and methods for reconfiguring cache memory
JP2012203729A (ja) * 2011-03-25 2012-10-22 Fujitsu Ltd 演算処理装置および演算処理装置の制御方法
US9229879B2 (en) * 2011-07-11 2016-01-05 Intel Corporation Power reduction using unmodified information in evicted cache lines
JP5524144B2 (ja) * 2011-08-08 2014-06-18 株式会社東芝 key−valueストア方式を有するメモリシステム
JP5674613B2 (ja) 2011-09-22 2015-02-25 株式会社東芝 制御システム、制御方法およびプログラム
US9330002B2 (en) * 2011-10-31 2016-05-03 Cavium, Inc. Multi-core interconnect in a network processor
US8634221B2 (en) * 2011-11-01 2014-01-21 Avago Technologies General Ip (Singapore) Pte. Ltd. Memory system that utilizes a wide input/output (I/O) interface to interface memory storage with an interposer and that utilizes a SerDes interface to interface a memory controller with an integrated circuit, and a method
US8966457B2 (en) * 2011-11-15 2015-02-24 Global Supercomputing Corporation Method and system for converting a single-threaded software program into an application-specific supercomputer
US9928179B2 (en) * 2011-12-16 2018-03-27 Intel Corporation Cache replacement policy
US9229853B2 (en) * 2011-12-20 2016-01-05 Intel Corporation Method and system for data de-duplication
US9251086B2 (en) 2012-01-24 2016-02-02 SanDisk Technologies, Inc. Apparatus, system, and method for managing a cache
JP5565425B2 (ja) 2012-02-29 2014-08-06 富士通株式会社 演算装置、情報処理装置および演算方法
US11024352B2 (en) * 2012-04-10 2021-06-01 Samsung Electronics Co., Ltd. Memory system for access concentration decrease management and access concentration decrease method
US20130321439A1 (en) * 2012-05-31 2013-12-05 Allen B. Goodrich Method and apparatus for accessing video data for efficient data transfer and memory cache performance
CN104508646A (zh) * 2012-06-08 2015-04-08 惠普发展公司,有限责任合伙企业 访问存储器
US8904100B2 (en) * 2012-06-11 2014-12-02 International Business Machines Corporation Process identifier-based cache data transfer
US9092359B2 (en) * 2012-06-14 2015-07-28 International Business Machines Corporation Identification and consolidation of page table entries
JP6011194B2 (ja) 2012-09-21 2016-10-19 富士通株式会社 演算処理装置及び演算処理装置の制御方法
US9639469B2 (en) * 2012-09-28 2017-05-02 Qualcomm Technologies, Inc. Coherency controller with reduced data buffer
US10169091B2 (en) * 2012-10-25 2019-01-01 Nvidia Corporation Efficient memory virtualization in multi-threaded processing units
US9158725B2 (en) * 2012-11-20 2015-10-13 Freescale Semiconductor, Inc. Flexible control mechanism for store gathering in a write buffer
US9170955B2 (en) 2012-11-27 2015-10-27 Intel Corporation Providing extended cache replacement state information
US9612972B2 (en) * 2012-12-03 2017-04-04 Micron Technology, Inc. Apparatuses and methods for pre-fetching and write-back for a segmented cache memory
US9244841B2 (en) 2012-12-31 2016-01-26 Advanced Micro Devices, Inc. Merging eviction and fill buffers for cache line transactions
US9317433B1 (en) * 2013-01-14 2016-04-19 Marvell Israel (M.I.S.L.) Ltd. Multi-core processing system having cache coherency in dormant mode
US8984230B2 (en) * 2013-01-30 2015-03-17 Hewlett-Packard Development Company, L.P. Method of using a buffer within an indexing accelerator during periods of inactivity
US9189422B2 (en) * 2013-02-07 2015-11-17 Avago Technologies General Ip (Singapore) Pte. Ltd. Method to throttle rate of data caching for improved I/O performance
US9489204B2 (en) * 2013-03-15 2016-11-08 Qualcomm Incorporated Method and apparatus for precalculating a direct branch partial target address during a misprediction correction process
US9223710B2 (en) * 2013-03-16 2015-12-29 Intel Corporation Read-write partitioning of cache memory
US9361240B2 (en) * 2013-04-12 2016-06-07 International Business Machines Corporation Dynamic reservations in a unified request queue
US20150006820A1 (en) * 2013-06-28 2015-01-01 Texas Instruments Incorporated Dynamic management of write-miss buffer to reduce write-miss traffic
US10061675B2 (en) * 2013-07-15 2018-08-28 Texas Instruments Incorporated Streaming engine with deferred exception reporting
US9606803B2 (en) 2013-07-15 2017-03-28 Texas Instruments Incorporated Highly integrated scalable, flexible DSP megamodule architecture
US8704842B1 (en) * 2013-07-17 2014-04-22 Spinella Ip Holdings, Inc. System and method for histogram computation using a graphics processing unit
WO2015010327A1 (zh) * 2013-07-26 2015-01-29 华为技术有限公司 数据发送方法、数据接收方法和存储设备
US9092345B2 (en) * 2013-08-08 2015-07-28 Arm Limited Data processing systems
US9710380B2 (en) * 2013-08-29 2017-07-18 Intel Corporation Managing shared cache by multi-core processor
US9612961B2 (en) * 2013-08-29 2017-04-04 Empire Technology Development Llc Cache partitioning in a multicore processor
JP6088951B2 (ja) * 2013-09-20 2017-03-01 株式会社東芝 キャッシュメモリシステムおよびプロセッサシステム
WO2015057857A1 (en) * 2013-10-15 2015-04-23 Mill Computing, Inc. Computer processor employing dedicated hardware mechanism controlling the initialization and invalidation of cache lines
TWI514145B (zh) * 2013-10-21 2015-12-21 Univ Nat Sun Yat Sen 可儲存除錯資料的處理器、其快取及控制方法
JP6179369B2 (ja) 2013-11-22 2017-08-16 富士通株式会社 演算処理装置及び演算処理装置の制御方法
GB2520942A (en) * 2013-12-03 2015-06-10 Ibm Data Processing system and method for data processing in a multiple processor system
US10216632B2 (en) * 2013-12-30 2019-02-26 Michael Henry Kass Memory system cache eviction policies
WO2015101827A1 (en) * 2013-12-31 2015-07-09 Mosys, Inc. Integrated main memory and coprocessor with low latency
US9436972B2 (en) * 2014-03-27 2016-09-06 Intel Corporation System coherency in a distributed graphics processor hierarchy
US9342403B2 (en) * 2014-03-28 2016-05-17 Intel Corporation Method and apparatus for managing a spin transfer torque memory
US9483310B2 (en) * 2014-04-29 2016-11-01 Bluedata Software, Inc. Associating cache memory with a work process
WO2015165055A1 (zh) * 2014-04-30 2015-11-05 华为技术有限公司 存储数据的方法、内存控制器和中央处理器
US9864007B2 (en) * 2014-04-30 2018-01-09 Duke University Software-based self-test and diagnosis using on-chip memory
US9691452B2 (en) * 2014-08-15 2017-06-27 Micron Technology, Inc. Apparatuses and methods for concurrently accessing different memory planes of a memory
GB2549239A (en) * 2014-11-13 2017-10-18 Advanced Risc Mach Ltd Context sensitive barriers in data processing
US10447316B2 (en) * 2014-12-19 2019-10-15 Micron Technology, Inc. Apparatuses and methods for pipelining memory operations with error correction coding
US9690710B2 (en) * 2015-01-15 2017-06-27 Qualcomm Incorporated System and method for improving a victim cache mode in a portable computing device
US9696934B2 (en) * 2015-02-12 2017-07-04 Western Digital Technologies, Inc. Hybrid solid state drive (SSD) using PCM or other high performance solid-state memory
US10360972B2 (en) * 2015-03-10 2019-07-23 Rambus Inc. Memories and memory components with interconnected and redundant data interfaces
US20160269501A1 (en) * 2015-03-11 2016-09-15 Netapp, Inc. Using a cache cluster of a cloud computing service as a victim cache
US20160294983A1 (en) * 2015-03-30 2016-10-06 Mellanox Technologies Ltd. Memory sharing using rdma
US20170046278A1 (en) * 2015-08-14 2017-02-16 Qualcomm Incorporated Method and apparatus for updating replacement policy information for a fully associative buffer cache
JP6540363B2 (ja) * 2015-08-19 2019-07-10 富士通株式会社 ストレージ制御装置、ストレージ制御方法、およびストレージ制御プログラム
US9824012B2 (en) * 2015-09-24 2017-11-21 Qualcomm Incorporated Providing coherent merging of committed store queue entries in unordered store queues of block-based computer processors
US10002076B2 (en) * 2015-09-29 2018-06-19 Nxp Usa, Inc. Shared cache protocol for parallel search and replacement
US10255196B2 (en) 2015-12-22 2019-04-09 Intel Corporation Method and apparatus for sub-page write protection
US20170255569A1 (en) * 2016-03-01 2017-09-07 Qualcomm Incorporated Write-allocation for a cache based on execute permissions
US10019375B2 (en) 2016-03-02 2018-07-10 Toshiba Memory Corporation Cache device and semiconductor device including a tag memory storing absence, compression and write state information
US10185668B2 (en) * 2016-04-08 2019-01-22 Qualcomm Incorporated Cost-aware cache replacement
US10169240B2 (en) * 2016-04-08 2019-01-01 Qualcomm Incorporated Reducing memory access bandwidth based on prediction of memory request size
US9940267B2 (en) * 2016-05-17 2018-04-10 Nxp Usa, Inc. Compiler global memory access optimization in code regions using most appropriate base pointer registers
US12287763B2 (en) * 2016-05-27 2025-04-29 Netapp, Inc. Methods for facilitating external cache in a cloud storage environment and devices thereof
US10430349B2 (en) * 2016-06-13 2019-10-01 Advanced Micro Devices, Inc. Scaled set dueling for cache replacement policies
US9928176B2 (en) 2016-07-20 2018-03-27 Advanced Micro Devices, Inc. Selecting cache transfer policy for prefetched data based on cache test regions
US9946646B2 (en) 2016-09-06 2018-04-17 Advanced Micro Devices, Inc. Systems and method for delayed cache utilization
US10719447B2 (en) 2016-09-26 2020-07-21 Intel Corporation Cache and compression interoperability in a graphics processor pipeline
US10949360B2 (en) * 2016-09-30 2021-03-16 Mitsubishi Electric Corporation Information processing apparatus
US20180107602A1 (en) * 2016-10-13 2018-04-19 Intel Corporation Latency and Bandwidth Efficiency Improvement for Read Modify Write When a Read Operation is Requested to a Partially Modified Write Only Cacheline
EP3549129B1 (de) 2016-11-29 2021-03-10 ARM Limited Auf einen etikettabgleichsbefehl reagierende speicherschaltung
US10430706B2 (en) * 2016-12-01 2019-10-01 Via Alliance Semiconductor Co., Ltd. Processor with memory array operable as either last level cache slice or neural network unit memory
US10282296B2 (en) * 2016-12-12 2019-05-07 Intel Corporation Zeroing a cache line
US10162756B2 (en) * 2017-01-18 2018-12-25 Intel Corporation Memory-efficient last level cache architecture
US10331582B2 (en) 2017-02-13 2019-06-25 Intel Corporation Write congestion aware bypass for non-volatile memory, last level cache (LLC) dropping from write queue responsive to write queue being full and read queue threshold wherein the threshold is derived from latency of write to LLC and main memory retrieval time
JP2018133038A (ja) 2017-02-17 2018-08-23 Necプラットフォームズ株式会社 情報処理装置、制御装置、制御方法及びプログラム
US10102149B1 (en) * 2017-04-17 2018-10-16 Intel Corporation Replacement policies for a hybrid hierarchical cache
US10482028B2 (en) * 2017-04-21 2019-11-19 Intel Corporation Cache optimization for graphics systems
US20180336143A1 (en) * 2017-05-22 2018-11-22 Microsoft Technology Licensing, Llc Concurrent cache memory access
US10318436B2 (en) * 2017-07-25 2019-06-11 Qualcomm Incorporated Precise invalidation of virtually tagged caches
US11294594B2 (en) * 2017-08-07 2022-04-05 Kioxia Corporation SSD architecture supporting low latency operation
US20190073305A1 (en) * 2017-09-05 2019-03-07 Qualcomm Incorporated Reuse Aware Cache Line Insertion And Victim Selection In Large Cache Memory
US10503656B2 (en) * 2017-09-20 2019-12-10 Qualcomm Incorporated Performance by retaining high locality data in higher level cache memory
US10719058B1 (en) * 2017-09-25 2020-07-21 Cadence Design Systems, Inc. System and method for memory control having selectively distributed power-on processing
US10691345B2 (en) 2017-09-29 2020-06-23 Intel Corporation Systems, methods and apparatus for memory access and scheduling
US10402096B2 (en) * 2018-01-31 2019-09-03 EMC IP Holding Company LLC Unaligned IO cache for inline compression optimization
US10909040B2 (en) * 2018-04-19 2021-02-02 Intel Corporation Adaptive calibration of nonvolatile memory channel based on platform power management state
US10983922B2 (en) * 2018-05-18 2021-04-20 International Business Machines Corporation Selecting one of multiple cache eviction algorithms to use to evict a track from the cache using a machine learning module
US10884751B2 (en) * 2018-07-13 2021-01-05 Advanced Micro Devices, Inc. Method and apparatus for virtualizing the micro-op cache
US11030098B2 (en) * 2018-09-25 2021-06-08 Micron Technology, Inc. Configurable burst optimization for a parameterizable buffer
US10628312B2 (en) * 2018-09-26 2020-04-21 Nxp Usa, Inc. Producer/consumer paced data transfer within a data processing system having a cache which implements different cache coherency protocols
US11347644B2 (en) * 2018-10-15 2022-05-31 Texas Instruments Incorporated Distributed error detection and correction with hamming code handoff
US10768899B2 (en) * 2019-01-29 2020-09-08 SambaNova Systems, Inc. Matrix normal/transpose read and a reconfigurable data processor including same
US11061810B2 (en) * 2019-02-21 2021-07-13 International Business Machines Corporation Virtual cache mechanism for program break point register exception handling
US11487616B2 (en) * 2019-05-24 2022-11-01 Texas Instruments Incorporated Write control for read-modify-write operations in cache memory
US11403229B2 (en) * 2019-05-24 2022-08-02 Texas Instruments Incorporated Methods and apparatus to facilitate atomic operations in victim cache
US11163700B1 (en) * 2020-04-30 2021-11-02 International Business Machines Corporation Initiating interconnect operation without waiting on lower level cache directory lookup
CN116257472B (zh) * 2023-05-15 2023-08-22 上海励驰半导体有限公司 接口控制方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
EP3977296A4 (de) 2022-07-20
JP7762781B2 (ja) 2025-10-30
US11741020B2 (en) 2023-08-29
JP7762783B2 (ja) 2025-10-30
US11194729B2 (en) 2021-12-07
EP3977295A4 (de) 2022-08-03
JP2022532938A (ja) 2022-07-20
US11449432B2 (en) 2022-09-20
JP2024174950A (ja) 2024-12-17
US20250139019A1 (en) 2025-05-01
US20240004800A1 (en) 2024-01-04
US12189540B2 (en) 2025-01-07
US11693791B2 (en) 2023-07-04
US20250094359A1 (en) 2025-03-20
US20250225083A1 (en) 2025-07-10
US20250094358A1 (en) 2025-03-20
US20230342305A1 (en) 2023-10-26
EP3977299A4 (de) 2022-07-13
US12182038B2 (en) 2024-12-31
JP2024170552A (ja) 2024-12-10
US20200371912A1 (en) 2020-11-26
US20200371960A1 (en) 2020-11-26
US11762780B2 (en) 2023-09-19
US12007907B2 (en) 2024-06-11
US20230236974A1 (en) 2023-07-27
US20240095164A1 (en) 2024-03-21
US20250117340A1 (en) 2025-04-10
US20250272249A1 (en) 2025-08-28
JP7553477B2 (ja) 2024-09-18
US20230281126A1 (en) 2023-09-07
US20200371922A1 (en) 2020-11-26
US20200371962A1 (en) 2020-11-26
US12292839B2 (en) 2025-05-06
US20200371938A1 (en) 2020-11-26
US11403229B2 (en) 2022-08-02
WO2020243095A1 (en) 2020-12-03
EP3977299B1 (de) 2024-07-10
US11507513B2 (en) 2022-11-22
US20220206949A1 (en) 2022-06-30
EP4432288A2 (de) 2024-09-18
US20250348438A1 (en) 2025-11-13
US12197347B2 (en) 2025-01-14
EP3977296A1 (de) 2022-04-06
WO2020243099A1 (en) 2020-12-03
US20200371956A1 (en) 2020-11-26
US20230004500A1 (en) 2023-01-05
US20200371964A1 (en) 2020-11-26
US20200371939A1 (en) 2020-11-26
US20200371957A1 (en) 2020-11-26
US12321284B2 (en) 2025-06-03
US20250036573A1 (en) 2025-01-30
US20240264952A1 (en) 2024-08-08
US11620230B2 (en) 2023-04-04
US11693790B2 (en) 2023-07-04
US12141073B1 (en) 2024-11-12
US12417186B2 (en) 2025-09-16
CN113874846A (zh) 2021-12-31
US12210463B2 (en) 2025-01-28
US12265477B2 (en) 2025-04-01
JP7551659B2 (ja) 2024-09-17
US12072814B2 (en) 2024-08-27
US11714760B2 (en) 2023-08-01
CN113853592A (zh) 2021-12-28
US11275692B2 (en) 2022-03-15
US20220374362A1 (en) 2022-11-24
US20200371949A1 (en) 2020-11-26
US11940930B2 (en) 2024-03-26
US20250117341A1 (en) 2025-04-10
US20250028652A1 (en) 2025-01-23
US20250190368A1 (en) 2025-06-12
JP2022534891A (ja) 2022-08-04
EP4636601A2 (de) 2025-10-22
US20200371915A1 (en) 2020-11-26
US20240362166A1 (en) 2024-10-31
US11640357B2 (en) 2023-05-02
US20200371947A1 (en) 2020-11-26
US20200371946A1 (en) 2020-11-26
US20250028645A1 (en) 2025-01-23
US20240104026A1 (en) 2024-03-28
US20200371963A1 (en) 2020-11-26
EP4432288A3 (de) 2024-12-18
US20200371921A1 (en) 2020-11-26
US20240232100A1 (en) 2024-07-11
WO2020243098A1 (en) 2020-12-03
US20230401162A1 (en) 2023-12-14
US20250335372A1 (en) 2025-10-30
US11360905B2 (en) 2022-06-14
US12001345B2 (en) 2024-06-04
US20240020242A1 (en) 2024-01-18
US20220292023A1 (en) 2022-09-15
US11803486B2 (en) 2023-10-31
US12141079B2 (en) 2024-11-12
US12105640B2 (en) 2024-10-01
US12259826B2 (en) 2025-03-25
US20240296129A1 (en) 2024-09-05
US20200371948A1 (en) 2020-11-26
EP3977295B1 (de) 2024-09-04
US11461236B2 (en) 2022-10-04
US11940929B2 (en) 2024-03-26
US20240143516A1 (en) 2024-05-02
US20250028651A1 (en) 2025-01-23
US20250265200A1 (en) 2025-08-21
US20200371916A1 (en) 2020-11-26
US12380035B2 (en) 2025-08-05
US20240370380A1 (en) 2024-11-07
US11636040B2 (en) 2023-04-25
US12216591B2 (en) 2025-02-04
US12393521B2 (en) 2025-08-19
JP2022534892A (ja) 2022-08-04
US11886353B2 (en) 2024-01-30
US11119935B2 (en) 2021-09-14
US12147353B2 (en) 2024-11-19
US20230032348A1 (en) 2023-02-02
US11334494B2 (en) 2022-05-17
JP7553478B2 (ja) 2024-09-18
US20240193098A1 (en) 2024-06-13
US20230333991A1 (en) 2023-10-19
CN113853593A (zh) 2021-12-28
US20240028523A1 (en) 2024-01-25
US20200371961A1 (en) 2020-11-26
EP3977299A1 (de) 2022-04-06
US20240078190A1 (en) 2024-03-07
US20200371928A1 (en) 2020-11-26
US11442868B2 (en) 2022-09-13
US20250272250A1 (en) 2025-08-28
US12321285B2 (en) 2025-06-03
US12455836B2 (en) 2025-10-28
CN113853593B (zh) 2025-08-12
JP2024167392A (ja) 2024-12-03
EP4443435A3 (de) 2024-12-18
US20220309004A1 (en) 2022-09-29
US20210406190A1 (en) 2021-12-30
EP3977295A1 (de) 2022-04-06
US20200371911A1 (en) 2020-11-26
CN120973707A (zh) 2025-11-18
US20210342270A1 (en) 2021-11-04
US11868272B2 (en) 2024-01-09
US20230108306A1 (en) 2023-04-06
US20240419607A1 (en) 2024-12-19
US20220276965A1 (en) 2022-09-01
US11347649B2 (en) 2022-05-31
US12141078B2 (en) 2024-11-12
US20200371932A1 (en) 2020-11-26
EP4443435A2 (de) 2024-10-09
US11775446B2 (en) 2023-10-03

Similar Documents

Publication Publication Date Title
EP3977296B1 (de) Verfahren und vorrichtung zur erleichterung der schreibfehlerzwischenspeicherung in einem zwischenspeichersystem

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220103

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

A4 Supplementary search report drawn up and despatched

Effective date: 20220621

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 12/08 20160101AFI20220614BHEP

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20240222

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTG Intention to grant announced

Effective date: 20250310

P01 Opt-out of the competence of the unified patent court (upc) registered

Free format text: CASE NUMBER: APP_20664/2025

Effective date: 20250430

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602020056596

Country of ref document: DE

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D