US20230004493A1 - Bulk memory initialization - Google Patents
- Publication number: US20230004493A1 (application US 17/902,263)
- Authority: US (United States)
- Prior art keywords
- bulk
- store
- memory
- store operation
- bulk store
- Legal status: Pending (the status is an assumption, not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0877—Cache access modes
- G06F12/0879—Burst mode
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G06F3/0611—Improving I/O performance in relation to response time
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
- G06F12/0831—Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
- G06F12/0833—Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means in combination with broadcast means (e.g. for invalidation or updating)
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0815—Cache consistency protocols
- G06F12/0837—Cache consistency protocols with software control, e.g. non-cacheable data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0888—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches using selective caching, e.g. bypass
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/10—Address translation
- G06F12/1009—Address translation using page tables, e.g. page table structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0625—Power saving in storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0629—Configuration or reconfiguration of storage systems
- G06F3/0632—Configuration or reconfiguration of storage systems by initialisation or re-initialisation of storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0656—Data buffering arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0659—Command handling arrangements, e.g. command buffers, queues, command scheduling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
- G06F3/0688—Non-volatile semiconductor memory arrays
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1024—Latency reduction
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- The disclosure generally relates to memory initialization in a computing system.
- To perform any operation on data that resides in main memory, a central processing unit (CPU) first issues a series of commands to the main memory (e.g., DRAM modules) across an off-chip bus commonly referred to as a memory channel.
- The main memory responds by sending the data to the CPU, after which the data is placed within a cache. This process of moving data from main memory to the CPU incurs a long latency and consumes a significant amount of energy.
- Memory initialization is the process of establishing known values in memory. Initialization of a region of memory may occur in response to an allocation of that region to, for example, a computer program or an operating system. In some cases, memory is initialized to all zeroes.
- Initializing main memory is generally decomposed into a series of store instructions, each of which may initialize a small region of main memory. For example, each store instruction may initialize a region of main memory that is the size of a cache line.
- The series of store instructions may be executed in a CPU execution unit. Each store instruction may fetch a cache line into a cache, modify the cache line, and write the cache line to main memory. In those operations, the caches are not properly leveraged if the lines brought into the caches are not reused later by the CPU.
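The conventional decomposition described above can be sketched as a loop of cache-line-sized stores. This is an illustrative model only (the 64-byte line size is an assumption; real systems vary), not code from the patent:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64 /* assumed cache line size; varies by system */

/* One store per cache-line-sized region; each iteration models a
 * store that may pull the line into a cache even though the CPU
 * never reuses it afterward. */
static void init_region_linewise(uint8_t *base, size_t len)
{
    for (size_t off = 0; off < len; off += CACHE_LINE)
        memset(base + off, 0, CACHE_LINE);
}
```

A 4-kilobyte region initialized this way requires 64 such stores, which is the overhead the bulk store operation below is meant to avoid.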
- A computer system for initializing memory is disclosed.
- The computer system comprises a processor core comprising a central processing unit (CPU), a load store unit, and an internal cache.
- The computer system comprises a last level cache in communication with the processor core.
- The last level cache is configured to receive bulk store operations from the load store unit. Each bulk store operation includes a physical address in the memory to be initialized.
- The last level cache is configured to send multiple write transactions to the memory for each bulk store operation to perform a bulk initialization of the memory.
- The last level cache is configured to track the status of the bulk store operations.
- The last level cache is further configured to maintain cache coherence in a hierarchy of caches in the computer system when performing the bulk initialization of the memory for each bulk store operation.
- The load store unit comprises a bulk store combine buffer and is configured to store the status of the bulk store operations in the bulk store combine buffer.
- The load store unit is further configured to send the bulk store operations directly to the last level cache while bypassing the internal cache.
- The load store unit is further configured to track bulk store operations that are pending. Each bulk store operation is associated with a region of the memory to be initialized. The load store unit is further configured to block younger loads associated with any region of the memory associated with any pending bulk store operation.
- The load store unit is configured to either set the pending status for a bulk store operation to complete or remove the bulk store operation from the bulk store combine buffer in response to the last level cache indicating that the bulk store operation is complete.
- The last level cache is further configured to store intact status information associated with each bulk store operation.
- The intact status indicates whether a region of the memory initialized by a bulk store operation is still intact with the initialization values.
- The last level cache is further configured to set the intact status to not intact responsive to another processor core writing to a region of the memory associated with a bulk store operation.
- The load store unit is further configured to invalidate the entry for a first bulk store operation in the bulk store combine buffer responsive to the intact status indicating that the region is not intact.
- The load store unit is further configured to maintain the entry for a second bulk store operation as a valid entry in the bulk store combine buffer responsive to the intact status indicating that the region is intact.
- The load store unit is further configured to respond to a younger load instruction that loads from a region of the memory initialized by a completed bulk store operation by providing the known initialization values if the region is still intact.
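The condition for answering a younger load with the known initialization values reduces to a two-flag check. The following is a hypothetical sketch (function and parameter names are not from the patent):

```c
#include <assert.h>
#include <stdbool.h>

/* A younger load may be satisfied with the known initialization
 * value (zero) only when the bulk store covering its address is
 * both complete and its region is still intact; otherwise the
 * load must read memory normally. */
static bool can_forward_zero(bool bulk_complete, bool region_intact)
{
    return bulk_complete && region_intact;
}
```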
- Each bulk store operation initializes a region of the memory to all zeroes.
- Each write transaction initializes a region of the memory that has the size of a cache line.
- Each bulk store operation initializes a region of the memory that has the size of a page.
- The computer system further comprises logic configured to create a single bulk store operation from a plurality of store instructions that are each configured to initialize a cache-line-sized region in the memory.
- A method of initializing memory in a computer system comprises receiving, at a last level cache in a hierarchy of caches in the computer system, a bulk store operation from a load store unit in a processor core in the computer system.
- The method comprises performing a bulk initialization of the memory for each bulk store operation, including sending multiple write transactions from the last level cache to the memory for each bulk store operation.
- The method comprises tracking the status of the bulk store operations.
- A computer system for initializing memory is also disclosed.
- The computer system comprises main memory, a central processing unit, a load store unit, and a hierarchy of caches comprising a last level cache.
- The load store unit comprises load store unit means for tracking the status of page store operations. Each page store operation includes a physical address in the main memory.
- The last level cache comprises means for sending multiple write transactions to the main memory for each page store operation to initialize a page of the main memory.
- The last level cache comprises last level cache means for tracking the status of the page store operations and reporting the status to the load store unit.
- FIG. 1A is a block diagram of one embodiment of a computing system that may perform bulk initialization of memory.
- FIG. 1B is a block diagram of one embodiment of a last level cache that forms multiple write transactions from one bulk store operation.
- FIG. 2 depicts one embodiment of the bulk store engine in FIG. 1B.
- FIG. 3 depicts one embodiment of a bulk store operation buffer, which may reside in the bulk store engine.
- FIG. 4 depicts one embodiment of a load store unit.
- FIG. 5 depicts a flowchart of one embodiment of a process of performing a bulk initialization of memory.
- FIG. 6 depicts a flowchart of one embodiment of a process performed at a load store unit with respect to a bulk store operation.
- FIG. 7 depicts a flowchart of one embodiment of a process of actions at the load store unit when a bulk store operation is initiated.
- FIG. 8 depicts one embodiment of a process of actions at the last level cache to track the status of a bulk store operation.
- FIG. 9 depicts a flowchart of one embodiment of a process of actions at the last level cache to initialize the memory for a bulk store operation.
- FIG. 10 depicts a flowchart of one embodiment of a process of actions at the last level cache to maintain cache coherence while processing a bulk store operation.
- FIG. 11 depicts a flowchart of one embodiment of a process of actions performed at the load store unit when a bulk store operation is completed.
- FIG. 12 depicts a flowchart of one embodiment of a process of a load store unit handling loads while, or after, a bulk store operation is pending.
- Bulk initialization of memory refers to initializing a region of memory that is larger than a cache line in size.
- A cache line is the basic unit of cache storage and may also be referred to as a cache block.
- For example, bulk initialization may be used to initialize a region that is four kilobytes in size (herein, a kilobyte is defined as 1024 bytes).
- A load store unit in a processor core sends a bulk store operation to a last level cache.
- The last level cache is configured to send multiple write transactions to the memory for each bulk store operation in order to perform a bulk initialization of the memory.
- The last level cache is configured to track the status of the bulk store operation.
- The last level cache is configured to maintain cache coherence in a hierarchy of caches when performing the bulk initialization of the memory for each bulk store operation.
- The bulk store operation may eliminate the need for numerous store transactions at the load store unit, which saves considerable time.
- The bulk store operation may eliminate the need to transfer a series of store transactions over the cache hierarchy, thereby saving considerable time.
- The bulk store operation may reduce or eliminate the need to cache data in the cache hierarchy when performing bulk initialization of the memory, thereby saving considerable time and reducing complexity.
- The load store unit has a bulk store combine buffer configured to hold the status of the bulk store operations.
- The last level cache may report the status of the bulk store operations to the load store unit.
- The load store unit blocks younger loads associated with any region of the memory associated with any pending bulk store operation.
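Blocking a younger load amounts to an address-range overlap test against each pending bulk store. A minimal sketch, with hypothetical names (none taken from the patent):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Block a younger load whose physical address falls inside the
 * half-open region [bulk_pa, bulk_pa + bulk_size) of a bulk store
 * operation that is still pending. */
static bool load_blocked(uint64_t load_pa, uint64_t bulk_pa,
                         uint64_t bulk_size, bool bulk_pending)
{
    return bulk_pending &&
           load_pa >= bulk_pa && load_pa < bulk_pa + bulk_size;
}
```

With the 4-kilobyte example used later in the text, a pending bulk store at 0x8000 would block loads to 0x8000 through 0x8FFF but not to 0x9000.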
- FIG. 1A is a block diagram of one embodiment of a computing system 100.
- The computing system 100 is configured to perform bulk initialization of memory.
- The computing system 100 includes a processor core 102, a last level cache (LLC) 104, and main memory 108.
- The main memory 108 is optional.
- The main memory 108 may be volatile memory, such as DRAM or SRAM. However, the main memory 108 is not required to be volatile memory.
- The processor core 102 contains at least one central processing unit (CPU) 110, a load store unit (LSU) 112, and an internal cache 114.
- The term internal cache 114 refers to cache that is on the same semiconductor die (chip) as the CPU 110.
- The internal cache 114 contains an L1 cache and an L2 cache.
- The internal cache 114 may include more than one level of cache.
- A computing system may use a cache to improve computing performance. For instance, a computing system may store data that it needs to access more frequently in a smaller, faster cache memory instead of storing the data in a slower, larger memory (e.g., main memory 108).
- The computing system 100 has a hierarchy of caches that are ordered in what are referred to herein as cache levels.
- The cache levels are numbered from the highest level cache to the lowest level cache. There may be two, three, four, or even more levels in the cache hierarchy.
- By convention, the highest level cache is referred to with the lowest number, and progressively lower levels receive progressively higher numbers.
- The highest level cache in the hierarchy may be referred to as the L1 cache.
- The lower cache levels may be referred to as the L2 cache, L3 cache, L4 cache, etc.
- The internal cache 114 has an L1 cache, which is a small, fast cache near the central processing unit 110.
- The lowest level cache is referred to as the last level cache (LLC) 104.
- The computing system 100 performs a bulk initialization of main memory 108.
- In a conventional approach, a processor core 102 initializes main memory 108 by sending commands to initialize cache-line-sized regions of main memory 108.
- In contrast, the processor core 102 sends a single bulk store operation to the last level cache 104 in order to initialize a large region in main memory 108.
- For example, the region could be four kilobytes in size.
- The bulk store operation may be used to initialize a region that has the size of a page.
- A page is defined as the smallest unit of data for memory management in a virtual address space. A page is typically described by a single entry in a page table.
- A page could be the equivalent of, for example, 64 cache lines.
- In that case, the processor core 102 sends one bulk store operation instead of 64 store transactions.
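The 64-transaction figure follows directly from the example sizes in the text: a 4-kilobyte page divided into 64-byte cache lines. A one-line check of that arithmetic (both sizes are examples and vary by system):

```c
#include <assert.h>

/* Number of per-cache-line store transactions replaced by a single
 * bulk store operation for a given page and line size. */
static unsigned stores_replaced(unsigned page_bytes, unsigned line_bytes)
{
    return page_bytes / line_bytes;
}
```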
- The load store unit 112 may send the bulk store operation directly to the last level cache 104, bypassing the internal cache 114 (e.g., L1 cache, L2 cache) in the processor core 102. Bypassing the internal cache 114 improves efficiency.
- The bulk store operation may be referred to as a page store operation when the region of memory that is initialized is a page in size. In general, the bulk store operation is used to initialize a region of memory that is multiple cache lines in size.
- The last level cache 104 forms multiple write transactions from one bulk store operation.
- The last level cache 104 is configured to send multiple write transactions to the main memory 108 for each bulk store operation to perform a bulk initialization of the main memory 108.
- The last level cache 104 is configured to track the status of the bulk store operations received from the load store unit 112. Further details of one embodiment of the last level cache are depicted in FIG. 1B.
- The main memory 108 could be shared between the processor core 102 depicted in FIG. 1A and other processors (not depicted in FIG. 1A). It is possible that such other processors could attempt to access a region of main memory 108 during bulk initialization of that region. Such accesses could leave the region in a state in which it is not certain that the region still contains the values to which it was initialized.
- The region being intact means that the region still contains the values to which it was initialized.
- The last level cache 104 is configured to track such possible accesses and, if necessary, change the status from intact to not intact. Such status may be reported to the load store unit 112.
- FIG. 1B depicts one embodiment of a last level cache 104 that forms multiple write transactions from one bulk store operation.
- The last level cache 104 has a bulk store engine 118 and a cache pipeline 120.
- The bulk store engine 118 and cache pipeline 120 may be implemented in hardware.
- The bulk store engine 118 comprises combinational logic and sequential logic.
- The cache pipeline 120 comprises combinational logic and sequential logic.
- To initiate a bulk initialization of main memory 108, the load store unit 112 in the processor core 102 sends a bulk store operation to the last level cache 104.
- The bulk store engine 118 generates multiple write transactions for the bulk store operation and sends the multiple write transactions to the cache pipeline 120.
- The cache pipeline 120 processes each write transaction.
- The cache pipeline maintains cache coherency in the caches in the computing system 100 while the bulk store operation is being processed.
- The write transactions may be cache-line-sized transactions.
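The expansion of one bulk store operation into cache-line-sized write transactions can be modeled as address generation over the region. This sketch assumes a 64-byte line and a 4-kilobyte region (the examples used in the text); the function name and output array stand in for handing transactions to the cache pipeline:

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE 0x40u   /* assumed 64-byte cache line     */
#define BULK_SIZE 0x1000u /* assumed 4-kilobyte bulk region */

/* Expand one bulk store operation starting at physical address
 * `pa` into per-cache-line write addresses, stored in `out`
 * (capacity at least BULK_SIZE / LINE_SIZE entries).  Returns
 * the number of write transactions formed. */
static unsigned form_write_transactions(uint64_t pa, uint64_t *out)
{
    unsigned n = 0;
    for (uint64_t a = pa; a < pa + BULK_SIZE; a += LINE_SIZE)
        out[n++] = a;
    return n;
}
```

For a bulk store at 0x8000 this yields 64 transactions covering 0x8000 through 0x8FC0, matching the 64-cache-lines-per-page example.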
- FIG. 2 depicts one embodiment of the bulk store engine 118 depicted in FIG. 1B.
- The bulk store engine 118 is configured to perform a bulk initialization of a region in main memory 108 for each bulk store operation.
- The bulk store engine 118 has a write transaction former 202 that is configured to form multiple write transactions for each bulk store operation.
- Each write transaction is for a cache-line-sized region in main memory 108.
- The write transaction former 202 sends the write transactions to the cache pipeline 120, which sends a corresponding number of write transactions to the main memory 108.
- The bulk store engine 118 has a bulk store status tracker 204 that is configured to track the status of each bulk store operation.
- The bulk store status tracker 204 keeps track of the status in a bulk store operation buffer 206.
- The write transaction former 202 and the bulk store status tracker 204 may be part of a cache controller in the last level cache 104.
- The write transaction former 202 is implemented in hardware.
- The write transaction former 202 may be implemented with sequential and/or combinational logic.
- The bulk store status tracker 204 is implemented in hardware.
- The bulk store status tracker 204 may be implemented with sequential and/or combinational logic.
- The bulk store operation buffer 206 is depicted in the bulk store engine 118.
- The bulk store operation buffer 206 may be implemented in a portion of the memory that is used for cache entries in the last level cache 104. Further details of one embodiment of the bulk store operation buffer 206 are shown in FIG. 3.
- FIG. 3 depicts one embodiment of the bulk store operation buffer 206.
- When the bulk store engine 118 receives a bulk store operation, it creates a new entry in the bulk store operation buffer 206.
- Each bulk store operation includes a physical address (PA) 304, which is the starting physical address in memory (e.g., main memory 108) to be initialized.
- In FIG. 3, there are entries for two bulk store operations. One entry has a physical address of 0x8000 (hexadecimal). The other entry has a physical address of 0x9000.
- The size of the region in memory to be initialized may be a default value. Hence, the size need not be specified in the bulk store operation.
- Alternatively, the size of the region may be specified in the bulk store operation.
- In this example, the size is 4 kilobytes (0x1000 bytes).
- One of the bulk store operations may be used to initialize physical addresses between 0x8000 and 0x8FFF (inclusive).
- The other bulk store operation may be used to initialize physical addresses between 0x9000 and 0x9FFF (inclusive).
- The bulk store engine 118 tracks the status of each bulk store operation.
- The column labeled “Progress” 306 is used to track how far along the bulk store operation has proceeded. As noted above, the bulk store engine 118 forms multiple write transactions for each bulk store operation.
- The progress column 306 is used to track how many of the write transactions have been completed. In FIG. 3, one of the bulk store operations is done, and the other has completed 53 write transactions. There may be, for example, 64 write transactions to main memory 108 for a bulk store operation.
- The bulk store engine 118 also monitors whether the region in main memory 108 associated with a bulk store transaction is affected by any other stores to main memory 108. For example, during the bulk initialization of a region in main memory 108, a portion of that region could be written to. This write might come from a processor core other than the processor core that initiated the bulk store operation.
- The column labeled “Intact” (with each entry referred to as an intact flag 308) is used to track whether the region is intact.
- The column labeled “Valid” (with each entry referred to as an LLC valid flag 302) is used to track whether the entry is still valid.
- The entry for a bulk store operation is invalidated if the intact flag 308 is set to zero. Otherwise, the entry may remain in the bulk store operation buffer 206 after the bulk store operation is complete.
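The buffer entry described above can be sketched as a small record mirroring the Valid / PA / Progress / Intact columns of FIG. 3, with the invalidation rule applied when a foreign write lands in the region. Field and function names are illustrative, not taken from the patent:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bulk store operation buffer entry. */
struct bulk_entry {
    bool     valid;    /* LLC valid flag 302               */
    uint64_t pa;       /* starting physical address 304    */
    unsigned progress; /* write transactions completed 306 */
    bool     intact;   /* intact flag 308                  */
};

/* A write from another core landed in the region: clear the
 * intact flag, which in turn invalidates the entry. */
static void on_foreign_write(struct bulk_entry *e)
{
    e->intact = false;
    e->valid  = false;
}
```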
- FIG. 4 depicts one embodiment of a load store unit 112.
- The load store unit 112 has a store queue 402, a store combine buffer 404, and a bulk store combine buffer 406.
- The store queue 402, store combine buffer 404, and bulk store combine buffer 406 may be implemented in memory in the processor core 102.
- The bulk store manager 408 is configured to maintain the bulk store combine buffer 406.
- The bulk store manager 408 may be implemented in hardware.
- A number of entries 402-1 to 402-8 are depicted in the store queue 402.
- The entries may be executed in order from entry 402-1 to entry 402-8.
- The entries correspond to the instructions in Table I. However, since instruction I4 is a load instruction, it is not represented in the store queue 402.
- The physical address in main memory at which to store a value may be derived from register R1.
- Register R1 contains a virtual address, which is converted into a physical address in main memory 108.
- Entry 402-1 holds an operation (St0) corresponding to instruction I0 in Table I; it is thus an operation to store the contents of register R8 to physical address 0x1000 in main memory 108. Entry 402-2 holds an operation (St1) corresponding to instruction I1 in Table I; it is thus an operation to store the contents of register R9 to physical address 0x1040 in main memory 108. Entry 402-3 holds an operation (St2) corresponding to instruction I2 in Table I; it is thus an operation to store the contents of register R9 to physical address 0x1080 in main memory 108.
- These three store operations may be conventional store operations, each of which may store to a region of memory equal to 64 bytes.
- The region may be larger or smaller (e.g., 32 bytes or 128 bytes).
- In this example, 64 bytes is the size of a cache line.
- The cache line may be larger or smaller (e.g., 32 bytes or 128 bytes).
- Entry 402 - 4 corresponds to instruction I3 in Table I and holds a bulk store operation (BlkSt0).
- the bulk store operation has a physical address of 0x8000.
- the bulk store operation is used to initialize a region of 0x1000 in main memory 108 .
- the bulk store operation is used to initialize a region the size of a page in main memory 108 .
- the page may be, for example, 4 kilobytes (or 1000 HEX) in size.
- instruction I3 specifies register R3, which indicates that the physical address may be obtained based on the contents of register R3.
- register R3 contains a virtual address, which is translated to a physical address in main memory 108 .
- Instruction I3 does not contain an operand for the data to be stored at the physical address, as the data may be implied by the DC ZVA PG instruction.
- the DC ZVA PG instruction implies that the contents of memory are to be zeroed out.
- the DC ZVA PG instruction could be used to imply some other pattern, such as initializing the memory to all ones.
- an operand could be provided in the DC ZVA PG instruction to, for example, provide a pattern to be written to memory.
- a second register could be specified in the DC ZVA PG instruction, wherein the contents of the second register contain a pattern to be written to memory. Note that this pattern may be repeated many times, as the size of the region to be initialized in memory is typically much larger than the register.
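- As a rough illustration of how a register-sized pattern could be repeated to fill a page-sized region, consider the following Python sketch; the 8-byte pattern width and 4-kilobyte region size are example values:

```python
def build_init_image(pattern: bytes, region_size: int = 0x1000) -> bytes:
    """Repeat a register-sized pattern across a page-sized region.

    Sketch only: the 4 KB region size and the pattern widths below are
    example values, not fixed by the instruction set.
    """
    assert region_size % len(pattern) == 0
    return pattern * (region_size // len(pattern))

# Zeroing (the implied DC ZVA PG behavior) is the all-zero pattern:
zero_page = build_init_image(b"\x00" * 8)
# A second register could instead supply an arbitrary repeated pattern:
ones_page = build_init_image(b"\xff" * 8)
```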
- instruction I4 is a load instruction, as opposed to a store instruction.
- a load queue (not depicted in FIG. 4 ) in the load store unit 112 on which a load operation for instruction I4 may be placed.
- the bulk store manager 408 blocks younger loads to regions of memory that are being initialized by bulk store operations. Hence, it is possible that the bulk store manager 408 could block instruction I4 from executing due to a pending bulk store operation to the region of memory from which instruction I4 is to load.
- Entry 402 - 5 holds an operation (St3) corresponding to I5 in Table I.
- entry 402 - 5 is an operation to store the contents of register R9 at a physical address 0x10c0 in main memory 108 .
- Entry 402 - 6 holds a bulk store operation (BlkSt1) corresponding to instruction I6 in Table I.
- the bulk store operation BlkSt1 has a physical address of 0x9000, which is determined based on adding 0x1000 to the contents of register R1 (see Table I).
- the contents of register R1 could be a virtual address, which is translated to a physical address.
- Entry 402 - 7 holds a bulk store operation (BlkSt2) corresponding to instruction I7 in Table I.
- the bulk store operation BlkSt2 has a physical address of 0xa000, which is determined based on adding 0x2000 to the contents of register R1 (see Table I). Entry 402 - 8 holds a bulk store operation (BlkSt3) corresponding to instruction I8 in Table I.
- the bulk store operation BlkSt3 has a physical address of 0x8000, which is determined based on the contents of register R1 (see Table I).
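- The address arithmetic for the bulk store operations above can be illustrated with a short sketch, assuming for illustration that register R1 holds the already-translated value 0x8000:

```python
def bulk_store_address(base_reg_value: int, offset: int = 0) -> int:
    # Address of a bulk store: base register contents plus an immediate
    # offset (any virtual-to-physical translation is assumed done already).
    return base_reg_value + offset

R1 = 0x8000  # assumed (already translated) contents of register R1
blk_st1 = bulk_store_address(R1, 0x1000)  # BlkSt1 -> 0x9000
blk_st2 = bulk_store_address(R1, 0x2000)  # BlkSt2 -> 0xa000
blk_st3 = bulk_store_address(R1)          # BlkSt3 -> 0x8000
```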
- the store combine buffer 404 is used to track store operations. As indicated by the physical addresses, entries for the first three conventional store operations (St0, St1, St2) are represented in the store combine buffer 404 .
- the store combine buffer 404 has a column that indicates whether the respective store operation resulted in a cache hit.
- the store combine buffer 404 has a column that indicates whether the entry is currently valid.
- the bulk store combine buffer 406 is used to track bulk store operations. As indicated by the physical addresses in the physical address column 424 , entries for the first three bulk store operations (BlkSt0, BlkSt1, BlkSt2) are represented in the bulk store combine buffer 406 .
- the bulk store combine buffer 406 has a column that indicates whether the respective bulk store operation is pending (referred to as a pending flag 426).
- the bulk store combine buffer 406 has a column that indicates whether the entry is currently valid (referred to as an LSU valid flag 422).
- the bulk store manager 408 is configured to maintain the bulk store combine buffer 406 .
- the bulk store manager 408 may add entries to the bulk store combine buffer 406 when a bulk store operation is initiated.
- the bulk store manager 408 may update the status (e.g., pending, valid) in response to status reports from the bulk store engine 118 in the LLC 104 . Further details of one embodiment of maintaining the bulk store combine buffer 406 are described in connection with FIG. 12 to be discussed below.
- the bulk store manager 408 blocks younger loads to any region of main memory 108 for which a bulk store operation is pending. Further details of one embodiment of blocking younger loads are described in connection with FIG. 12 to be discussed below.
- the bulk store manager 408 may be implemented in hardware. In one embodiment, the bulk store manager 408 comprises combinational logic and sequential logic.
- FIG. 5 depicts a flowchart of one embodiment of a process 500 of performing a bulk initialization of memory.
- the process 500 may be used in computer system 100 to initialize main memory 108 .
- process 500 is performed by bulk store engine 118 in LLC 104 .
- Steps 504 - 506 in process 500 are described in a certain order as a matter of convenience of explanation and do not necessarily occur in the depicted order. Thus, steps 504 - 506 could occur in a different order. Also, steps 504 - 506 may be performed concurrently.
- Step 502 includes receiving a bulk store operation at a last level cache (LLC) 104 in a computer system 100 .
- the processor core 102 sends the bulk store operation to the LLC 104 .
- the load store unit 112 sends the bulk store operation to the LLC 104 .
- the bulk store operation may bypass the other caches, such as internal cache 114 (e.g., L1 cache and L2 cache). Therefore, the other caches may be offloaded during the bulk store operation.
- Step 504 includes performing a bulk initialization of memory for the bulk store operation.
- bulk initialization of main memory 108 is performed.
- the bulk initialization results in a zeroing out of a region of the memory.
- the contents of the region of memory may be all zeros after the bulk initialization.
- a different pattern could result from the bulk initialization.
- the contents of the region of memory may be all ones after the bulk initialization.
- a different pattern could result such as alternating ones and zeroes. Further details of one embodiment of performing a bulk initialization of memory are shown and described with respect to FIG. 9 .
- Step 506 includes tracking status of the bulk store operation.
- the bulk store engine 118 updates the bulk store operation buffer 206 .
- the bulk store engine 118 may update the progress column, the intact column, and the valid column. Further details of one embodiment of tracking status of a bulk initialization operation are shown and described with respect to FIG. 8 .
- FIG. 6 depicts a flowchart of one embodiment of a process 600 performed at load store unit 112 with respect to a bulk store operation.
- the process 600 may be initiated when instructions being executed in the processor core 102 indicate that a bulk store operation is to be performed.
- Process 600 describes two ways in which a bulk store operation may be initiated.
- Step 602 a describes Option A in which the bulk store operation is obtained from a bulk store instruction in a set of instructions executed in the processor core 102 .
- Table I shows a set of instructions that contain four bulk store instructions (Instructions I3, I6, I7, and I8).
- Step 602 b describes Option B in which the bulk store operation is formed based on a number of store instructions.
- Each of these store instructions are to store the same values to memory.
- each of the store instructions may be to zero out memory.
- each of these store instructions may be to store to a different region in memory.
- the store instructions may be configured to store to a contiguous region of the memory.
- Table II depicts example store instructions from which a bulk store operation may be formed. Forming a single bulk store operation from multiple store instructions may be referred to as code morphing.
- the bulk store manager 408 is able to perform the code morphing.
- the instructions are numbered from I0 to I63 in Table II, but these are not the same instructions as in Table I.
- each store instruction is associated with a region of memory having a size of 40 HEX (or 64 bytes).
- each of the store instructions specifies the address based on the contents of register R1.
- register R1 contains a virtual address that is translated to a physical address in main memory 108 .
- the 64 store instructions are thus to write to a contiguous region of memory totaling four kilobytes. Note that the size of the region to which each instruction writes, the total size of the region that all instructions write, and the number of instructions are all for the purpose of example. However, the store instructions from which the bulk store operation is formed should write to a contiguous region of memory.
- each of the store instructions specifies the data based on the contents of register R8. This is for the purpose of illustration. In one embodiment, the data should be the same for all of the store instructions. In one embodiment, the data is not expressly provided, but is implied. For example, the second register (R8 in Table II) need not be provided in one embodiment, wherein the data is implied. The implied data could be to zero out the memory.
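- A minimal sketch of the code-morphing check might look like the following; the function name and list-of-pairs representation are assumptions, while the 64-byte line size and 4-kilobyte page size are the example values from the text:

```python
def can_morph_to_bulk_store(stores, line_size=0x40, page_size=0x1000):
    """Return the base address if a run of store instructions can be fused
    into one bulk store operation, else None.

    `stores` is a list of (address, data) pairs, one per store instruction.
    All stores must write the same data and together cover one contiguous
    page-sized region.
    """
    if len(stores) * line_size != page_size:
        return None
    base, data = stores[0]
    for i, (addr, d) in enumerate(stores):
        if addr != base + i * line_size or d != data:
            return None  # not contiguous, or the data differs
    return base

# 64 zeroing stores to consecutive 64-byte lines, as in Table II:
stores = [(0x8000 + i * 0x40, 0) for i in range(64)]
```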
- Step 604 includes calculating a physical address to be initialized in memory.
- Step 604 may include a virtual address to physical address translation.
- the addresses contained in the register(s) referenced in the instructions from which bulk store operations are formed are virtual addresses.
- the address in register R1 in the instructions in Table I may be a virtual address.
- the address in register R1 in the instructions in Table II may be a virtual address.
- Step 606 includes allocating an entry in the bulk store combine buffer 406 for the bulk store operation.
- Step 608 includes the load store unit 112 sending a bulk store operation to the last level cache 104 .
- the bulk store operation includes the physical address in main memory 108 that is to be initialized.
- the bulk store operation also includes an operand or other identifier that indicates that this is a bulk store operation.
- the load store unit 112 sends the bulk store operation directly to the last level cache 104 , bypassing all other caches in a cache hierarchy (such as internal cache 114 ). This has the benefit of offloading the other caches from processing the bulk store operation.
- Step 610 includes the load store unit 112 waiting for the bulk store operation to complete. By waiting for the bulk store operation it is meant that the load store unit 112 does not take action to initialize the main memory 108 , as that is left to the last level cache 104 .
- Step 612 is performed while waiting for the bulk store operation to complete.
- Step 612 includes blocking younger loads to the region of main memory 108 being initialized by the bulk store operation.
- a younger load means a load that, in strict accordance with the order of instructions, is to occur after the bulk store operation. Note that sometimes instructions to load from memory or store to memory may be executed out of order.
- instruction I4 is a younger load relative to instruction I3. Thus, if the bulk store operation originated from instruction I3, the load associated with instruction I4 would be blocked until the bulk store operation completes, under the assumption that the load is from a region of main memory 108 being initialized by the bulk store operation. However, instruction I4 is not a younger load with respect to instructions I6, I7 or I8. Thus, if the bulk store operation originated from any of instructions I6, I7 or I8, the load associated with instruction I4 would not be blocked. Further details of one embodiment of blocking younger loads are described below in connection with FIG. 12 .
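- The younger-load blocking rule can be sketched as a simple predicate; the dictionary representation and field names are illustrative assumptions:

```python
def load_is_blocked(load_order, load_addr, bulk_ops, region_size=0x1000):
    """Return True if the load must wait for a pending bulk store.

    A load is blocked when some bulk store operation that is older in
    program order is still pending and covers the load address.
    """
    for op in bulk_ops:
        older = op["order"] < load_order
        covers = op["addr"] <= load_addr < op["addr"] + region_size
        if older and op["pending"] and covers:
            return True
    return False

# I3 (program order 3) initializes 0x8000; I6 (order 6) initializes 0x9000:
ops = [{"order": 3, "addr": 0x8000, "pending": True},
       {"order": 6, "addr": 0x9000, "pending": True}]
```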
- step 614 is performed.
- the last level cache 104 informs the load store unit 112 when the bulk store operation is finished.
- Step 614 includes releasing/updating the entry for the bulk store operation in the bulk store combine buffer 406 .
- Releasing the entry means to remove or otherwise mark the entry so that it is no longer used.
- the entry is marked invalid to release it.
- the entry is physically deleted to release it. Updating the entry means that the entry is changed in some manner and that the information in the entry may still be used.
- the pending status is changed from pending to not pending, and the LSU valid flag 422 is kept at valid when updating the entry. A status of not pending may also be referred to as complete. Further details of one embodiment of releasing/updating the entry for the bulk store operation are described below in connection with FIG. 11 .
- FIG. 7 depicts a flowchart of one embodiment of a process 700 of actions at the load store unit 112 when a bulk store operation is initiated.
- Process 700 may be performed after a bulk store operation has been added to the store queue 402 .
- Process 700 describes further details of one embodiment of step 606 in FIG. 6 .
- Step 702 includes the load store unit 112 accessing a bulk store operation from the store queue 402 .
- the bulk store operation at entry 402 - 6 will be discussed in process 700 .
- Step 704 includes creating an entry for the bulk store operation to the bulk store combine buffer 406 .
- Step 704 also includes adding the physical address for the bulk store operation to the entry.
- Step 706 includes setting the pending flag 426 in the entry to “1”.
- Step 708 includes setting the LSU valid flag 422 in the entry to “1”. With reference to FIG. 4, the entry having physical address 0x9000 was added. The pending flag 426 for the entry is set to “1”. The LSU valid flag 422 for the entry is set to “1”.
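- Steps 702 to 708 can be sketched as follows; the dict-based buffer is an illustrative stand-in for the bulk store combine buffer 406, not the hardware design:

```python
def allocate_bulk_entry(buffer, phys_addr):
    entry = {
        "addr": phys_addr,  # step 704: record the physical address
        "pending": 1,       # step 706: the operation is now pending
        "valid": 1,         # step 708: the entry is in use
    }
    buffer.append(entry)
    return entry

lsu_buffer = []
allocate_bulk_entry(lsu_buffer, 0x9000)  # BlkSt1 from entry 402-6
```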
- FIG. 8 depicts one embodiment of a process 800 of actions at the last level cache 104 to track the status of a bulk store operation.
- Process 800 provides further details of one embodiment of step 506 in FIG. 5 .
- process 800 is performed by bulk store status tracker 204 .
- Step 802 includes the last level cache 104 receiving a bulk store operation from the load store unit 112 .
- step 802 occurs as a result of step 608 in FIG. 6 .
- the bulk store operation contains an operand (or other type of identifier) that indicates that this is a bulk store operation.
- the last level cache 104 identifies this as a bulk store operation based on the operand.
- the bulk store operation also contains a physical address in main memory 108 that is to be initialized.
- Step 804 includes the bulk store engine 118 in the last level cache 104 creating an entry for the bulk store operation in the bulk store operation buffer 206 .
- Step 804 also includes adding the physical address in the bulk store operation to the buffer entry.
- Step 806 includes setting the intact flag 308 in the entry to “1”.
- Step 808 includes setting the LLC valid flag 302 in the entry to “1”.
- the intact flag 308 for the entry is set to “1”.
- the LLC valid flag 302 for the entry is set to “1”.
- the progress field is initially set to 0 to indicate that the process of sending write transactions to the main memory 108 has not yet started.
- Step 810 includes tracking the status of the bulk store operation.
- Step 810 includes modifying the progress field as more of the memory is initialized for this bulk store operation. Further details of updating the progress field are described in connection with FIG. 9 .
- Step 810 may include modifying the intact flag 308 for the entry.
- Step 810 may include modifying the LLC valid flag 302 for the entry.
- Step 812 includes the last level cache 104 reporting the completion of the bulk store operation to the load store unit 112 .
- Step 812 also includes the last level cache 104 reporting the status of the bulk store operation to the load store unit 112 .
- the status includes the intact status.
- FIG. 9 depicts a flowchart of one embodiment of a process 900 of actions at the last level cache 104 to initialize the memory for a bulk store operation.
- Process 900 provides further details of one embodiment of step 504 in FIG. 5 .
- Step 902 includes setting an initial physical address to the address in the bulk store operation. This is a physical address in main memory 108 , in one embodiment.
- Step 904 includes forming a write transaction to write at the current physical address.
- the write transaction is a write transaction that writes one cache line.
- the write transaction is a WriteUnique transaction.
- the WriteUnique transaction is compliant with the AMBA® 5 CHI Architecture Specification, which is published by ARM Ltd. As known to those of ordinary skill in the art, there are a variety of types of WriteUnique transactions (e.g., WriteUniquePtl, WriteUniqueFull, WriteUniquePtlStash, WriteUniqueFullStash).
- Step 906 includes sending the write transaction to the main memory 108 .
- Step 906 may also include receiving a response from the main memory reporting the status of the write transaction. For the sake of discussion, it is assumed in process 900 that all write transactions complete successfully. However, if there is an error with one or more write transactions, then the process 900 could end with an error status.
- step 906 includes sending the WriteUnique transaction that was formed in step 904 to the cache pipeline 120 .
- the WriteUnique transaction may be used to remove all copies of a cache line before issuing a write transaction to main memory 108 .
- the WriteUnique transaction could result in a back snoop to the processor core 102 .
- the WriteUnique transaction could result in snoops of other processor cores, as well. After the snoops are done, the data is written to the main memory 108 .
- Step 908 includes updating the progress of the bulk store operation in the buffer 206 in the bulk store engine 118 .
- the progress field serves as a counter of the number of write transactions that have successfully completed. Thus, the progress field may be incremented by one each time a write transaction successfully completes.
- Step 910 is a determination of whether the bulk store operation is done. In other words, the bulk store engine 118 determines whether all of the write transactions have successfully completed. If not, then control passes to step 912 , wherein the physical address is incremented.
- the size of the increment is equal to the size of each write transaction, in one embodiment.
- the size of the increment is equal to the size of a cache line, in one embodiment.
- From step 912, control passes back to step 904.
- In step 904, another write transaction is formed using the current value of the physical address.
- When step 910 determines that the bulk store operation is done, control passes to step 914.
- Step 914 includes the last level cache 104 sending a completion status for the bulk store operation to the load store unit 112 .
- the completion status includes an indication of whether the bulk store operation was successful at initializing memory.
- the completion status includes the intact status for the bulk store operation entry in buffer 206 .
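- The loop of steps 902 to 914 can be sketched in Python as follows; a dict stands in for main memory 108, and the region size, line size, and zero fill value are the example values:

```python
def bulk_initialize(memory, base, region=0x1000, line=0x40, value=0):
    """Walk the region one cache-line-sized write transaction at a time.

    `memory` is a dict standing in for main memory; returns the final
    progress count (64 transactions for a 4 KB page of 64-byte lines).
    """
    progress = 0
    addr = base                    # step 902: start at the bulk store address
    while addr < base + region:    # step 910: all transactions done yet?
        for b in range(line):      # steps 904/906: one write transaction
            memory[addr + b] = value
        progress += 1              # step 908: update the progress field
        addr += line               # step 912: advance by one cache line
    return progress
```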
- FIG. 10 depicts a flowchart of one embodiment of a process 1000 of actions at the last level cache 104 to maintain cache coherence while processing a bulk store operation.
- process 1000 is performed for each of the write transactions in step 906 of process 900 .
- process 1000 provides further details for one embodiment of step 906 .
- process 1000 is performed by the cache pipeline 120 in the last level cache 104 .
- Process 1000 may be performed for each write transaction (e.g., each WriteUnique transaction) sent to the cache pipeline 120 .
- Step 1002 includes the bulk store engine 118 sending a write transaction to the cache pipeline 120 .
- this may be a WriteUnique transaction.
- the write transaction is to write to a region of memory having the size of a cache line.
- Step 1004 includes the last level cache 104 checking the tag and the snoop filter.
- the tag may be used to determine whether the last level cache 104 has a cache line associated with the address in main memory to be initialized by the write transaction.
- the snoop filter may be examined to determine whether another cache has a cache line associated with the address in main memory to be initialized by the write transaction. The snoop filter thus keeps track of coherency states of cache lines.
- Step 1006 includes the last level cache 104 performing snooping. Step 1006 may result in a back snoop to the processor core 102 that initiated the bulk store operation. Step 1006 may result in a snoop of other processor cores that share the main memory 108.
- Step 1008 includes the last level cache 104 updating the tag and the snoop filter. Hence, the last level cache is able to maintain cache coherence while processing the bulk store operation.
- Step 1010 includes updating the status for the bulk store operation, if necessary. Note that during process 1000 , other processor cores could be trying to read or write to a portion of the main memory 108 that is being initialized by the bulk store operation. In one embodiment, if any read request touches the region of main memory 108 being initialized, the intact flag 308 in the bulk store operation buffer 206 is set to 0. In one embodiment, if any snoop request touches the region of main memory 108 being initialized, the intact flag 308 is set to 0.
- Step 1012 includes the last level cache sending a write transaction to the main memory 108 .
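- The intact-flag bookkeeping of step 1010 can be sketched as follows; the entry layout is an illustrative simplification:

```python
def handle_external_access(entry, access_addr, region_size=0x1000):
    # If a read or snoop from another agent touches the region being
    # initialized, clear the intact flag (step 1010).
    if entry["addr"] <= access_addr < entry["addr"] + region_size:
        entry["intact"] = 0

entry = {"addr": 0x8000, "intact": 1}
handle_external_access(entry, 0x8200)  # an access inside the region
```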
- FIG. 11 depicts a flowchart of one embodiment of a process 1100 of actions performed at the load store unit 112 when a bulk store operation is completed.
- Step 1102 includes the load store unit 112 receiving an indication from the last level cache 104 that the bulk store operation has completed.
- Step 1104 includes the load store unit 112 checking whether an intact flag 308 in the response is set to 1 or 0.
- the last level cache 104 sets the intact flag 308 to 1 to indicate that the region of memory being initialized is still intact.
- the last level cache sets the intact flag 308 to 0 to indicate that the region of memory being initialized is no longer intact.
- Steps 1106 and 1108 are performed in response to the intact flag 308 being 1.
- the pending flag 426 in the entry for this bulk store operation in the bulk store combine buffer 406 is set to 0, which indicates that the bulk store operation is no longer pending (otherwise referred to as complete).
- Step 1108 includes keeping the LSU valid flag 422 in the entry in the bulk store combine buffer 406 at 1.
- an LSU valid flag 422 of 1, along with a pending flag 426 of 0, may be interpreted as the region in memory that was initialized still being intact after completion of the bulk store operation.
- Step 1110 is performed in response to the intact flag 308 being 0.
- the entry for this bulk store operation in the bulk store combine buffer 406 is invalidated. In one embodiment, this includes setting the LSU valid flag 422 in the entry in the bulk store combine buffer 406 to 0, which indicates that the entry is no longer valid. Other techniques may be used to invalidate the entry.
- Step 1112 includes the load store unit 112 sending a completion acknowledgment (ACK) to the bulk store engine 118 .
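- The completion handling of process 1100 can be sketched as follows; the dict entry and the string return value are illustrative assumptions:

```python
def complete_bulk_store(entry, intact):
    """Update the bulk store combine buffer entry on completion.

    If the region stayed intact, mark the operation complete but keep the
    entry valid; otherwise invalidate the entry. The "ACK" return value
    stands in for the completion acknowledgment of step 1112.
    """
    if intact:
        entry["pending"] = 0  # steps 1106/1108: complete, still valid
    else:
        entry["valid"] = 0    # step 1110: invalidate the entry
    return "ACK"
```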
- FIG. 12 depicts a flowchart of one embodiment of a process 1200 of a load store unit 112 handling loads while, or after, a bulk store operation is pending.
- Step 1202 includes the load store unit 112 accessing a load operation.
- the load operation may be accessed from a load queue in the load store unit 112 .
- the load operation may be associated with a load instruction, such as instruction I4 in Table I.
- Step 1204 includes checking the bulk store combine buffer 406 for a bulk store operation that covers the physical address in the load command.
- a first example load instruction is to load the data at 0x6040 in main memory 108 to register R3.
- a second example load instruction is to load the data at 0x8040 in main memory 108 to register R3.
- a third example load instruction is to load the data at 0x9040 in main memory 108 to register R3.
- for the first example load instruction, no bulk store operation covers physical address 0x6040, so at step 1206 control passes to step 1208 to load the data.
- the data at 0x6040 in main memory 108 may be loaded into, for example, register R3.
- the bulk store operation covers 0x8040 in main memory 108 .
- the bulk store operation with physical address 0x8000 in main memory 108 covers 0x8040 in main memory 108 (due to the 1000 HEX length of the bulk store operation).
- the bulk store operation with physical address 0x9000 in main memory 108 covers 0x9040 in main memory 108 (due to the 1000 HEX length of the bulk store operation).
- for the second and third example load instructions, control would pass to step 1210.
- Step 1210 includes a determination of whether the pending flag 426 for the bulk store operation is set. If so, control passes to step 1212 .
- the pending flag 426 is set for bulk store operation with physical address 0x9000 in main memory 108 .
- the load from 0x9040 in main memory 108 is blocked, in step 1212 .
- the load store unit 112 does not allow the third example load instruction to load the data at 0x9040 in main memory 108 into register R3. The blocking is enforced until the bulk store operation with physical address 0x9000 in main memory 108 is completed.
- Step 1214 includes a determination of whether the LSU valid flag 422 is set for the relevant entry in the bulk store combine buffer 406. If the LSU valid flag 422 is not set (step 1214 is no), then the data is loaded from the relevant address in main memory 108, in step 1216.
- If the LSU valid flag 422 is set (step 1214 is yes), then the data need not be loaded from the relevant address in main memory 108. Instead, since the initialization values are known, the known initialization values can be provided in step 1218. For example, if it is known that the memory is initialized to all zeroes, then all zeroes are provided to respond to the load operation, without the need to access main memory 108. Hence, time can be saved by avoiding a memory access. Also, it is not necessary to store the initialization values in, for example, 64 cache lines. In one embodiment, one entry in the bulk store combine buffer 406 contains information to respond to load requests in step 1218.
- the information in the entry in the bulk store combine buffer 406 may be used to respond to load instructions that request data for any portion of a large (e.g., page sized) region in memory that was initialized by a completed bulk store operation.
- cache space may be saved by not storing initialization values in, for example, 64 cache lines.
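- Steps 1214 to 1218 can be sketched as follows; the buffer and memory representations are illustrative, and zeroes are assumed as the known initialization value:

```python
def service_load(buffer, addr, memory, line=0x40, region=0x1000):
    """Answer a line-sized load without touching memory when possible.

    If a completed (not pending), still-valid bulk store entry covers the
    address, return the known initialization value (assumed zeroes here);
    otherwise fall through to the ordinary load path.
    """
    for e in buffer:
        covers = e["addr"] <= addr < e["addr"] + region
        if covers and e["valid"] and not e["pending"]:
            return bytes(line)  # step 1218: known zeroes, no memory access
    return memory[addr]         # step 1216: load from main memory

buffer = [{"addr": 0x8000, "valid": 1, "pending": 0}]
mem = {0x6040: b"\x12" * 0x40}
```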
- the load store unit means for tracking status of page store operations comprises bulk store manager. In one embodiment, the load store unit means for tracking status of page store operations is configured to perform process 700 . In one embodiment, the load store unit means for tracking status of page store operations is configured to perform process 1100 .
- means for sending multiple write transactions to the main memory for each page store operation to initialize a page of the main memory comprises one or more of bulk store engine and cache pipeline. In one embodiment, the means for sending multiple write transactions to the main memory for each page store operation to initialize a page of the main memory is configured to perform process 900 .
- means for tracking status of the page store operations and reporting the status to the load store unit comprises one or more of bulk store engine and cache pipeline. In one embodiment, the means for tracking status of the page store operations and reporting the status to the load store unit is configured to perform process 1000 .
- means for maintaining cache coherence in the hierarchy of caches when initializing the page of the main memory for each page store operation comprises one or more of bulk store engine and cache pipeline. In one embodiment, the means for maintaining cache coherence in the hierarchy of caches when initializing the page of the main memory for each page store operation is configured to perform process 1000 .
- the means for tracking page store operations that are pending, wherein each page store operation is associated with a region of the memory to be initialized comprises bulk store manager. In one embodiment, the means for tracking page store operations that are pending, wherein each page store operation is associated with a region of the memory to be initialized is configured to perform process 700 . In one embodiment, the means for tracking page store operations that are pending, wherein each page store operation is associated with a region of the memory to be initialized is configured to perform process 1100 .
- the means for blocking younger loads associated with any region of the memory associated with any pending page store operation comprises bulk store manager. In one embodiment, the means for blocking younger loads associated with any region of the memory associated with any pending page store operation is configured to perform process 1200 .
- the technology described herein can be implemented using hardware, software, or a combination of both hardware and software.
- the software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein.
- the processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media.
- computer readable media may comprise computer readable storage media and communication media.
- Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
- Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
- a computer readable medium or media does (do) not include propagated, modulated or transitory signals.
- Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
- modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
- communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
- some or all of the software can be replaced by dedicated hardware logic components.
- illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc.
- the one or more processors can be in communication with one or more computer readable media/storage devices, peripherals and/or communication interfaces.
- each process associated with the disclosed technology may be performed continuously and by one or more computing devices.
- Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.
Abstract
Description
- This application is a continuation of PCT Patent Application No. PCT/US2020/021153, entitled “BULK MEMORY INITIALIZATION”, filed Mar. 5, 2020, the entire contents of which is hereby incorporated by reference.
- The disclosure generally relates to memory initialization in a computing system.
- Data movement is becoming a greater performance bottleneck in modern processors. To perform any operation on data that resides in main memory, a central processing unit (CPU) first issues a series of commands to the main memory (e.g., DRAM modules) across an off-chip bus that is commonly referred to as a memory channel. The main memory responds by sending the data to the CPU, after which the data is placed within a cache. This process of moving data from the main memory to the CPU incurs a long latency and consumes a significant amount of energy.
- Memory initialization is a process of establishing known values in the memory. Initialization of a region of memory could occur in response to an allocation of that region to, for example, a computer program or operating system. In some cases, memory is initialized to all zeroes. Initializing main memory is generally decomposed into a series of store instructions. Each store instruction may initialize a small region of main memory. For example, each store instruction may initialize a region of main memory that is the size of a cache line. The series of store instructions may be executed in a CPU execution unit. Each store instruction may fetch a cache line into a cache, modify the cache line, and write the cache line to the main memory. In those operations, the caches are not properly leveraged if the lines brought into the caches are not reused later by the CPU.
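- The conventional decomposition described above can be illustrated with a short sketch. This is a Python model for illustration only, not the hardware described in this disclosure; the 64-byte cache line, 4-kilobyte region, and memory model are assumptions.

```python
# Illustrative sketch: conventional memory initialization decomposed into
# one store per cache line. Sizes are assumptions for illustration.
CACHE_LINE = 64          # bytes per cache line (assumed)
REGION = 4 * 1024        # 4-kilobyte region to initialize (assumed)

def conventional_init(memory: bytearray, base: int) -> int:
    """Zero a region one cache-line store at a time; return the store count."""
    stores = 0
    for offset in range(0, REGION, CACHE_LINE):
        # Each iteration models one store instruction: in a real CPU it may
        # fetch the line into a cache, modify it, and write it back.
        memory[base + offset : base + offset + CACHE_LINE] = bytes(CACHE_LINE)
        stores += 1
    return stores

mem = bytearray(b"\xff" * (8 * 1024))
count = conventional_init(mem, 0x0)
# 4096 / 64 = 64 store instructions, one per cache line
```

The point of the sketch is the instruction count: a 4-kilobyte region costs 64 separate cache-line stores under this decomposition.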
- According to one aspect of the present disclosure, there is provided a computer system for initializing memory. The computer system comprises a processor core comprising a central processing unit (CPU), a load store unit, and an internal cache. The computer system comprises a last level cache in communication with the processor core. The last level cache is configured to receive bulk store operations from the load store unit. Each bulk store operation includes a physical address in the memory to be initialized. The last level cache is configured to send multiple write transactions to the memory for each bulk store operation to perform a bulk initialization of the memory for each bulk store operation. The last level cache is configured to track status of the bulk store operations.
- Optionally, in any of the preceding aspects, the last level cache is further configured to maintain cache coherence in a hierarchy of caches in the computer system when performing the bulk initialization of the memory for each bulk store operation.
- Optionally, in any of the preceding aspects, the load store unit comprises a bulk store combine buffer, and the load store unit is configured to store status of the bulk store operations in the bulk store combine buffer.
- Optionally, in any of the preceding aspects, the load store unit is further configured to send the bulk store operations directly to the last level cache while bypassing the internal cache.
- Optionally, in any of the preceding aspects, the load store unit is further configured to track bulk store operations that are pending. Each bulk store operation is associated with a region of the memory to be initialized. The load store unit is further configured to block younger loads associated with any region of the memory associated with any pending bulk store operation.
- Optionally, in any of the preceding aspects, the load store unit is configured to either set pending status for a bulk store operation to complete or remove the bulk store operation from the bulk store combine buffer in response to the last level cache indicating that the bulk store operation is complete.
- Optionally, in any of the preceding aspects, the last level cache is further configured to store information on intact status associated with each bulk store operation. The intact status indicates whether a region of the memory initialized by a bulk store operation is intact with initialization values. The last level cache is further configured to set the intact status to not intact responsive to another processor core writing to a region of the memory associated with a bulk store operation.
- Optionally, in any of the preceding aspects, the load store unit is further configured to invalidate an entry for a first bulk store operation in the bulk store combine buffer responsive to the intact status indicating that the status is not intact. The load store unit is further configured to maintain a corresponding entry for a second bulk store operation as a valid entry in the bulk store combine buffer responsive to the intact information indicating that the status is intact.
- Optionally, in any of the preceding aspects, the load store unit is further configured to respond to a younger load instruction that loads from a region of the memory initialized by a bulk store operation that is complete by providing known initialization values if the region is still intact.
- Optionally, in any of the preceding aspects, each bulk store operation initializes a region of the memory to all zeroes.
- Optionally, in any of the preceding aspects, each write transaction initializes a region of the memory that has a size of a cache line.
- Optionally, in any of the preceding aspects, each bulk store operation initializes a region of the memory that has a size of a page.
- Optionally, in any of the preceding aspects, the computer system further comprises logic configured to create a single bulk store operation from a plurality of store instructions that each are configured to initialize a cache line sized region in the memory.
- According to one other aspect of the present disclosure, there is provided a method of initializing memory in a computer system. The method comprises receiving, at a last level cache in a hierarchy of caches in the computer system, a bulk store operation from a load store unit in a processor core in the computer system. The method comprises performing a bulk initialization of the memory for each bulk store operation, including sending multiple write transactions from the last level cache to the memory for each bulk store operation. The method comprises tracking status of the bulk store operations.
- According to still one other aspect of the present disclosure, there is provided a computer system for initializing memory. The computer system comprises main memory, a central processing unit, a load store unit, and a hierarchy of caches comprising a last level cache. The load store unit comprises load store unit means for tracking status of page store operations. Each page store operation includes a physical address in the main memory. The last level cache comprises means for sending multiple write transactions to the main memory for each page store operation to initialize a page of the main memory. The last level cache comprises last level cache means for tracking status of the page store operations and reporting the status to the load store unit.
- This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background.
- Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures, in which like references indicate like elements.
- FIG. 1A is a block diagram of one embodiment of a computing system that may perform bulk initialization of memory.
- FIG. 1B is a block diagram of one embodiment of a last level cache that forms multiple write transactions from one bulk store operation.
- FIG. 2 depicts one embodiment of the bulk store engine in FIG. 1B.
- FIG. 3 depicts one embodiment of a bulk store operation buffer, which may reside in the bulk store engine.
- FIG. 4 depicts one embodiment of a load store unit.
- FIG. 5 depicts a flowchart of one embodiment of a process of performing a bulk initialization of memory.
- FIG. 6 depicts a flowchart of one embodiment of a process performed at a load store unit with respect to a bulk store operation.
- FIG. 7 depicts a flowchart of one embodiment of a process of actions at the load store unit when a bulk store operation is initiated.
- FIG. 8 depicts one embodiment of a process of actions at the last level cache to track the status of a bulk store operation.
- FIG. 9 depicts a flowchart of one embodiment of a process of actions at the last level cache to initialize the memory for a bulk store operation.
- FIG. 10 depicts a flowchart of one embodiment of a process of actions at the last level cache to maintain cache coherence while processing a bulk store operation.
- FIG. 11 depicts a flowchart of one embodiment of a process of actions performed at the load store unit when a bulk store operation is completed.
-
FIG. 12 depicts a flowchart of one embodiment of a process of a load store unit handling loads while, or after, a bulk store operation is pending. - The present disclosure will now be described with reference to the figures, which in general relate to bulk initialization of memory in a computing system. Bulk initialization of memory, as the term is used herein, refers to initializing a region of memory that is larger than a cache line in size. A cache line is a basic unit for cache storage and may also be referred to as a cache block. As one example, bulk initialization may be used to initialize a region that is four kilobytes in size (herein, a kilobyte is defined as 1024 bytes). In one embodiment, a load store unit in a processor core sends a bulk store operation to a last level cache. The last level cache is configured to send multiple write transactions to the memory for each bulk store operation in order to perform a bulk initialization of the memory. The last level cache is configured to track status of the bulk store operation. The last level cache is configured to maintain cache coherence in a hierarchy of caches when performing the bulk initialization of the memory for each bulk store operation. The bulk store operation may eliminate the need to have numerous store transactions at the load store unit, which saves considerable time. The bulk store operation may eliminate the need to transfer a series of store transactions over the cache hierarchy, thereby saving considerable time. The bulk store operation may reduce or eliminate the need to cache data in the cache hierarchy when performing bulk initialization of the memory, thereby saving considerable time and reducing complexity.
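- The bulk store operation introduced above can be pictured as a single message from the load store unit to the last level cache. The following sketch is a hypothetical model for illustration only: the field names, the default 4-kilobyte region size, and the alignment check are assumptions, not the disclosed hardware interface.

```python
# Hypothetical model of the single bulk store operation the load store unit
# sends to the last level cache. Field names and sizes are assumptions.
from collections import namedtuple

CACHE_LINE = 64  # bytes per cache line (assumed)
BulkStoreOp = namedtuple("BulkStoreOp", ["physical_address", "region_size"])

def make_bulk_store(pa: int, region_size: int = 0x1000) -> BulkStoreOp:
    """One bulk store replaces region_size / CACHE_LINE per-line stores."""
    assert pa % region_size == 0  # assume a region-aligned start address
    return BulkStoreOp(pa, region_size)

op = make_bulk_store(0x8000)
stores_replaced = op.region_size // CACHE_LINE
```

Under these assumptions, one bulk store operation stands in for 64 conventional cache-line store transactions, which is the source of the time savings described above.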
- In one embodiment, the load store unit has a bulk store combine buffer configured to hold status of the bulk store operations. The last level cache may report status of the bulk store operations to the load store unit. In one embodiment, the load store unit blocks younger loads associated with any region of the memory associated with any pending bulk store operation.
- It is understood that the present embodiments of the disclosure may be implemented in many different forms and that the claims should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the inventive embodiment concepts to those skilled in the art. Indeed, the disclosure is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the disclosure as defined by the appended claims. Furthermore, in the following detailed description of the present embodiments of the disclosure, numerous specific details are set forth in order to provide a thorough understanding. However, it will be clear to those of ordinary skill in the art that the present embodiments of the disclosure may be practiced without such specific details.
- FIG. 1A is a block diagram of one embodiment of a computing system 100. The computing system 100 is configured to perform bulk initialization of memory. The computing system 100 includes a processor core 102, a last level cache (LLC) 104, and main memory 108. The main memory 108 is optional. In one embodiment, the main memory 108 is volatile memory, such as DRAM or SRAM. However, the main memory 108 is not required to be volatile memory.
- The processor core 102 contains at least one central processing unit (CPU) 110, a load store unit (LSU) 112, and internal cache 114. The term internal cache 114 refers to cache that is on the same semiconductor die (chip) as the CPU 110. In one embodiment, the internal cache 114 contains an L1 cache and an L2 cache. Thus, the internal cache 114 may include more than one level of caches. A computing system may use a cache to improve computing performance. For instance, a computing system may store data that it needs to access more frequently in a smaller, faster cache memory instead of storing the data in a slower, larger memory (e.g., main memory 108).
- The computing system 100 has a hierarchy of caches that are ordered in what are referred to herein as cache levels. Typically, the cache levels are numbered from the highest level cache to the lowest level cache. There may be two, three, four, or even more levels in the cache hierarchy. Herein, a convention is used in which the highest level cache receives the lowest number, with progressively lower levels receiving progressively higher numbers. For example, the highest level cache in the hierarchy may be referred to as L1 cache, and the lower cache levels may be referred to as L2 cache, L3 cache, L4 cache, etc. In one embodiment, the internal cache 114 has L1 cache, which is a small, fast cache near the central processing unit 110. The lowest level cache is referred to as a last level cache (LLC) 104.
- In one embodiment, the computing system 100 performs a bulk initialization of main memory 108. In some conventional techniques, a processor core 102 initializes main memory 108 by sending commands to initialize cache line size regions of main memory 108. In one embodiment, the processor core 102 sends a single bulk store operation to the last level cache 104 in order to initialize a large region in main memory 108. As one example, the region could be four kilobytes in size. In one embodiment, the bulk store operation is used to initialize a region that has the size of a page. Herein, a page is defined as the smallest unit of data for memory management in a virtual address space. A page is typically described by a single entry in a page table. A page could be the equivalent of, for example, 64 cache lines. Thus, the processor core 102 sends one bulk store operation instead of 64 store transactions, as one example. Moreover, the load store unit 112 may send the bulk store operation directly to the last level cache 104, bypassing the internal cache 114 (e.g., L1 cache, L2 cache) in the processor core 102. Bypassing the internal cache 114 improves efficiency. The bulk store operation may be referred to as a page store operation when the region of memory that is initialized is a page in size. In general, the bulk store operation is used to initialize a region of memory that is multiple cache lines in size.
- The last level cache 104 forms multiple write transactions from one bulk store operation. The last level cache 104 is configured to send multiple write transactions to the main memory 108 for each bulk store operation to perform a bulk initialization of the main memory 108. The last level cache 104 is configured to track status of the bulk store operations received from the load store unit 112. Further details of one embodiment of the last level cache are depicted in FIG. 1B.
- Note that the main memory 108 could be shared between the processor core 102 depicted in FIG. 1A and other processors (not depicted in FIG. 1A). It is possible that such other processors could attempt to access a region of main memory 108 during bulk initialization of that region. Such accesses could leave the region in a state such that it is not certain that the region still contains the values to which it was initialized. Herein, the region being intact means that the region still contains the values to which it was initialized. The last level cache 104 is configured to track such possible accesses and, if necessary, change the status from intact to not intact. Such status may be reported to the load store unit 112.
-
FIG. 1B depicts one embodiment of a last level cache 104 that forms multiple write transactions from one bulk store operation. The last level cache 104 has a bulk store engine 118 and a cache pipeline 120. The bulk store engine 118 and cache pipeline 120 may be implemented in hardware. In one embodiment, the bulk store engine 118 comprises combinational logic and sequential logic. In one embodiment, the cache pipeline 120 comprises combinational logic and sequential logic. In one embodiment, to initiate a bulk initialization of main memory 108, the load store unit 112 in the processor core 102 sends a bulk store operation to the last level cache 104. In one embodiment, the bulk store engine 118 generates multiple write transactions for the bulk store operation, and sends the multiple write transactions to the cache pipeline 120. The cache pipeline 120 processes each write transaction. In one embodiment, the cache pipeline maintains cache coherency in the caches in the computer system 100 while the bulk store operation is being processed. The write transactions may be cache line sized transactions.
-
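The fan-out just described can be sketched as a short model. This is an illustration under assumed parameters, not the disclosed hardware: the 64-byte cache line, the 4-kilobyte default region, and the (address, data) transaction format are assumptions.

```python
# Hypothetical sketch of the fan-out: the last level cache turns one bulk
# store operation into one cache-line-sized write transaction per line of
# the region. Sizes and the transaction tuple format are assumptions.
CACHE_LINE = 64   # bytes per cache line (assumed)
PAGE = 0x1000     # 4-kilobyte bulk store region (assumed default size)

def form_write_transactions(physical_address: int, region_size: int = PAGE):
    """Return (address, data) write transactions for one bulk store."""
    zero_line = bytes(CACHE_LINE)  # initialization values: all zeroes
    return [(physical_address + off, zero_line)
            for off in range(0, region_size, CACHE_LINE)]

txns = form_write_transactions(0x8000)
# 64 transactions covering physical addresses 0x8000 through 0x8FFF
```

A usage note: in this model the cache pipeline would consume each tuple as one cache-line write, which mirrors the "multiple write transactions per bulk store operation" behavior described above.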
FIG. 2 depicts one embodiment of the bulk store engine 118 depicted in FIG. 1B. The bulk store engine 118 is configured to perform a bulk initialization of a region in main memory 108 for each bulk store operation. The bulk store engine 118 has a write transaction former 202 that is configured to form multiple write transactions for each bulk store operation. In one embodiment, each write transaction is for a cache line sized region in main memory 108. In one embodiment, the write transaction former 202 sends the write transactions to the cache pipeline 120, which sends a corresponding number of write transactions to the main memory 108.
- The bulk store engine 118 has a bulk store status tracker 204 that is configured to track the status of each bulk store operation. The bulk store status tracker 204 keeps track of the status in a bulk store operation buffer 206. The write transaction former 202 and the bulk store status tracker 204 may be part of a cache controller in the last level cache 104. In one embodiment, the write transaction former 202 is implemented in hardware. For example, the write transaction former 202 may be implemented with sequential and/or combinational logic. In one embodiment, the bulk store status tracker 204 is implemented in hardware. For example, the bulk store status tracker 204 may be implemented with sequential and/or combinational logic. For purposes of discussion, the bulk store operation buffer 206 is depicted in the bulk store engine 118. The bulk store operation buffer 206 may be implemented in a portion of the memory that is used for cache entries in the last level cache 104. Further details of one embodiment of the bulk store operation buffer 206 are shown in FIG. 3.
-
FIG. 3 depicts one embodiment of the bulk store operation buffer 206. When the bulk store engine 118 receives a bulk store operation, the bulk store engine 118 creates a new entry in the bulk store operation buffer 206. Each bulk store operation includes a physical address (PA) 304, which is the starting physical address in memory (e.g., main memory 108) to be initialized. In FIG. 3, there are entries for two bulk store operations. One of the entries has a physical address of 0x8000 (in hexadecimal, or HEX). The other entry has a physical address of 0x9000. The size of the region in memory to be initialized may be a default value. Hence, the size need not be specified in the bulk store operation. Optionally, the size of the region may be specified in the bulk store operation. For the sake of illustration, an example will be discussed in which the size is 4 kilobytes (or 1000 HEX). Hence, one of the bulk store operations may be used to initialize physical addresses between 0x8000 and 0x8FFF (inclusive). The other bulk store operation may be used to initialize physical addresses between 0x9000 and 0x9FFF (inclusive).
- As each bulk store operation is being processed, the bulk store engine 118 tracks status of the bulk store operation. The column labeled "Progress" 306 is used to track how far along the bulk store operation has proceeded. As noted above, the bulk store engine 118 forms multiple write transactions for each bulk store operation. The progress column 306 is used to track how many of the write transactions have been completed. In FIG. 3, one of the bulk store operations is done, and the other has 53 write transactions completed. There may be, for example, 64 write transactions to main memory 108 for a bulk store operation.
- As each bulk store operation is being processed, the bulk store engine 118 also monitors whether the region in main memory 108 associated with a bulk store operation is affected by any other stores to main memory 108. For example, during the bulk initialization of a region in main memory 108, a portion of that region could be written to. This write might come from a processor core other than the processor core that initiated the bulk store operation. The column labeled "Intact" (with each entry being referred to as an intact flag 308) is used to track whether the region is intact.
- The column labeled "Valid" (with each entry being referred to as an LLC valid flag 302) is used to track whether the entry is still valid. In one embodiment, the entry for a bulk store operation is invalidated if the intact flag 308 is set to zero. Otherwise, the entry may remain in the bulk store operation buffer 206 after the bulk store operation is complete.
-
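The bookkeeping described above can be sketched as a small model. This is an illustration, not the disclosed hardware: the Python field names are assumptions, while the 64-transaction count and 4-kilobyte region follow the example of FIG. 3.

```python
# Minimal sketch of the bulk store operation buffer bookkeeping: each entry
# records the physical address (PA 304), progress (306), intact flag (308),
# and LLC valid flag (302). Field names and sizes are assumptions.
from dataclasses import dataclass

TOTAL_WRITES = 64  # write transactions per bulk store (example value)

@dataclass
class BulkStoreEntry:
    pa: int              # starting physical address of the region
    progress: int = 0    # completed write transactions
    intact: bool = True  # region still holds its initialization values
    valid: bool = True   # entry is live in the buffer

    def complete_write(self) -> None:
        self.progress = min(self.progress + 1, TOTAL_WRITES)

    def snoop_store(self, addr: int, region_size: int = 0x1000) -> None:
        # Another core stored into the region: mark it not intact and
        # invalidate the entry, per the rule described above.
        if self.pa <= addr < self.pa + region_size:
            self.intact = False
            self.valid = False

entry = BulkStoreEntry(pa=0x8000)
for _ in range(TOTAL_WRITES):
    entry.complete_write()
entry.snoop_store(0x8040)  # a foreign store lands inside 0x8000-0x8FFF
```

In this model, an entry whose region is never disturbed stays valid after completion, matching the "otherwise, the entry may remain" behavior described above.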
FIG. 4 depicts one embodiment of a load store unit 112. The load store unit 112 has a store queue 402, a store combine buffer 404, and a bulk store combine buffer 406. The store queue 402, store combine buffer 404, and bulk store combine buffer 406 may be implemented in memory in the processor core 102. The bulk store manager 408 is configured to maintain the bulk store combine buffer 406. The bulk store manager 408 may be implemented in hardware.
- A number of entries 402-1 to 402-8 are depicted on the store queue 402. The entries may be executed in an order from entry 402-1 to entry 402-8. The entries correspond to the instructions in Table I. However, since instruction I4 is a load instruction, it is not represented on the store queue 402. In the store instructions in Table I, the physical address in main memory at which to store some value may be derived from register R1. In some cases, register R1 contains a virtual address, which is converted into a physical address in main memory 108.
-
TABLE I
I0: STR [R1], R8
I1: STR [R1 + 0x40], R9
I2: STR [R1 + 0x80], R9
I3: DC ZVA PG [R1]
I4: LDR R3, [R1 + 0x40]
I5: STR [R1 + 0xc0], R9
I6: DC ZVA PG [R1 + 0x1000]
I7: DC ZVA PG [R1 + 0x2000]
I8: DC ZVA PG [R1]
-
main memory 108. Entry 402-2 holds an operation (St1) corresponding to instruction I1 in Table I. Entry 402-2 is thus an operation to store the contents of register R9 to physical address 0x1040 inmain memory 108. Entry 402-3 holds an operation (St2) corresponding to instruction I2 in Table I. Entry 402-3 is thus an operation to store the contents of register R9 to physical address 0x1080 inmain memory 108. These three store operations (St0, St1, St2) may be conventional store operations, which may each store to a region of memory equal to 64 bytes. The region may be larger or smaller (e.g., 32 bytes or 128 bytes). In one embodiment, 64 bytes is the size of a cache line. The cache line may be larger or smaller (e.g., 32 bytes or 128 bytes). - Entry 402-4 corresponds to instruction I3 in Table I and holds a bulk store operation (BlkSt0). The bulk store operation has a physical address of 0x8000. In one embodiment, the bulk store operation is used to initialize a region of 0x1000 in
main memory 108. In one embodiment, the bulk store operation is used to initialize a region the size of a page inmain memory 108. The page may be, for example, 4 kilobytes (or 1000 HEX) in size. Note that in Table I, instruction I3 specifies register R3, which indicates that the physical address may be obtained based on the contents of register R3. In some embodiments, register R3 contains a virtual address, which is translated to a physical address inmain memory 108. Instruction I3 does not contain an operand for the data to be stored at the physical address, as the data may be implied by the DC ZVA PG instruction. In one embodiment, the DC ZVA PG instruction implies that the contents of memory are to be zeroed out. However, the DC ZVA PG instruction could be used to imply some other pattern, such as initializing the memory to all ones. Optionally, an operand could be provided in the DC ZVA PG instruction to, for example, provide a pattern to be written to memory. For example, a second register could be specified in the DC ZVA PG instruction, wherein the contents of the second register contain a pattern to be written to memory. Note that this pattern may be repeated many times, as the size of the region to be initialized in memory is typically much larger than the register. - There is not an entry on the
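The optional pattern operand just discussed can be illustrated with a short sketch. This is a hypothetical model, not the semantics of any real DC ZVA variant: it simply repeats a register-sized pattern across a region that is much larger than the register. The 8-byte register width and 4-kilobyte region are assumptions.

```python
# Hypothetical sketch: a register-sized pattern repeated across the region,
# as the optional second-register operand described above would imply.
REGISTER_BYTES = 8  # assumed register width
REGION = 0x1000     # assumed 4-kilobyte region

def fill_with_pattern(pattern: bytes, region_size: int = REGION) -> bytes:
    """Repeat a register-sized pattern to cover the whole region."""
    assert len(pattern) == REGISTER_BYTES
    repeats = region_size // REGISTER_BYTES
    return pattern * repeats

zeros = fill_with_pattern(bytes(REGISTER_BYTES))   # default: all zeroes
ones = fill_with_pattern(b"\xff" * REGISTER_BYTES) # alternative: all ones
```

The 512 repetitions of the 8-byte pattern over a 4-kilobyte region make concrete why the pattern "may be repeated many times."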
store queue 402 for instruction I4, as instruction I4 is a load instruction, as opposed to a store instruction. However, there may be a load queue (not depicted inFIG. 4 ) in theload store unit 112 on which a load operation for instruction I4 may be placed. In one embodiment, thebulk store manager 408 blocks younger loads to regions of memory that are being initialized by bulk store operations. Hence, it is possible that thebulk store manager 408 could block instruction I4 from executing due a pending bulk store operation to the region of memory from which instruction I4 is to load. - Entry 402-5 holds an operation (St3) corresponding to I5 in Table I. Thus, entry 402-5 is an operation to store the contents of register R9 at a physical address 0x10c0 in
main memory 108. Entry 402-6 holds a bulk store operation (BlkSt1) corresponding to instruction I6 in Table I. The bulk store operation BlkSt1 has a physical address of 0x9000, which is determined based on adding 0x1000 to the contents of register R1 (see Table I). As noted above, the contents of register R1 could be a virtual address, which is translated to a physical address. Entry 402-7 holds a bulk store operation (BlkSt2) corresponding to instruction I7 in Table I. The bulk store operation BlkSt2 has a physical address of 0xa000, which is determined based on adding 0x2000 to the contents of register R1 (see Table I). Entry 402-8 holds a bulk store operation (BlkSt3) corresponding to instruction I8 in Table I. The bulk store operation BlkSt3 has a physical address of 0x8000, which is determined based on the contents of register R1 (see Table I).
- The store combine
buffer 404 is used to track store operations. As indicated by the physical addresses, entries for the first three conventional store operations (St0, St1, St2) are represented in the store combine buffer 404. The store combine buffer 404 has a column that indicates whether the respective store operation resulted in a cache hit. The store combine buffer 404 has a column that indicates whether the entry is currently valid.
- The bulk store combine buffer 406 is used to track bulk store operations. As indicated by the physical addresses in the physical address column 424, entries for the first three bulk store operations (BlkSt0, BlkSt1, BlkSt2) are represented in the bulk store combine buffer 406. The bulk store combine buffer 406 has a column that indicates whether the respective bulk store operation is pending (referred to as a pending flag 426). The bulk store combine buffer 406 has a column that indicates whether the entry is currently valid (referred to as an LSU valid flag 422).
- The
bulk store manager 408 is configured to maintain the bulk store combine buffer 406. The bulk store manager 408 may add entries to the bulk store combine buffer 406 when a bulk store operation is initiated. The bulk store manager 408 may update the status (e.g., pending, valid) in response to status reports from the bulk store engine 118 in the LLC 104. Further details of one embodiment of maintaining the bulk store combine buffer 406 are described in connection with FIG. 12, to be discussed below. In one embodiment, the bulk store manager 408 blocks younger loads to any region of main memory 108 for which a bulk store operation is pending. Further details of one embodiment of blocking younger loads are described in connection with FIG. 12, to be discussed below. The bulk store manager 408 may be implemented in hardware. In one embodiment, the bulk store manager 408 comprises combinational logic and sequential logic.
-
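The younger-load check just described can be sketched as a simple predicate. This is an illustrative model only: the dictionary entry format and the fixed page-sized region are assumptions, not the disclosed bulk store combine buffer layout.

```python
# Illustrative sketch of the younger-load check: a load is blocked while its
# address falls in the region of any valid, still-pending bulk store entry.
PAGE = 0x1000  # assumed region size per bulk store

def load_must_block(load_addr: int, bulk_store_combine_buffer) -> bool:
    """Return True if a younger load overlaps a pending bulk store region."""
    for entry in bulk_store_combine_buffer:
        if not (entry["valid"] and entry["pending"]):
            continue
        if entry["pa"] <= load_addr < entry["pa"] + PAGE:
            return True
    return False

combine_buffer = [
    {"pa": 0x8000, "pending": True, "valid": True},
    {"pa": 0x9000, "pending": False, "valid": True},  # already complete
]
# A load from 0x8040 is blocked; loads from 0x9040 or 0x7000 may proceed.
```

The design intent modeled here is ordering: a load younger than a pending bulk store to the same region must not read the region before its initialization values are in place.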
FIG. 5 depicts a flowchart of one embodiment of a process 500 of performing a bulk initialization of memory. The process 500 may be used in computer system 100 to initialize main memory 108. In one embodiment, process 500 is performed by the bulk store engine 118 in the LLC 104. Reference will be made to elements in FIG. 1A when discussing process 500; however, process 500 is not limited to FIG. 1A. Steps 504-506 in process 500 are described in a certain order as a matter of convenience of explanation and do not necessarily occur in the depicted order. Thus, steps 504-506 could occur in a different order. Also, steps 504-506 may be performed concurrently.
- Step 502 includes receiving a bulk store operation at a last level cache (LLC) 104 in a computer system 100. In one embodiment, the processor core 102 sends the bulk store operation to the LLC 104. In one embodiment, the load store unit 112 sends the bulk store operation to the LLC 104. The bulk store operation may bypass the other caches, such as internal cache 114 (e.g., L1 cache and L2 cache). Therefore, the other caches may be offloaded during the bulk store operation.
- Step 504 includes performing a bulk initialization of memory for the bulk store operation. In one embodiment, bulk initialization of
main memory 108 is performed. In one embodiment, the bulk initialization results in a zeroing out of a region of the memory. In other words, the contents of the region of memory may be all zeros after the bulk initialization. However, a different pattern could result from the bulk initialization. For example, the contents of the region of memory may be all ones after the bulk initialization. Another pattern, such as alternating ones and zeroes, could also result. Further details of one embodiment of performing a bulk initialization of memory are shown and described with respect to FIG. 9. - Step 506 includes tracking status of the bulk store operation. In one embodiment, the
bulk store engine 118 updates the bulk store operation buffer 206. For example, the bulk store engine 118 may update the progress column, the intact column, and the valid column. Further details of one embodiment of tracking status of a bulk initialization operation are shown and described with respect to FIG. 8. -
FIG. 6 depicts a flowchart of one embodiment of a process 600 performed at load store unit 112 with respect to a bulk store operation. The process 600 may be initiated when instructions being executed in the processor core 102 indicate that a bulk store operation is to be performed. -
Process 600 describes two ways in which a bulk store operation may be initiated. Step 602 a describes Option A in which the bulk store operation is obtained from a bulk store instruction in a set of instructions executed in the processor core 102. Table I shows a set of instructions that contain four bulk store instructions (Instructions I3, I6, I7, and I8). - Step 602 b describes Option B in which the bulk store operation is formed based on a number of store instructions. Each of these store instructions is to store the same values to memory. For example, each of the store instructions may be to zero out memory. However, each of these store instructions may be to store to a different region in memory. Collectively, the store instructions may be configured to store to a contiguous region of the memory. Table II depicts example store instructions from which a bulk store operation may be formed. Forming a single bulk store operation from multiple store instructions may be referred to as code morphing. In one embodiment, the
bulk store manager 408 is able to perform the code morphing. For convenience of explanation the instructions are numbered from I0 to I63 in Table II, but these are not the same instructions as in Table I. -
TABLE II
I0: STR [R1], R8
I1: STR [R1 + 0x040], R8
I2: STR [R1 + 0x080], R8
. . .
I63: STR [R1 + 0xFC0], R8
- In Table II, each store instruction is associated with a region of memory having a size of 40 HEX (or 64 bytes). In Table II, each of the store instructions specifies the address based on the contents of register R1. In one embodiment, register R1 contains a virtual address that is translated to a physical address in
main memory 108. The 64 store instructions are thus to write to a contiguous region of memory totaling four kilobytes. Note that the size of the region to which each instruction writes, the total size of the region that all instructions write, and the number of instructions are all for the purpose of example. However, the store instructions from which the bulk store operation is formed should write to a contiguous region of memory. - In Table II, each of the store instructions specifies the data based on the contents of register R8. This is for the purpose of illustration. In one embodiment, the data should be the same for all of the store instructions. In one embodiment, the data is not expressly provided, but is implied. For example, the second register (R8 in Table II) need not be provided in one embodiment, wherein the data is implied. The implied data could be to zero out the memory.
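The code morphing of Option B can be illustrated with a short software sketch. This is only an illustration, not the patent's hardware logic: the function name and the representation of the stores are hypothetical, and the 64-byte/4 KB sizes are simply the example sizes from Table II.

```python
CACHE_LINE = 0x40  # 64-byte region written by each store, as in Table II

def morph_to_bulk_store(stores):
    """Return (base_address, total_size, value) if the given (address, value)
    stores write one value over a contiguous region, else None."""
    stores = sorted(stores)
    base, value = stores[0]
    for i, (addr, val) in enumerate(stores):
        # Every store must write the same data to the next cache line in turn.
        if val != value or addr != base + i * CACHE_LINE:
            return None
    return (base, len(stores) * CACHE_LINE, value)

# The 64 stores of Table II: R1 holds the base address, R8 holds zero.
table_ii = [(0x8000 + i * CACHE_LINE, 0) for i in range(64)]
bulk = morph_to_bulk_store(table_ii)  # one 4 KB bulk store at the base address
```

A gap or a differing value anywhere in the run makes the stores ineligible for morphing, matching the requirement that the stores write the same data to a contiguous region of memory.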
- Step 604 includes calculating a physical address to be initialized in memory. Step 604 may include a virtual address to physical address translation. In one embodiment, the addresses contained in the register(s) referenced in the instructions from which bulk store operations are formed are virtual addresses. For example, the address in register R1 in the instructions in Table I may be a virtual address. Likewise, the address in register R1 in the instructions in Table II may be a virtual address.
- Step 606 includes allocating an entry in the bulk store combine
buffer 406 for the bulk store operation. - Step 608 includes the
load store unit 112 sending a bulk store operation to the last level cache 104. The bulk store operation includes the physical address in main memory 108 that is to be initialized. The bulk store operation also includes an operand or other identifier that indicates that this is a bulk store operation. In one embodiment, the load store unit 112 sends the bulk store operation directly to the last level cache 104, bypassing all other caches in a cache hierarchy (such as internal cache 114). This has the benefit of offloading the other caches from processing the bulk store operation. - Step 610 includes the
load store unit 112 waiting for the bulk store operation to complete. By waiting for the bulk store operation it is meant that the load store unit 112 does not take action to initialize the main memory 108, as that is left to the last level cache 104. - Step 612 is performed while waiting for the bulk store operation to complete. Step 612 includes blocking younger loads to the region of
main memory 108 being initialized by the bulk store operation. A younger load means a load that, in strict accordance with the order of instructions, is to occur after the bulk store operation. Note that sometimes instructions to load from memory or store to memory may be executed out of order. With respect to Table I, instruction I4 is a younger load relative to instruction I3. Thus, if the bulk store operation originated from instruction I3, the load associated with instruction I4 would be blocked until the bulk store operation completes, under the assumption that the load is from a region of main memory 108 being initialized by the bulk store operation. However, instruction I4 is not a younger load with respect to instructions I6, I7 or I8. Thus, if the bulk store operation originated from any of instructions I6, I7 or I8, the load associated with instruction I4 would not be blocked. Further details of one embodiment of blocking younger loads are described below in connection with FIG. 12. - After the bulk store operation is finished,
step 614 is performed. In one embodiment, the last level cache 104 informs the load store unit 112 when the bulk store operation is finished. Step 614 includes releasing/updating the entry for the bulk store operation in the bulk store combine buffer 406. Releasing the entry means to remove or otherwise mark the entry so that it is no longer used. In one embodiment, the entry is marked invalid to release it. In one embodiment, the entry is physically deleted to release it. Updating the entry means that the entry is changed in some manner and that the information in the entry may still be used. In one embodiment, the pending status is changed from pending to not pending, and the LSU valid flag 422 is kept at valid when updating the entry. A status of not pending may also be referred to as complete. Further details of one embodiment of releasing/updating the entry for the bulk store operation are described below in connection with FIG. 11. -
FIG. 7 depicts a flowchart of one embodiment of a process 700 of actions at the load store unit 112 when a bulk store operation is initiated. Process 700 may be performed after a bulk store operation has been added to the store queue 402. Process 700 describes further details of one embodiment of step 606 in FIG. 6. - Step 702 includes the
load store unit 112 accessing a bulk store operation from the store queue 402. For the sake of illustration, the bulk store operation at entry 402-6 will be discussed in process 700. - Step 704 includes creating an entry for the bulk store operation in the bulk store combine
buffer 406. Step 704 also includes adding the physical address for the bulk store operation to the entry. Step 706 includes setting the pending flag 426 in the entry to “1”. Step 708 includes setting the LSU valid flag 422 in the entry to “1”. With reference to FIG. 4, the entry having physical address 0x9000 was added. The pending flag 426 for the entry is set to “1”. The LSU valid flag 422 for the entry is set to “1”. -
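As a software sketch only (dictionary fields standing in for the hardware fields of FIG. 4), steps 704-708 amount to the following:

```python
def allocate_combine_buffer_entry(combine_buffer, phys_addr):
    """Steps 704-708: record the bulk store's physical address in the bulk
    store combine buffer with the pending flag 426 and LSU valid flag 422
    both set to 1."""
    entry = {"phys_addr": phys_addr, "pending": 1, "lsu_valid": 1}
    combine_buffer.append(entry)
    return entry

combine_buffer = []
allocate_combine_buffer_entry(combine_buffer, 0x9000)  # the FIG. 4 example entry
```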
FIG. 8 depicts one embodiment of a process 800 of actions at the last level cache 104 to track the status of a bulk store operation. Process 800 provides further details of one embodiment of step 506 in FIG. 5. In one embodiment, process 800 is performed by bulk store status tracker 204. - Step 802 includes the
last level cache 104 receiving a bulk store operation from the load store unit 112. In one embodiment, step 802 occurs as a result of step 608 in FIG. 6. The bulk store operation contains an operand (or other type of identifier) that indicates that this is a bulk store operation. In an embodiment, the last level cache 104 identifies this as a bulk store operation based on the operand. In an embodiment, the bulk store operation also contains a physical address in main memory 108 that is to be initialized. - Step 804 includes the
bulk store engine 118 in the last level cache 104 creating an entry for the bulk store operation in the bulk store operation buffer 206. Step 804 also includes adding the physical address in the bulk store operation to the buffer entry. - Step 806 includes setting the
intact flag 308 in the entry to “1”. Step 808 includes setting the LLC valid flag 302 in the entry to “1”. With reference to FIG. 4, the entry having physical address 0x9000 was added, as one example. The pending flag for the entry is set to “1”. The LLC valid flag 302 for the entry is set to “1”. The progress field is initially set to 0 to indicate that the process of sending write transactions to the main memory 108 has not yet started. - Step 810 includes tracking the status of the bulk store operation. Step 810 includes modifying the progress field as more of the memory is initialized for this bulk store operation. Further details of updating the progress field are described in connection with
FIG. 9. Step 810 may include modifying the intact flag 308 for the entry. Step 810 may include modifying the LLC valid flag 302 for the entry. - Step 812 includes the
last level cache 104 reporting the completion of the bulk store operation to the load store unit 112. Step 812 also includes the last level cache 104 reporting the status of the bulk store operation to the load store unit 112. In one embodiment, the status includes the intact status. -
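A minimal software sketch of process 800's bookkeeping follows; the field names are hypothetical, and the real bulk store operation buffer 206 is a hardware structure rather than a Python list:

```python
def create_llc_entry(op_buffer, phys_addr):
    """Steps 804-808: enter a received bulk store operation into the bulk
    store operation buffer 206 with the intact flag 308 and LLC valid flag
    302 set, and progress at zero (no write transactions sent yet)."""
    entry = {"phys_addr": phys_addr, "intact": 1, "llc_valid": 1, "progress": 0}
    op_buffer.append(entry)
    return entry

def report_status(entry):
    """Step 812: report completion together with the intact status."""
    return {"complete": True, "intact": entry["intact"]}

op_buffer = []
entry = create_llc_entry(op_buffer, 0x9000)
```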
FIG. 9 depicts a flowchart of one embodiment of a process 900 of actions at the last level cache 104 to initialize the memory for a bulk store operation. Process 900 provides further details of one embodiment of step 504 in FIG. 5. - Step 902 includes setting an initial physical address to the address in the bulk store operation. This is a physical address in
main memory 108, in one embodiment. - Step 904 includes forming a write transaction to write at the current physical address. In one embodiment, the write transaction writes one cache line. In one embodiment, the write transaction is a WriteUnique transaction. In one embodiment, the WriteUnique transaction is compliant with the AMBA® 5 CHI Architecture Specification, which is published by ARM Ltd. As known to those of ordinary skill in the art, there are a variety of types of WriteUnique transactions (e.g., WriteUniquePtl, WriteUniqueFull, WriteUniquePtlStash, WriteUniqueFullStash).
- Step 906 includes sending the write transaction to the
main memory 108. Step 906 may also include receiving a response from the main memory reporting the status of the write transaction. For the sake of discussion, it is assumed in process 900 that all write transactions complete successfully. However, if there is an error with one or more write transactions, then the process 900 could end with an error status. - In one embodiment,
step 906 includes sending the WriteUnique transaction that was formed in step 904 to the cache pipeline 120. The WriteUnique transaction may be used to remove all copies of a cache line before issuing a write transaction to main memory 108. The WriteUnique transaction could result in a back snoop to the processor core 102. The WriteUnique transaction could result in snoops of other processor cores, as well. After the snoops are done, the data is written to the main memory 108. - Step 908 includes updating the progress of the bulk store operation in the
buffer 206 in the bulk store engine 118. In one embodiment, the progress field serves as a counter of the number of write transactions that have successfully completed. Thus, the progress field may be incremented by one each time a write transaction successfully completes. - Step 910 is a determination of whether the bulk store operation is done. In other words, the
bulk store engine 118 determines whether all of the write transactions have successfully completed. If not, then control passes to step 912, wherein the physical address is incremented. The size of the increment is equal to the size of each write transaction, in one embodiment. The size of the increment is equal to the size of a cache line, in one embodiment. - After
step 912, control passes to step 904. In step 904 another write transaction is formed using the current value of the physical address. When all write transactions successfully complete (step 910 is yes), control passes to step 914. Step 914 includes the last level cache 104 sending a completion status for the bulk store operation to the load store unit 112. In one embodiment, the completion status includes an indication of whether the bulk store operation was successful at initializing memory. In one embodiment, the completion status includes the intact status for the bulk store operation entry in buffer 206. -
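The loop of steps 902-914 can be sketched as follows. The 64-byte cache line and 4 KB region are only the example sizes used in this description, and `send_write_transaction` is a hypothetical stand-in for issuing a WriteUnique transaction toward main memory 108:

```python
CACHE_LINE = 64       # size of each write transaction (one cache line)
REGION_SIZE = 0x1000  # example 4 KB region, as in Table II

def bulk_initialize(entry, send_write_transaction):
    """Steps 902-914: walk the region one cache line at a time, counting
    successful write transactions in the progress field, then return the
    completion status including the intact status."""
    addr = entry["phys_addr"]          # step 902: initial physical address
    end = addr + REGION_SIZE
    while addr < end:                  # step 910: all transactions done yet?
        send_write_transaction(addr)   # steps 904-906: form and send one write
        entry["progress"] += 1         # step 908: update progress
        addr += CACHE_LINE             # step 912: advance by one cache line
    return {"complete": True, "intact": entry["intact"]}  # step 914

sent = []
entry = {"phys_addr": 0x9000, "progress": 0, "intact": 1}
status = bulk_initialize(entry, sent.append)  # 4 KB / 64 B = 64 transactions
```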
FIG. 10 depicts a flowchart of one embodiment of a process 1000 of actions at the last level cache 104 to maintain cache coherence while processing a bulk store operation. In one embodiment, process 1000 is performed for each of the write transactions in step 906 of process 900. Thus, process 1000 provides further details for one embodiment of step 906. In one embodiment, process 1000 is performed by the cache pipeline 120 in the last level cache 104. Process 1000 may be performed for each write transaction (e.g., each WriteUnique transaction) sent to the cache pipeline 120. -
Step 1002 includes the bulk store engine 118 sending a write transaction to the cache pipeline 120. As noted above, this may be a WriteUnique transaction. In one embodiment, the write transaction is to write to a region of memory having the size of a cache line. -
Step 1004 includes the last level cache 104 checking the tag and the snoop filter. The tag may be used to determine whether the last level cache 104 has a cache line associated with the address in main memory to be initialized by the write transaction. The snoop filter may be examined to determine whether another cache has a cache line associated with the address in main memory to be initialized by the write transaction. The snoop filter thus keeps track of coherency states of cache lines. -
Step 1006 includes the last level cache 104 snooping. Step 1006 may result in a back snoop to the processor core 102 that initiated the bulk store operation. Step 1006 may result in a snoop of other processor cores that share the main memory 108. -
Step 1008 includes the last level cache 104 updating the tag and the snoop filter. Hence, the last level cache is able to maintain cache coherence while processing the bulk store operation. - Step 1010 includes updating the status for the bulk store operation, if necessary. Note that during
process 1000, other processor cores could be trying to read or write to a portion of the main memory 108 that is being initialized by the bulk store operation. In one embodiment, if any read request touches the region of main memory 108 being initialized, the intact flag 308 in the bulk store operation buffer 206 is set to 0. In one embodiment, if any snoop request touches the region of main memory 108 being initialized, the intact flag 308 is set to 0. -
Step 1012 includes the last level cache sending a write transaction to the main memory 108. -
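Step 1010's status update can be sketched as below. The region size is again only the 1000 HEX example length, and the dictionary entries stand in for the hardware buffer 206:

```python
REGION_SIZE = 0x1000  # example 1000 HEX bulk store length

def on_read_or_snoop(op_buffer, addr):
    """Step 1010: a read or snoop request that touches a region being
    initialized clears that entry's intact flag 308."""
    for entry in op_buffer:
        if entry["phys_addr"] <= addr < entry["phys_addr"] + REGION_SIZE:
            entry["intact"] = 0

op_buffer = [{"phys_addr": 0x9000, "intact": 1}]
on_read_or_snoop(op_buffer, 0x9040)  # inside the region: intact drops to 0
on_read_or_snoop(op_buffer, 0x6000)  # outside the region: no effect
```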
FIG. 11 depicts a flowchart of one embodiment of a process 1100 of actions performed at the load store unit 112 when a bulk store operation is completed. -
Step 1102 includes the load store unit 112 receiving an indication from the last level cache 104 that the bulk store operation has completed. -
Step 1104 includes the load store unit 112 checking whether an intact flag 308 in the response is set to 1 or 0. The last level cache 104 sets the intact flag 308 to 1 to indicate that the region of memory being initialized is still intact. The last level cache sets the intact flag 308 to 0 to indicate that the region of memory being initialized is no longer intact. -
Steps 1106 and 1108 are performed in response to the intact flag 308 being 1. In step 1106, the pending flag 426 in the entry for this bulk store operation in the bulk store combine buffer 406 is set to 0, which indicates that the bulk store operation is no longer pending (otherwise referred to as complete). Step 1108 includes keeping the LSU valid flag 422 in the entry in the bulk store combine buffer 406 at 1. An LSU valid flag 422 of 1, along with a pending flag 426 of 0, may be interpreted as the region in memory that was initialized still being intact after completion of the bulk store operation. -
Step 1110 is performed in response to the intact flag 308 being 0. In step 1110, the entry for this bulk store operation in the bulk store combine buffer 406 is invalidated. In one embodiment, this includes setting the LSU valid flag 422 in the entry in the bulk store combine buffer 406 to 0, which indicates that the entry is no longer valid. Other techniques may be used to invalidate the entry. - After either
steps 1106 and 1108 or step 1110 are performed, control passes to step 1112. Step 1112 includes the load store unit 112 sending a completion acknowledgment (ACK) to the bulk store engine 118. -
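Process 1100's two outcomes reduce to the following sketch, with dictionary fields standing in for the hardware flags:

```python
def on_bulk_store_complete(entry, intact):
    """Steps 1104-1112: keep a completed, intact bulk store's combine-buffer
    entry valid but no longer pending, or invalidate the entry if the region
    is no longer intact, then acknowledge completion."""
    if intact:                  # steps 1106 and 1108
        entry["pending"] = 0    # complete: the region's contents are known
        entry["lsu_valid"] = 1  # entry stays usable for later loads
    else:                       # step 1110
        entry["lsu_valid"] = 0  # entry is no longer valid
    return "ACK"                # step 1112: acknowledge to the bulk store engine

entry = {"phys_addr": 0x9000, "pending": 1, "lsu_valid": 1}
ack = on_bulk_store_complete(entry, intact=1)
```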
FIG. 12 depicts a flowchart of one embodiment of a process 1200 of a load store unit 112 handling loads while, or after, a bulk store operation is pending. -
Step 1202 includes the load store unit 112 accessing a load operation. The load operation may be accessed from a load queue in the load store unit 112. The load operation may be associated with a load instruction, such as instruction I4 in Table I. -
Step 1204 includes checking the bulk store combine buffer 406 for a bulk store operation that covers the physical address in the load command. The following examples will be used for illustration. A first example load instruction is to load the data at 0x6040 in main memory 108 to register R3. A second example load instruction is to load the data at 0x8040 in main memory 108 to register R3. A third example load instruction is to load the data at 0x9040 in main memory 108 to register R3. - With reference to the values depicted in the bulk store combine
buffer 406 in FIG. 4, there is not a bulk store operation that covers 0x6040 in main memory 108. Therefore, for the first load instruction, step 1206 is no, and control passes to step 1208 to load the data for that first example instruction. Hence, the data at 0x6040 in main memory 108 may be loaded into, for example, register R3. - For the second example load instruction, there is a bulk store operation that covers 0x8040 in
main memory 108. Specifically, the bulk store operation with physical address 0x8000 in main memory 108 covers 0x8040 in main memory 108 (due to the 1000 HEX length of the bulk store operation). For the third example load instruction, there is a bulk store operation that covers 0x9040 in main memory 108. Specifically, the bulk store operation with physical address 0x9000 in main memory 108 covers 0x9040 in main memory 108 (due to the 1000 HEX length of the bulk store operation). Hence, for example instructions two and three, control would pass to step 1210. -
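Combining the combine-buffer lookup of steps 1204-1206 with the pending and valid checks of steps 1210-1218, the three example loads can be sketched as follows. The field names are hypothetical, and the returned zero assumes a zero-initializing bulk store:

```python
REGION_SIZE = 0x1000  # the 1000 HEX bulk store length from the examples

def handle_load(combine_buffer, load_addr):
    """Process 1200: block a load covered by a pending bulk store; answer a
    load covered by a completed, still-valid bulk store with the known
    initialization value; otherwise load from main memory."""
    for entry in combine_buffer:
        if entry["phys_addr"] <= load_addr < entry["phys_addr"] + REGION_SIZE:
            if entry["pending"]:      # step 1210 yes -> step 1212
                return "blocked"
            if entry["lsu_valid"]:    # step 1214 yes -> step 1218
                return 0              # known value; no memory access needed
            break                     # step 1214 no -> step 1216
    return "load from main memory"    # steps 1208 and 1216

combine_buffer = [{"phys_addr": 0x8000, "pending": 0, "lsu_valid": 1},
                  {"phys_addr": 0x9000, "pending": 1, "lsu_valid": 1}]
# 0x6040 goes to main memory, 0x8040 is answered with zero, 0x9040 is blocked
```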
Step 1210 includes a determination of whether the pending flag 426 for the bulk store operation is set. If so, control passes to step 1212. In FIG. 4, the pending flag 426 is set for the bulk store operation with physical address 0x9000 in main memory 108. Hence, the load from 0x9040 in main memory 108 is blocked, in step 1212. In other words, the load store unit 112 does not allow the third example load instruction to load the data at 0x9040 in main memory 108 into register R3. The blocking is enforced until the bulk store operation with physical address 0x9000 in main memory 108 is completed. - In
FIG. 4, the pending flag 426 is not set for the bulk store operation with physical address 0x8000 in main memory 108. The pending flag 426 not being set indicates that the bulk store operation is complete. Hence, the load from 0x8040 in main memory 108 is not blocked. Thus, for the second example load instruction, control passes to step 1214. Step 1214 includes a determination of whether the LSU valid flag 422 is set for the relevant entry in the bulk store combine buffer 406. If the LSU valid flag 422 is not set (step 1214 is no), then the data is loaded from the relevant address in main memory 108, in step 1216. If the LSU valid flag 422 is set (step 1214 is yes), then the data need not be loaded from the relevant address in main memory 108. Instead, since the initialization values are known, the known initialization values can be provided in step 1218. For example, if it is known that the memory is initialized to all zeroes, then all zeroes are provided to respond to the load operation, without the need to access main memory 108. Hence, time can be saved by avoiding a memory access. In one embodiment, one entry in the bulk store combine buffer 406 contains information to respond to load requests in step 1218. In step 1218, the information in the entry in the bulk store combine buffer 406 may be used to respond to load instructions that request data for any portion of a large (e.g., page sized) region in memory that was initialized by a completed bulk store operation. Hence, cache space may be saved by not storing initialization values in, for example, 64 cache lines. - In one embodiment, the load store unit means for tracking status of page store operations comprises bulk store manager. In one embodiment, the load store unit means for tracking status of page store operations is configured to perform
process 700. In one embodiment, the load store unit means for tracking status of page store operations is configured to perform process 1100. - In one embodiment, means for sending multiple write transactions to the main memory for each page store operation to initialize a page of the main memory comprises one or more of bulk store engine and cache pipeline. In one embodiment, the means for sending multiple write transactions to the main memory for each page store operation to initialize a page of the main memory is configured to perform
process 900. - In one embodiment, means for tracking status of the page store operations and reporting the status to the load store unit comprises one or more of bulk store engine and cache pipeline. In one embodiment, the means for tracking status of the page store operations and reporting the status to the load store unit is configured to perform
process 1000. - In one embodiment, means for maintaining cache coherence in the hierarchy of caches when initializing the page of the main memory for each page store operation comprises one or more of bulk store engine and cache pipeline. In one embodiment, the means for maintaining cache coherence in the hierarchy of caches when initializing the page of the main memory for each page store operation is configured to perform
process 1000. - In one embodiment, the means for tracking page store operations that are pending, wherein each page store operation is associated with a region of the memory to be initialized comprises bulk store manager. In one embodiment, the means for tracking page store operations that are pending, wherein each page store operation is associated with a region of the memory to be initialized is configured to perform
process 700. In one embodiment, the means for tracking page store operations that are pending, wherein each page store operation is associated with a region of the memory to be initialized is configured to perform process 1100. - In one embodiment, the means for blocking younger loads associated with any region of the memory associated with any pending page store operation comprises bulk store manager. In one embodiment, the means for blocking younger loads associated with any region of the memory associated with any pending page store operation is configured to perform
process 1200. - The technology described herein can be implemented using hardware, software, or a combination of both hardware and software. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated or transitory signals.
- Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
- In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/storage devices, peripherals and/or communication interfaces.
- It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details.
- Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.
- For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.
- Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Claims (20)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2020/021153 WO2020190516A2 (en) | 2020-03-05 | 2020-03-05 | Bulk memory initialization |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2020/021153 Continuation WO2020190516A2 (en) | 2020-03-05 | 2020-03-05 | Bulk memory initialization |
Publications (1)
Publication Number | Publication Date |
---|---|
US20230004493A1 true US20230004493A1 (en) | 2023-01-05 |
Family
ID=70166137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/902,263 Pending US20230004493A1 (en) | 2020-03-05 | 2022-09-02 | Bulk memory initialization |
Country Status (3)
Country | Link |
---|---|
US (1) | US20230004493A1 (en) |
CN (1) | CN115380266A (en) |
WO (1) | WO2020190516A2 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060136682A1 (en) * | 2004-12-21 | 2006-06-22 | Sriram Haridas | Method and apparatus for arbitrarily initializing a portion of memory |
US20160217080A1 (en) * | 2015-01-22 | 2016-07-28 | Empire Technology Development Llc | Memory initialization using cache state |
US20180165199A1 (en) * | 2016-12-12 | 2018-06-14 | Intel Corporation | Apparatuses and methods for a processor architecture |
US20180239702A1 (en) * | 2017-02-23 | 2018-08-23 | Advanced Micro Devices, Inc. | Locality-aware and sharing-aware cache coherence for collections of processors |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5835704A (en) * | 1996-11-06 | 1998-11-10 | Intel Corporation | Method of testing system memory |
- 2020
  - 2020-03-05 WO PCT/US2020/021153 patent/WO2020190516A2/en active Application Filing
  - 2020-03-05 CN CN202080098032.3A patent/CN115380266A/en active Pending
- 2022
  - 2022-09-02 US US17/902,263 patent/US20230004493A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
WO2020190516A3 (en) | 2021-01-07 |
WO2020190516A2 (en) | 2020-09-24 |
CN115380266A (en) | 2022-11-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11762780B2 (en) | Write merging on stores with different tags |
US7676636B2 (en) | Method and apparatus for implementing virtual transactional memory using cache line marking | |
US8688951B2 (en) | Operating system virtual memory management for hardware transactional memory | |
US8924653B2 (en) | Transactional cache memory system | |
US7549025B2 (en) | Efficient marking of shared cache lines | |
US20110167222A1 (en) | Unbounded transactional memory system and method | |
US7917698B2 (en) | Method and apparatus for tracking load-marks and store-marks on cache lines | |
CA2289402C (en) | Method and system for efficiently handling operations in a data processing system | |
JPH0997214A (en) | Information-processing system inclusive of address conversion for auxiliary processor | |
US8856478B2 (en) | Arithmetic processing unit, information processing device, and cache memory control method | |
US10983914B2 (en) | Information processing apparatus, arithmetic processing device, and method for controlling information processing apparatus | |
US10853247B2 (en) | Device for maintaining data consistency between hardware accelerator and host system and method thereof | |
US6477622B1 (en) | Simplified writeback handling | |
US20230004493A1 (en) | Bulk memory initialization | |
US7774552B1 (en) | Preventing store starvation in a system that supports marked coherence | |
US20230153249A1 (en) | Hardware translation request retry mechanism | |
US8230173B2 (en) | Cache memory system, data processing apparatus, and storage apparatus | |
WO2023055508A1 (en) | Storing an indication of a specific data pattern in spare directory entries | |
JPH07101412B2 (en) | Data pre-fetching method and multiprocessor system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: HUAWEI TECHNOLOGIES CO., LTD., CHINA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: FUTUREWEI TECHNOLOGIES, INC.; REEL/FRAME: 061081/0855; Effective date: 20210316. Owner name: FUTUREWEI TECHNOLOGIES, INC., TEXAS; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: XIE, YUEJIAN; WANG, QIAN; JIANG, XINGYU; Signing dates from 20200309 to 20200407; REEL/FRAME: 061081/0799 |
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |