WO2012142820A1 - Pre-execution guided data prefetching method and system - Google Patents

Pre-execution guided data prefetching method and system

Info

Publication number
WO2012142820A1
WO2012142820A1 (application PCT/CN2011/080813)
Authority
WO
WIPO (PCT)
Prior art keywords
execution
span
prefetcher
instruction
failure
Application number
PCT/CN2011/080813
Other languages
English (en)
French (fr)
Inventor
程旭
党向磊
王箫音
佟冬
陆俊林
王克义
Original Assignee
北京北大众志微系统科技有限责任公司
济南众志信息技术有限公司
Application filed by 北京北大众志微系统科技有限责任公司 and 济南众志信息技术有限公司
Publication of WO2012142820A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38: Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3824: Operand accessing
    • G06F 9/383: Operand prefetching

Definitions

  • the present invention relates to data prefetching techniques, and more particularly to a pre-execution guided data prefetching method and system.
  • Data prefetching is a widely used memory access latency hiding technique: it predicts the memory access address of data and issues a memory request before the processor actually needs the data, rather than waiting until a cache miss occurs, thereby hiding the memory access latency.
  • Prefetching can be broadly divided into software prefetching and hardware prefetching.
  • Software prefetching is usually triggered by special prefetch instructions inserted into the program manually by a programmer or automatically by a compiler.
  • Software prefetch instructions not only occupy extra processor execution cycles but also increase code size.
  • In addition, software prefetching typically inserts prefetch instructions based on a static analysis of the program's memory access characteristics, so it cannot exploit the dynamic memory access behavior observed at runtime, and it cannot accelerate programs that exist only as binary executables.
  • Hardware prefetching is typically performed by a hardware prefetcher that monitors repeatable memory address patterns during program execution and automatically initiates prefetch requests at runtime. Hardware prefetching captures and exploits the dynamic memory access characteristics of the running program for accurate and timely prefetching; it also avoids the execution-cycle and code-size overhead of prefetch instructions, and is not limited by whether the program can be recompiled.
  • Correlation-based prefetching discovers and records specific correlation rules and events during program execution, and triggers the corresponding prefetch requests when these rules and events are detected to recur.
  • This technique requires a large-capacity (usually megabyte (MB) level) storage structure to record correlation history and the corresponding prefetch addresses, which introduces non-negligible complexity and hardware overhead and makes it difficult to apply in real processors.
  • Stride prefetching is a prefetching technique with low complexity and low hardware overhead, and has been widely used in commercial processors such as the Intel Pentium 4 and IBM POWER6.
  • Stride prefetching performs data prefetching mainly based on the spatial locality of memory accesses, and is mainly applicable to regular access patterns.
  • Pre-execution is also a simple and effective memory access latency tolerance technique.
  • The pre-execution technique uses idle processor cycles to pre-execute the instructions following a missing memory access instruction; by fully exploiting the memory-level parallelism in the program and making effective use of memory system resources, it overlaps the latencies of multiple main memory accesses.
  • By pre-executing the instructions following an instruction that causes an L2 cache miss, pre-execution achieves accurate data prefetching for arbitrary memory access patterns.
  • Compared with pre-execution, the advantages of stride prefetching are mainly reflected in two aspects.
  • First, stride prefetching can prefetch addresses that match the stride access pattern at any time, while pre-execution only prefetches after an L2 cache miss causes the processor to enter pre-execution mode.
  • Second, when the lead time of a pre-execution prefetch is not long enough, the prefetch request may not have completed by the time the processor needs the data; stride prefetching can issue the prefetch request earlier, ensuring that the prefetched data is retrieved in time, before the processor needs it.
  • Conversely, pre-execution prefetches accurately by executing real instruction sequences in advance, while stride prefetching prefetches using predicted access addresses.
  • Moreover, pre-execution can prefetch irregular access patterns, while stride prefetching can only prefetch regular access patterns.
  • The technical problem to be solved by the present invention is to provide a pre-execution guided data prefetching method and system.
  • The present invention provides a pre-execution guided data prefetching method, the method comprising:
  • the stride prefetcher monitors the L2 cache miss access sequence and automatically triggers prefetch requests when a stride access pattern is captured;
  • when an L2 cache miss is detected, the processor backs up the current register state and switches to pre-execution mode;
  • in pre-execution mode, the processor continues to execute the instructions following the instruction that caused the L2 cache miss, accurately prefetching irregular access patterns; it saves the pre-execution results and their valid state into the instruction and result buffer, and extracts the required information from the captured real memory access information to guide the stride prefetcher to issue prefetch requests;
  • after the pre-executed L2 cache miss instruction completes its main memory access, the processor flushes the pipeline, restores the backed-up register state, and resumes execution from the memory access instruction that triggered pre-execution.
  • The step in which the stride prefetcher monitors the L2 cache miss access sequence and captures the stride access pattern includes:
  • the stride prefetcher prefetches forward or backward, and divides streams using a storage-region partitioning method;
  • the stride prefetcher first issues a prefetch request when two consecutive L2 cache misses in the same stream conform to the stride access pattern;
  • the stride prefetcher stores prefetched data in the L2 cache.
  • When a primary L2 miss occurs, the processor allocates a free miss status handling register for it and initializes the register's filter bit; when a secondary L2 miss occurs, an update filter reads the filter bit in the miss status handling register corresponding to the missed line and determines the cause of the secondary L2 miss: for a secondary L2 miss caused by pre-execution, the update to the stride prefetcher is filtered out; for a secondary L2 miss triggered by the stride prefetcher, the stride prefetcher and the filter bit are updated;
  • an L2 cache miss whose target line has already had a main memory access initiated by the stride prefetcher or a pre-executed instruction, but not yet completed, is called a secondary L2 miss; all other L2 cache misses are primary L2 misses.
  • The architectural state is not updated while the processor executes the instructions following the instruction that caused the L2 cache miss.
  • The step in which the processor resumes execution from the memory access instruction that triggered pre-execution includes: starting from that instruction, the processor merges the valid pre-execution results saved in the instruction and result buffer into the architectural state, and re-issues the pre-executed instructions whose results are invalid into the pipeline for execution, committing their results.
  • The invention also provides a pre-execution guided data prefetching system, comprising:
  • a stride prefetcher configured to: monitor the L2 cache miss access sequence, and automatically trigger prefetch requests when a stride access pattern is captured;
  • a processor configured to: back up the current register state and switch to pre-execution mode when an L2 cache miss is detected; in pre-execution mode, continue to execute the instructions following the instruction that caused the L2 cache miss, accurately prefetch irregular access patterns, save the pre-execution results and their valid state into the instruction and result buffer, and extract the required information from the captured real memory access information to guide the stride prefetcher to issue prefetch requests; and, after the pre-executed L2 cache miss instruction completes its main memory access, flush the pipeline, restore the backed-up register state, and resume execution from the memory access instruction that triggered pre-execution.
  • The stride prefetcher is further configured to: prefetch forward or backward when monitoring the L2 cache miss access sequence and capturing the stride access pattern, and divide streams using a storage-region partitioning method.
  • The stride prefetcher is further configured to: issue a prefetch request for the first time when two consecutive L2 cache misses in the same stream conform to the stride access pattern.
  • The stride prefetcher is further configured to: store prefetched data in the L2 cache.
  • The system further comprises an update filter configured to: when a secondary L2 miss occurs, read the filter bit in the miss status handling register corresponding to the missed line and determine the cause of the secondary L2 miss; for a secondary L2 miss caused by pre-execution, filter out the update to the stride prefetcher; for a secondary L2 miss triggered by the stride prefetcher, update the stride prefetcher and the filter bit.
  • The processor is further configured to: allocate a free miss status handling register for a primary L2 miss when it occurs and initialize the register's filter bit; an L2 cache miss whose target line has already had a main memory access initiated by the stride prefetcher or a pre-executed instruction, but not yet completed, is called a secondary L2 miss, and the remaining L2 cache misses are primary L2 misses.
  • The processor is further configured to: not update the architectural state while executing the instructions following the instruction that caused the L2 cache miss.
  • The processor is further configured to: when resuming execution from the memory access instruction that triggered pre-execution, starting from that instruction, merge the valid pre-execution results saved in the instruction and result buffer into the architectural state, and re-issue the pre-executed instructions whose results are invalid into the pipeline to execute and commit their results.
  • The technical solution provided by the present invention uses pre-execution and stride prefetching (Stride Prefetching) to handle different memory access patterns, uses the memory access information captured during pre-execution to guide the prefetching process of the stride prefetcher (Stride Prefetcher), and uses an update filter (Update Filter) to optimize this guidance process.
  • The processor uses the stride prefetcher to prefetch regular access patterns; when an L2 cache miss occurs, the processor enters pre-execution mode, executes subsequent instructions in advance, and accurately prefetches irregular access patterns.
  • The real memory access information captured in advance during pre-execution is used to guide the prefetching process of the stride prefetcher, thereby effectively improving the processor's tolerance of memory access latency.
  • FIG. 1 is a schematic flowchart of the pre-execution guided data prefetching method of this embodiment;
  • FIG. 2 is a schematic diagram comparing the performance of the technical solution of the present invention with existing stride prefetching and pre-execution;
  • FIG. 3 is a schematic diagram of the effect of secondary L2 misses (Secondary L2 Miss) on updates to the stride prefetcher;
  • FIG. 4 is a schematic diagram of processor state transitions in the pre-execution guided data prefetching mechanism of the present invention;
  • FIG. 5 is a schematic structural diagram of the stride prefetcher with an update filter according to the present invention;
  • FIG. 6 is a schematic diagram of the workflow of the update filter of the present invention;
  • FIG. 7 is a schematic structural diagram of a processor using the pre-execution guided data prefetching mechanism of the present invention.
  • Embodiment 1: Pre-execution guided data prefetching method
  • FIG. 1 shows the working principle and main flow of the pre-execution guided data prefetching mechanism of this embodiment.
  • Step S110: at the initial moment, the processor is in normal execution mode (as opposed to pre-execution mode; this denotes the mode in which the processor executes and commits instructions normally, as when the method of the present invention is not in use), executing and committing instructions normally.
  • Step S120: the stride prefetcher is responsible for monitoring the L2 cache miss access sequence and automatically triggers a prefetch request when a stride access pattern is captured.
  • Step S130: when an L2 cache miss is detected, the processor backs up the current register state (Checkpoint) and then immediately switches to pre-execution mode.
  • Step S140: the processor runs in pre-execution mode and continues to execute the instructions following the L2 cache miss instruction (the instruction whose L2 cache access missed), accurately prefetching irregular access patterns; it saves the pre-execution results and their valid state into the instruction and result buffer (Instruction and Result Buffer, IRB), and extracts useful information from the captured real memory access information to guide the stride prefetcher to issue prefetch requests early. The architectural state is not updated (ordinary instructions update the architectural state at commit, saving their results or changing processor state; speculative execution and pre-execution do not update the architectural state, and separate structures are used to hold intermediate results).
  • The present invention provides a store data cache (Store Cache) for saving the store data of store (Store) instructions and forwarding it to subsequent load (Load) instructions for access and use.
  • Pre-executing instructions that are data-independent of the L2 cache miss instruction produces accurate data prefetches and valid computation results; instructions that are data-dependent on the L2 cache miss instruction are removed directly from the pipeline, and their target registers are marked invalid (INV).
  • The processor saves the pre-execution results generated by pre-executed instructions, together with their valid state, into the instruction and result buffer (IRB) to speed up the subsequent normal execution of these instructions.
  • The real memory access information captured during pre-execution is used to guide the prefetching process of the stride prefetcher.
  • To this end, the present invention also provides an update filter for extracting useful information from the captured real memory access information (the remainder may be called harmful information): the useful information is used to guide the stride prefetcher to issue prefetch requests early, while the harmful information is filtered out directly.
  • Step S150: after the pre-executed L2 cache miss instruction completes its main memory access, the processor flushes the pipeline, restores the backed-up register state, and transitions to merge result mode.
  • Step S160: in merge result mode, the processor resumes execution from the memory access instruction that triggered pre-execution (i.e., the L2 cache miss instruction that triggered pre-execution): the valid pre-execution results saved in the IRB are merged directly into the architectural state, and the pre-executed instructions whose results are invalid are re-issued into the pipeline to execute and commit their results. During this period, if another L2 cache miss is detected, the processor transitions to pre-execution mode again.
  • Step S170: after all pre-executed instructions have been committed, the processor returns to normal execution mode.
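The mode transitions in steps S110 through S170 can be sketched as a small state machine. This is a behavioral sketch only; the event names ("l2_miss", "miss_done", "all_committed") are illustrative assumptions, not terms from the patent:

```python
# Schematic (not cycle-accurate) model of the processor mode transitions.
NORMAL, PRE_EXEC, MERGE = "normal", "pre-execution", "merge-result"

def run(events):
    """Walk a stream of events and record the processor mode after each one."""
    mode, trace = NORMAL, []
    for ev in events:
        if ev == "l2_miss" and mode == NORMAL:
            # S130: back up registers (checkpoint) and enter pre-execution.
            mode = PRE_EXEC
        elif ev == "miss_done" and mode == PRE_EXEC:
            # S150: the triggering miss completed; flush pipeline, restore
            # the checkpoint, and merge the pre-execution results.
            mode = MERGE
        elif ev == "l2_miss" and mode == MERGE:
            # S160: a new L2 miss during merging re-enters pre-execution.
            mode = PRE_EXEC
        elif ev == "all_committed" and mode == MERGE:
            # S170: all pre-executed instructions committed.
            mode = NORMAL
        trace.append(mode)
    return trace

trace = run(["l2_miss", "miss_done", "l2_miss", "miss_done", "all_committed"])
assert trace == [PRE_EXEC, MERGE, PRE_EXEC, MERGE, NORMAL]
```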
  • When the stride prefetcher monitors the L2 cache miss sequence and captures a stride access pattern, it can prefetch forward or backward, and it divides streams using a storage-region partitioning method; each stream is responsible for prefetching a 4 KB storage region (or a region of another size).
  • When two consecutive L2 cache misses in the same stream conform to the stride access pattern, a prefetch request for the subsequent addresses is issued for the first time.
  • The initial prefetch request prefetches 2 consecutive L2 cache lines forward or backward relative to the current miss address (the prefetch distance in this embodiment is 2; other embodiments may use other prefetch distances).
  • On each subsequent pattern match, the stride prefetcher prefetches 1 more L2 cache line forward or backward, thereby maintaining a prefetch distance (Prefetch Distance) of 2.
  • The stride prefetcher stores the prefetched data directly in the L2 cache.
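The prefetch-distance arithmetic described above can be illustrated as follows. The 64-byte line size, unit stride, and function names are illustrative assumptions; only the prefetch distance of 2 comes from this embodiment:

```python
# Illustrative arithmetic for the stride prefetch requests described above.
LINE = 64  # assumed L2 cache line size in bytes

def initial_prefetch(miss_addr, direction, stride=1):
    """First trigger: prefetch 2 consecutive lines beyond the miss address.
    `direction` is +1 for forward prefetching, -1 for backward."""
    line = miss_addr // LINE
    return [(line + direction * stride * k) * LINE for k in (1, 2)]

def steady_prefetch(miss_addr, direction, stride=1):
    """Each later pattern match prefetches 1 more line, keeping distance 2."""
    line = miss_addr // LINE
    return (line + direction * stride * 2) * LINE

# A forward stream that just missed on line 10 (byte address 640):
assert initial_prefetch(640, +1) == [11 * 64, 12 * 64]   # lines 11 and 12
assert steady_prefetch(704, +1) == 13 * 64               # next miss -> line 13
```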
  • Both stride prefetching and pre-execution can send main memory access requests in advance, before the program actually needs the data.
  • When a memory access instruction misses in the L2 cache and the target line has already had a main memory access initiated by the stride prefetcher or a pre-executed instruction but not yet completed, the miss is called a secondary L2 miss (Secondary L2 Miss); all other L2 cache misses are called primary L2 misses (Primary L2 Miss).
  • According to whether the main memory access to the target line was initiated by the stride prefetcher or by a pre-executed instruction, secondary L2 misses can be divided into those triggered by the stride prefetcher (which may be called the first type of secondary L2 miss) and those triggered by pre-execution (which may be called the second type of secondary L2 miss).
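This classification can be sketched as follows. Modeling the outstanding main memory accesses as a dictionary, and the initiator labels, are illustrative assumptions, not a hardware description:

```python
# Hypothetical sketch of the primary/secondary L2 miss classification.
# `in_flight` maps a line address to the initiator ("stride" for the stride
# prefetcher, "pre-exec" for a pre-executed instruction) of an outstanding,
# not-yet-completed main memory access to that line.

def classify_miss(line_addr, in_flight):
    """Return (miss class, trigger) for an L2 cache miss on `line_addr`.

    A miss on a line whose main memory access was already initiated by the
    stride prefetcher or a pre-executed instruction (and has not completed)
    is a secondary L2 miss; every other miss is a primary L2 miss.
    """
    if line_addr in in_flight:
        # Secondary miss: also report which type (who initiated the access).
        return ("secondary", in_flight[line_addr])
    return ("primary", None)

# Line 0x40 has an outstanding stride prefetch, line 0x80 an outstanding
# pre-executed access, line 0xC0 nothing in flight.
in_flight = {0x40: "stride", 0x80: "pre-exec"}
assert classify_miss(0x40, in_flight) == ("secondary", "stride")
assert classify_miss(0x80, in_flight) == ("secondary", "pre-exec")
assert classify_miss(0xC0, in_flight) == ("primary", None)
```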
  • The two types of secondary L2 miss have different effects when guiding the stride prefetcher to capture the correct access pattern. If a secondary L2 miss is triggered by the stride prefetcher, the actual access pattern matches the prefetcher's prediction and prefetch behavior; using this information to update the stride prefetcher so that it continues to trigger subsequent data prefetches helps improve prefetch accuracy and processor performance. If a secondary L2 miss is triggered by pre-execution, the stride prefetcher was already updated when the original L2 cache miss (the primary L2 miss) occurred; updating it again may negatively affect its pattern-capture process and thereby reduce prefetch accuracy.
  • The update filter provided by the present invention can effectively identify and remove the updates to the stride prefetcher caused by pre-execution-triggered secondary L2 misses, thereby effectively improving prefetch accuracy.
  • The update filter can be implemented by adding a filter bit to each miss status handling register (Miss Status Handling Register, MSHR) of the L2 cache.
  • When an L2 cache miss occurs, the processor compares the miss address with the addresses in the MSHRs. If an MSHR with the same address exists, the miss is a secondary L2 miss; otherwise it is a primary L2 miss.
  • When a primary L2 miss occurs, the processor allocates a free MSHR and initializes its filter bit: if the primary L2 miss was triggered by a prefetch request from the stride prefetcher, the filter bit is initialized to 0; otherwise it is initialized to 1. When a secondary L2 miss occurs, the update filter reads the filter bit in the MSHR corresponding to the missed line to determine the cause of the secondary L2 miss: if the filter bit is 1, the secondary L2 miss was triggered by pre-execution, so its update to the stride prefetcher is filtered out; if the filter bit is 0, the secondary L2 miss was triggered by the stride prefetcher, so it is allowed to update the stride prefetcher and issue prefetch requests early, and the filter bit of the corresponding MSHR is set to 1 to prevent subsequent secondary L2 misses on the same line from updating the stride prefetcher again.
  • The prefetching benefit of the data prefetching scheme proposed by the technical solution of the present invention is mainly reflected in three aspects: prefetch coverage, prefetch timeliness, and prefetch accuracy.
  • Stride prefetching can prefetch regular access patterns at any time, while pre-execution can prefetch irregular access patterns once an L2 cache miss causes the processor to enter pre-execution mode.
  • Combining the strengths of stride prefetching and pre-execution in capturing access patterns improves prefetch coverage.
  • For addresses that both pre-execution and stride prefetching can cover, stride prefetching can also issue the prefetch requests earlier, improving prefetch timeliness.
  • The update filter improves prefetch accuracy by effectively removing harmful updates to the stride prefetcher during the pre-execution guidance process.
  • Figure 2 shows the performance advantage of the present invention compared with stride prefetching and pre-execution, as follows:
  • The execution of the memory access instruction sequence on an in-order processor is shown in case (a) of Figure 2: L2 cache misses occur when the processor executes memory access instructions A, B, C, D, and E.
  • On each miss, the pipeline stalls and waits for the miss handling to complete.
  • The line addresses of memory access instructions B, D, and E are consecutive, namely L+1, L+2, and L+3 respectively.
  • The line address of memory access instruction C is S.
  • The optimization effect of stride prefetching is shown in case (b) of Figure 2.
  • The stride prefetcher captures the stride access pattern and then issues prefetch requests for lines L+2 and L+3, so L2 cache hits occur when executing memory access instructions D and E.
  • However, an L2 cache miss still occurs when memory access instruction C is executed.
  • The optimization effect of pre-execution is shown in case (c) of Figure 2.
  • The processor pre-executes memory access instructions B, C, and D (denoted b', c', and d' in the figure) and initiates main memory accesses to addresses L+1, S, and L+2 in advance.
  • As a result, memory access instructions B and C hit in the L2 cache.
  • Because the valid computation results produced by pre-execution are reused, the execution interval between memory access instructions C and D becomes shorter; consequently, when instruction D is executed, the prefetch request for address L+2 has not yet completed, and D still misses in the L2 cache.
  • Memory access instruction E is not covered, so an L2 cache miss still occurs when it is executed.
  • The optimization effect of the technical solution of the present invention is shown in case (d) of FIG. 2.
  • While instruction A's miss is being serviced, the processor pre-executes memory access instructions B and C (denoted b' and c' in the figure) and initiates main memory accesses to addresses L+1 and S in advance.
  • Meanwhile, the stride prefetcher captures the stride access pattern (lines L and L+1 both miss) and then issues prefetch requests for lines L+2 and L+3.
  • FIG. 3 shows the different effects of the two types of secondary L2 miss on updates to the stride prefetcher, as follows:
  • Memory access instruction 1 misses in the L2 cache, causing the processor to transition to pre-execution mode; in pre-execution mode, memory access instructions 2 through 6 all miss in the L2 cache.
  • The stride prefetcher captures the stride access pattern and issues prefetch requests for lines L+2 and L+3. Next, memory access instruction 3 (line L) and memory access instruction 5 (line L+1) both incur secondary L2 misses, triggered by pre-execution; similarly, memory access instruction 4 (line L+2) and memory access instruction 6 (line L+3) also incur secondary L2 misses, but these are triggered by the stride prefetcher.
  • With the update filter, the stride prefetcher not only avoids the damage that instruction 3's miss would do to the captured prefetch pattern (and the subsequent useless prefetches of lines L and L-1), but also, when it detects instruction 4's miss, continues to issue a prefetch request for line L+4 because the pattern match succeeds, thereby performing a correct data prefetch for memory access instruction 7.
  • When memory access instruction 7 is executed, its memory access latency can therefore be completely eliminated or partially hidden.
  • To this end, the present invention designs an update filter that can effectively identify and filter pre-execution-triggered secondary L2 misses at runtime, preventing them from destructively updating and training the stride prefetcher, thereby improving prefetch accuracy.
  • FIG. 4 is a schematic diagram of processor state transition using the data prefetch mechanism of pre-execution guidance.
  • the specific processor execution process is as follows:
  • The stride prefetcher monitors the L2 cache miss access sequence and automatically triggers prefetch requests when a stride access pattern is captured.
  • When an L2 cache miss is detected, the processor backs up the current register state and then immediately switches to pre-execution mode.
  • In pre-execution mode, the processor continues to execute the instructions following the L2 cache miss instruction, but does not update the architectural state.
  • Pre-executing data-independent instructions not only enables accurate data prefetching but also produces valid computation results.
  • The real memory access information captured during pre-execution is used to guide the prefetching process of the stride prefetcher.
  • The update filter divides the captured real memory access information into useful and harmful information: the useful information is used to guide the stride prefetcher to issue prefetch requests early, and the harmful information is filtered out directly.
  • In merge result mode, the processor resumes execution from the memory access instruction that triggered pre-execution and commits the corresponding execution results.
  • The valid pre-execution results saved in the IRB are merged directly into the architectural state; the pre-executed instructions whose results are invalid are re-issued into the pipeline for execution.
  • During this period, if another L2 cache miss is detected, the processor transitions to pre-execution mode again.
  • After all pre-executed instructions have been committed, the processor returns to normal execution mode.
  • Figure 5 shows the structure of the stride prefetcher with the update filter, as follows: the update filter screens the L2 cache miss information, updating the stride prefetcher only with useful information; harmful information is filtered out directly.
  • The hardware structure of the stride prefetcher mainly consists of a stream table (Stream Table) and related logic.
  • The stream table is an 8-entry fully associative structure using a pseudo least recently used (pseudo-LRU) replacement algorithm. Each entry contains the following six fields:
  • Tag field: the upper 20 bits of the physical address, identifying the 4 KB prefetch storage region of the stream (the storage region size is set to 4 KB in this embodiment of the present invention; other embodiments may use other sizes);
  • Valid field: identifies whether the stream is valid;
  • Direction field: identifies the prefetch direction of the stream, 0 for backward prefetching, 1 for forward prefetching;
  • Stride field: identifies the stride of the access addresses in the stream;
  • Last Index field: records the index of the last missed L2 cache line;
  • State field: identifies the current phase of the stream, either the training phase or the prefetch phase.
  • The remaining logic mainly includes three parts: stream hit detection logic, pattern matching logic, and prefetch address generation logic.
  • The stream hit detection logic uses the tag of the L2 cache access address to perform an associative search of the stream table; if an entry's Tag field matches, a stream hit occurs.
  • The pattern matching logic compares, according to the Direction field of the hit entry, the value of the Stride field with the difference between the index of the L2 cache access address and the Last Index; if they are equal, the pattern match succeeds, otherwise it fails.
  • The prefetch address generation logic computes the prefetch address from the tag and index of the L2 cache access address, the prefetch distance (set to 2 in this embodiment of the present invention), and the values of the Direction and Stride fields of the hit entry.
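As a rough software model of one stream-table entry and the three logic blocks described above (stream hit detection, pattern matching, prefetch address generation): the 4 KB region / 64-byte line split, the class and function names, and all concrete numbers are illustrative assumptions, not a hardware description.

```python
LINE = 64      # assumed bytes per L2 cache line
REGION = 4096  # prefetch storage region per stream (4 KB, as in the text)

class StreamEntry:
    """One stream-table entry with the six fields described above."""
    def __init__(self, tag, direction, stride, last_index):
        self.valid = True
        self.tag = tag                 # upper address bits: the 4 KB region
        self.direction = direction     # 1 = forward, 0 = backward
        self.stride = stride           # stride between misses, in lines
        self.last_index = last_index   # index of the last missed line
        self.state = "training"        # "training" or "prefetch"

def split(addr):
    """Split a physical address into (tag, line index within the region)."""
    return addr // REGION, (addr % REGION) // LINE

def on_l2_access(entry, addr, distance=2):
    """Return the prefetch address on a successful pattern match, else None."""
    tag, index = split(addr)
    if not (entry.valid and entry.tag == tag):
        return None                    # stream hit detection failed
    step = 1 if entry.direction else -1
    if index - entry.last_index != step * entry.stride:
        return None                    # pattern match failed
    entry.last_index = index
    entry.state = "prefetch"
    # Prefetch `distance` strides ahead of the current access address.
    return tag * REGION + (index + step * entry.stride * distance) * LINE

# Forward stream with stride 1 that last missed on line 5 of region 3:
e = StreamEntry(tag=3, direction=1, stride=1, last_index=5)
pf = on_l2_access(e, 3 * REGION + 6 * LINE)  # access to line 6 -> match
assert pf == 3 * REGION + 8 * LINE           # prefetch line 6 + 2
```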
  • Figure 6 shows the workflow diagram of the update filter.
  • The detailed workflow is as follows: when an L2 cache miss occurs, the processor compares the miss address with the addresses in the MSHRs; if an MSHR with the same address exists, the miss is a secondary L2 miss, otherwise it is a primary L2 miss. When a primary L2 miss occurs, the processor allocates a free MSHR and initializes its filter bit: if the primary L2 miss was triggered by a prefetch request from the stride prefetcher, the filter bit is initialized to 0; otherwise it is initialized to 1.
  • When a secondary L2 miss occurs, the update filter reads and checks the filter bit in the MSHR corresponding to the missed line: if the filter bit is 1, the secondary L2 miss was triggered by pre-execution, so its update to the stride prefetcher is filtered out; if the filter bit is 0, the secondary L2 miss was triggered by the stride prefetcher, so it is allowed to update the stride prefetcher and issue prefetch requests early, and the filter bit of the corresponding MSHR is set to 1 to prevent subsequent secondary L2 misses on the same line from updating the stride prefetcher again.
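The workflow above can be sketched behaviorally as follows, modeling each MSHR as a map entry from missed line to filter bit. The function signature and the return convention (whether the stride prefetcher may be updated by this miss), and the assumption that a primary miss always trains the prefetcher, are illustrative:

```python
# Behavioral sketch of the update-filter workflow in Figure 6.

def on_l2_miss(mshr, line, from_stride_prefetch):
    """Handle one L2 cache miss on `line`; return whether the stride
    prefetcher may be updated by this miss.

    mshr: dict mapping line address -> filter bit of the outstanding miss.
    from_stride_prefetch: True if this request was issued by the stride
    prefetcher (meaningful for primary misses).
    """
    if line not in mshr:
        # Primary L2 miss: allocate an MSHR and initialize the filter bit
        # (0 if triggered by a stride prefetch request, otherwise 1).
        mshr[line] = 0 if from_stride_prefetch else 1
        return True  # assumed: a primary miss trains the prefetcher
    # Secondary L2 miss: consult the filter bit.
    if mshr[line] == 1:
        return False  # triggered by pre-execution: filter out the update
    mshr[line] = 1    # allow one update, then block further ones on this line
    return True

mshr = {}
assert on_l2_miss(mshr, 0x100, from_stride_prefetch=True)       # primary, bit=0
assert on_l2_miss(mshr, 0x100, from_stride_prefetch=False)      # secondary, allowed
assert not on_l2_miss(mshr, 0x100, from_stride_prefetch=False)  # now filtered
```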
  • FIG. 7 shows the structure of a processor employing the pre-execution guided data prefetching mechanism.
  • The newly added structures are described in detail as follows:
  • The Checkpoint backs up the architectural register file when the processor enters pre-execution mode;
  • The INV bits identify invalid registers in pre-execution mode;
  • The Store Cache saves the store data of Store instructions in pre-execution mode and forwards it to subsequent Load instructions for access and use;
  • The Instruction and Result Buffer (IRB) saves the computation results of pre-executed instructions and their validity states;
  • The Stride Prefetcher prefetches the regular access patterns among L2 Cache accesses;
  • The Update Filter screens the L2 Cache miss information, filtering out harmful information and updating the stride prefetcher only with useful information.
  • Embodiment 2: a pre-execution guided data prefetching system
  • A stride prefetcher for monitoring the L2 cache miss access sequence and automatically triggering prefetch requests when a stride access pattern is captured;
  • A processor for backing up the current register state and switching to pre-execution mode when an L2 cache access miss is detected; in pre-execution mode, for continuing to execute the instructions following the instruction that caused the L2 cache miss, performing accurate prefetching for irregular access patterns,
  • saving the obtained pre-execution results and their validity states in the instruction and result buffer, and extracting useful information from the captured real memory-access information to guide the stride prefetcher to issue prefetch requests early; and, after the L2-cache-missing instruction that triggered pre-execution completes its main-memory access,
  • for flushing the pipeline, restoring the backed-up register state, and resuming execution from the memory-access instruction that triggered pre-execution.
  • The stride prefetcher is configured to prefetch forward or backward when monitoring the L2 cache miss access sequence and capturing the stride access pattern, and to divide streams using a storage-region partitioning method.
  • The stride prefetcher is configured to issue its first prefetch request when two consecutive L2 cache misses within the same stream match the stride access pattern.
  • The stride prefetcher is configured to store prefetched data in the L2 cache.
  • The system may further comprise an update filter, as described below.
  • The processor is configured to allocate an idle miss-status handling register for a Primary L2 Miss and initialize the register's filter bit when the Primary L2 Miss occurs. An L2 cache miss for which the requested line has already had a main-memory access initiated, but not yet completed, by the stride prefetcher or a pre-executed
  • instruction is called the Secondary L2 Miss, and the remaining L2 cache misses are Primary L2 Misses.
  • The processor does not update the architectural state while continuing to execute the instructions following the instruction that caused the L2 cache miss.
  • The processor is configured, when resuming execution from the memory-access instruction that triggered pre-execution, to merge the pre-execution results saved in the instruction and result buffer into the architectural state, and to re-issue into the pipeline the pre-executed instructions whose computation results are invalid, executing them and committing the corresponding results.
  • Those skilled in the art will appreciate that the components of the system embodiments and the steps of the method embodiments provided above can be implemented with general-purpose computing devices; they may be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they may be separately fabricated as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated as a single integrated-circuit module.
  • Thus, the invention is not limited to any specific combination of hardware and software.
  • The technical solution provided by the present invention uses pre-execution and stride prefetching separately to handle different memory-access patterns, uses the memory-access information gathered during pre-execution to guide the prefetching
  • process of the Stride Prefetcher, and uses an Update Filter to optimize the pre-execution guidance process.
  • In normal mode, the processor uses the stride prefetcher to prefetch regular access patterns; when an L2 cache miss occurs, the processor enters pre-execution mode and executes subsequent instructions in advance, which
  • performs accurate prefetching for irregular access patterns and uses the real memory-access information captured ahead of time during pre-execution to guide the prefetching process of the stride prefetcher, thereby effectively improving the processor's memory-latency tolerance.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)

Abstract

The present invention discloses a data prefetching method and system, overcoming the failure of the prior art to effectively combine pre-execution and stride prefetching. The method includes: a stride prefetcher monitors the L2 cache miss access sequence and triggers prefetch requests when a stride access pattern is captured; when an L2 cache access miss is detected, the processor backs up the current register state, switches to pre-execution mode, executes the instructions following the instruction that caused the L2 cache miss, performs accurate prefetching for irregular access patterns, saves the obtained pre-execution results and their validity states in a buffer, and extracts useful information from the captured real memory-access information to guide the stride prefetcher to issue prefetch requests early; after the L2-cache-missing instruction that triggered pre-execution completes its main-memory access, the processor flushes the pipeline, restores the backed-up register state, and resumes execution from the memory-access instruction that triggered pre-execution. The present invention effectively improves the processor's memory-latency tolerance.

Description

Pre-Execution Guided Data Prefetching Method and System
Technical Field
The present invention relates to data prefetching techniques, and in particular to a pre-execution guided data prefetching method and system.
Background
As the performance gap between processors and memory continues to widen, memory-access latency has an increasingly severe impact on processor performance and has become the main bottleneck limiting performance improvement. Although caches can effectively bridge the performance gap between processor and memory, cache designs usually adopt an on-demand data-fetch policy and cannot effectively handle the complex and diverse memory-address patterns of applications. As application working sets keep growing, even a large on-chip cache may fail to satisfy an application's data-access demands. Therefore, how to effectively reduce or hide memory-access latency is one of the key problems in high-performance processor design.
Data prefetching is a widely used latency-tolerance technique: it predicts a piece of data's access address and issues a request before the processor actually needs the data, instead of waiting for a cache miss before initiating the main-memory access, thereby hiding the access latency.
Prefetching implementations fall mainly into two classes: software prefetching and hardware prefetching.
Software prefetching is usually triggered by dedicated prefetch instructions inserted into the program manually by the programmer or automatically by the compiler. Software prefetch instructions not only consume extra processor execution cycles but also enlarge the code. Moreover, software prefetching typically relies on profiling a program's static memory-access behavior to insert prefetch instructions, so it cannot exploit the dynamic access behavior at run time, and it cannot accelerate programs that exist only as binary executables.
Hardware prefetching is usually performed by a hardware prefetcher that monitors repeatable memory-address patterns during program execution and automatically issues prefetch requests. Hardware prefetching can capture and exploit the dynamic access behavior at run time, enabling accurate and timely prefetching. It also avoids the execution-cycle and code-size overheads of prefetch instructions and is not limited by whether a program can be recompiled.
Traditional hardware prefetching techniques fall mainly into two classes: correlation-based prefetching and stride prefetching.
Correlation-based prefetching discovers and records specific correlation rules and events during program execution and triggers the corresponding prefetch requests when it observes those rules and events recurring. This technique requires large-capacity storage structures (typically on the megabit scale) to record correlation history and the corresponding prefetch addresses, which incurs non-negligible complexity and hardware overhead and makes it difficult to apply in real processors.
Compared with correlation-based prefetching, stride prefetching has lower complexity and hardware overhead, and it has been widely deployed in commercial processors such as the Intel Pentium 4 and IBM Power6. Stride prefetching exploits the spatial locality of memory accesses and is mainly suited to regular access patterns.
Pre-execution is also a simple and effective latency-tolerance technique.
To keep the pipeline from stalling on long-latency cache misses (such as L2 Cache misses), pre-execution uses the processor's idle cycles to execute in advance the instructions following the missing memory-access instruction; by fully exploiting the memory-level parallelism in the program and making effective use of memory bandwidth, it overlaps the latencies of multiple main-memory accesses. By pre-executing the instructions following an L2 Cache miss, pre-execution can perform accurate data prefetching for arbitrary access patterns.
Upon analysis, the inventors of the present invention found the following.
Compared with pre-execution, the advantages of stride prefetching lie mainly in two aspects.
First, stride prefetching can prefetch addresses matching the stride access pattern at any time, whereas pre-execution prefetches only after an L2 Cache miss has put the processor into pre-execution mode.
Second, pre-execution may not issue its prefetches early enough, so a prefetch for a piece of data may still be outstanding when the processor needs it, whereas stride prefetching can issue prefetch requests earlier, ensuring the prefetched data returns before the processor needs it.
Compared with stride prefetching, the advantages of pre-execution also lie mainly in two aspects.
First, pre-execution performs accurate prefetching by executing real instruction fragments ahead of time, whereas stride prefetching prefetches using predicted access addresses.
Second, pre-execution can prefetch irregular access patterns, whereas stride prefetching can prefetch only regular ones.
The above analysis shows that stride prefetching and pre-execution each have their own characteristics and advantages; it is therefore worth combining their strengths effectively so that each can be exploited more fully, further improving processor performance.
Summary of the Invention
The technical problem to be solved by the present invention is the need to provide a pre-execution guided data prefetching technique that overcomes the shortcomings of the prior art.
To solve the above technical problem, the present invention provides a pre-execution guided data prefetching method, the method comprising:
a stride prefetcher monitors the L2 cache miss access sequence and automatically triggers prefetch requests when a stride access pattern is captured;
when an L2 cache access miss is detected, the processor backs up the current register state and switches to pre-execution mode;
in pre-execution mode, the processor continues executing the instructions following the instruction that caused the L2 cache miss, performs accurate prefetching for irregular access patterns, saves the obtained pre-execution results and their validity states in an instruction and result buffer, and extracts the required information from the captured real memory-access information to guide the stride prefetcher to issue the prefetch requests;
after the L2-cache-missing instruction that triggered pre-execution completes its main-memory access, the processor flushes the pipeline, restores the backed-up register state, and resumes execution from the memory-access instruction that triggered pre-execution.
Preferably, the step in which the stride prefetcher monitors the L2 cache miss access sequence and captures the stride access pattern comprises:
the stride prefetcher prefetches forward or backward, and divides streams using a storage-region partitioning method. Preferably, the stride prefetcher issues its first prefetch request when it observes that two consecutive L2 cache misses within the same stream match the stride access pattern.
Preferably, the stride prefetcher stores the prefetched data in the L2 cache. Preferably, when a Primary L2 Miss occurs, the processor allocates an idle miss-status handling register for the Primary L2 Miss and initializes the register's filter bit; when a Secondary L2 Miss occurs, an update filter reads the filter bit in the miss-status handling register corresponding to the missed line and determines the cause of the Secondary L2 Miss; for a Secondary L2 Miss triggered by pre-execution, the update to the stride prefetcher is filtered out; for a Secondary L2 Miss triggered by the stride prefetcher, the stride prefetcher and the filter bit are updated;
wherein an L2 cache miss for which the requested line has already had a main-memory access initiated, but not yet completed, by the stride prefetcher or a pre-executed instruction is called the Secondary L2 Miss, and the remaining L2 cache misses are the Primary L2 Misses.
Preferably, the processor does not update the architectural state while continuing to execute the instructions following the instruction that caused the L2 cache miss.
Preferably, the step in which the processor resumes execution from the memory-access instruction that triggered pre-execution comprises: starting from that instruction, the processor merges the pre-execution results saved in the instruction and result buffer into the architectural state, and re-issues into the pipeline the pre-executed instructions whose computation results are invalid, executing them and committing the results.
The present invention further provides a pre-execution guided data prefetching system, comprising:
a stride prefetcher configured to monitor the L2 cache miss access sequence and automatically trigger prefetch requests when a stride access pattern is captured;
a processor configured to: back up the current register state and switch to pre-execution mode when an L2 cache access miss is detected; in pre-execution mode, continue executing the instructions following the instruction that caused the L2 cache miss, perform accurate prefetching for irregular access patterns, save the obtained pre-execution results and their validity states in an instruction and result buffer, and extract the required information from the captured real memory-access information to guide the stride prefetcher to issue the prefetch requests; and, after the L2-cache-missing instruction that triggered pre-execution completes its main-memory access, flush the pipeline, restore the backed-up register state, and resume execution from the memory-access instruction that triggered pre-execution.
Preferably, the stride prefetcher is further configured to prefetch forward or backward when monitoring the L2 cache miss access sequence and capturing the stride access pattern, and to divide streams using a storage-region partitioning method. Preferably, the stride prefetcher is further configured to issue its first prefetch request when two consecutive L2 cache misses within the same stream are observed to match the stride access pattern.
Preferably, the stride prefetcher is further configured to store the prefetched data in the L2 cache. Preferably, the system further comprises an update filter configured to: when a Secondary L2 Miss occurs, read the filter bit in the miss-status handling register corresponding to the Secondary-L2-missed line and determine the cause of the Secondary L2 Miss; filter out updates to the stride prefetcher for Secondary L2 Misses triggered by pre-execution; and, for Secondary L2 Misses triggered by the stride prefetcher, update the stride prefetcher and the filter bit;
wherein the processor is further configured to allocate an idle miss-status handling register for a Primary L2 Miss and initialize the register's filter bit when the Primary L2 Miss occurs; an L2 cache miss for which the requested line has already had a main-memory access initiated, but not yet completed, by the stride prefetcher or a pre-executed instruction is called the Secondary L2 Miss, and the remaining L2 cache misses are the Primary L2 Misses.
Preferably, the processor is further configured not to update the architectural state while continuing to execute the instructions following the instruction that caused the L2 cache miss.
Preferably, the processor is further configured, when resuming execution from the memory-access instruction that triggered pre-execution, to start from that instruction, merge the pre-execution results saved in the instruction and result buffer into the architectural state, and re-issue into the pipeline the pre-executed instructions whose computation results are invalid, executing them and committing the results.
Compared with the prior art, the technical solution provided by the present invention uses pre-execution and stride prefetching separately to handle different memory-access patterns, uses the memory-access information gathered during pre-execution to guide the prefetching process of the Stride Prefetcher, and uses an Update Filter to optimize the pre-execution guidance process. In normal mode, the processor uses the stride prefetcher to prefetch regular access patterns; when an L2 Cache miss occurs, the processor enters pre-execution mode and executes subsequent instructions in advance to perform accurate prefetching for irregular access patterns, using the real memory-access information captured ahead of time during pre-execution to guide the prefetching process of the stride prefetcher, thereby effectively improving the processor's memory-latency tolerance.
Other features and advantages of the present invention will be set forth in the following description, will in part become apparent from the description, or will be understood by practicing the invention. The objectives and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the description, the claims, and the accompanying drawings. Brief Description of the Drawings
The accompanying drawings provide a further understanding of the technical solution of the present invention and form part of the description; together with the embodiments of the present invention, they serve to explain the technical solution and do not limit it. In the drawings:
FIG. 1 is a flow diagram of the pre-execution guided data prefetching method of the present embodiment; FIG. 2 is a performance analysis comparing the technical solution of the present invention with existing stride prefetching and pre-execution; FIG. 3 illustrates the effects of the two kinds of Secondary L2 Miss on updates to the stride prefetcher in the present invention;
FIG. 4 is a state-transition diagram of a processor employing the pre-execution guided data prefetching mechanism of the present invention;
FIG. 5 is a structural diagram of the stride prefetcher with an update filter of the present invention;
FIG. 6 is a flow diagram of the update filter of the present invention;
FIG. 7 is a structural diagram of a processor employing the pre-execution guided data prefetching mechanism of the present invention.
Preferred Embodiments of the Invention
The embodiments of the present invention are described in detail below with reference to the drawings and examples, so that the process by which the invention applies technical means to solve the technical problem and achieve its technical effect can be fully understood and implemented accordingly.
Provided no conflict arises, the embodiments and the features of the embodiments may be combined with one another, all falling within the protection scope of the present invention. In addition, the steps shown in the flowcharts of the drawings may be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from that given here.
Embodiment 1: a pre-execution guided data prefetching method
FIG. 1 shows the operating principle and main flow of the pre-execution guided data prefetching mechanism of this embodiment.
Step S110: at the initial moment the processor is in normal execution mode (as distinguished from pre-execution mode, this denotes the mode in which the processor normally executes and commits instructions without using the method of the present invention), executing and committing instructions normally.
Step S120: the stride prefetcher monitors the L2 Cache miss access sequence and automatically triggers prefetch requests when a stride access pattern is captured.
Step S130: when an L2 Cache access miss is detected, the processor backs up (checkpoints) the current register state and then immediately switches to pre-execution mode.
Step S140: the processor runs in pre-execution mode, continuing to execute the instructions following the instruction that caused the L2 Cache miss (the instruction corresponding to the failed L2 Cache access), performing accurate prefetching for irregular access patterns, saving the obtained pre-execution results and their validity states in the Instruction and Result Buffer (IRB), and extracting useful information from the captured real memory-access information to guide the stride prefetcher to issue prefetch requests early; the architectural state, however, is not updated (in general the architectural state is updated only when an instruction commits, to save execution results or change processor state; speculative execution or pre-execution does not update the architectural state, and separate structures are used to hold intermediate results).
The present invention provides a Store Cache for saving the store data of Store instructions and forwarding it to subsequent Load instructions for access and use. Pre-executing instructions that are data-independent of the L2-Cache-missing instruction can produce accurate data prefetches and valid computation results; instructions that are data-dependent on the L2-Cache-missing instruction are removed from the pipeline directly, and their destination registers are marked invalid (INV).
During pre-execution, the processor saves the pre-execution results produced by pre-executed instructions and their validity states in the Instruction and Result Buffer (IRB) to speed up subsequent normal execution. Meanwhile, the real memory-access information gathered during pre-execution is used to guide the prefetching process of the stride prefetcher.
The present invention further provides an update filter that extracts useful information from the captured real memory-access information (the remaining information may be called harmful information); the useful information is used to guide the stride prefetcher to issue prefetch requests early, while the remaining harmful information is filtered out directly. Step S150: after the L2-Cache-missing instruction that triggered pre-execution completes its main-memory access, the processor flushes the pipeline, restores the backed-up register state, and switches to a merge-results mode. Step S160: in merge-results mode, the processor resumes execution from the memory-access instruction that triggered pre-execution (that is, the L2-Cache-missing instruction that triggered pre-execution), which includes merging the valid pre-execution results saved in the IRB directly into the architectural state and re-issuing into the pipeline the pre-executed instructions whose computation results are invalid, executing them and committing the corresponding results. During this period, if another L2 Cache access miss is detected, the processor switches to pre-execution mode again.
Step S170: after all pre-executed instructions have committed, the processor returns to normal execution mode.
When monitoring the L2 Cache miss sequence and capturing the stride access pattern, the stride prefetcher can prefetch in either direction, forward or backward, and divides streams using a storage-region partitioning method, each stream being responsible for prefetching a 4 KB storage region (or a region of another size).
When two consecutive L2 Cache misses within the same stream are observed to match the stride access pattern, the first prefetch request for subsequent addresses is issued. For each stream, this first prefetch request fetches 2 consecutive L2 Cache lines forward or backward relative to the current miss address (the prefetch distance in this embodiment is 2; other embodiments may use other prefetch distances); thereafter, whenever a prefetched L2 Cache line is consumed by the processor, the stride prefetcher prefetches 1 further L2 Cache line forward or backward, maintaining a prefetch distance of 2.
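The per-stream prefetch-distance policy just described (an initial burst of 2 lines, then one more line per consumed prefetch, keeping the distance at 2) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the representation of cache lines as integer line numbers are invented, not the patent's hardware interface.

```python
def initial_prefetch(miss_line, direction, distance=2):
    """On the second matching miss of a stream, prefetch `distance`
    consecutive L2 cache lines ahead of (or behind) the missed line."""
    step = 1 if direction == "forward" else -1
    return [miss_line + step * i for i in range(1, distance + 1)]


def next_prefetch(last_prefetched_line, direction):
    """Each time a prefetched line is consumed by the processor, fetch one
    more line in the same direction, keeping the prefetch distance at 2."""
    step = 1 if direction == "forward" else -1
    return last_prefetched_line + step
```

For example, a forward stream missing on line 10 first requests lines 11 and 12; when line 11 is later consumed, line 13 is requested, so two prefetched lines always stay ahead of the consumption point.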
To reduce area overhead and the extra cost of maintaining data coherence, and to avoid polluting the L1 Cache, the stride prefetcher stores prefetched data directly in the L2 Cache.
Both stride prefetching and pre-execution can issue main-memory access requests before the program actually needs the data.
During program execution, if a memory-access instruction misses in the L2 Cache and the requested line has already had a main-memory access initiated, but not yet completed, by the stride prefetcher or a pre-executed instruction, the miss is called a Secondary L2 Miss; all other L2 Cache misses are called Primary L2 Misses.
Depending on whether the requested line's main-memory access was initiated early by the stride prefetcher or by a pre-executed instruction, Secondary L2 Misses can be divided into two classes: those triggered by the stride prefetcher (the first class) and those triggered by pre-execution (the second class).
These two classes of Secondary L2 Miss have different effects on guiding the stride prefetcher to capture the correct access pattern. If a Secondary L2 Miss is triggered by the stride prefetcher, the actual access pattern matches the prefetcher's expectation and prefetching behavior; using its information to update the stride prefetcher so that it keeps triggering prefetches of subsequent data helps improve prefetch accuracy and hence processor performance. If a Secondary L2 Miss is triggered by pre-execution, a memory-access instruction with the same address has already missed in the L2 Cache (a Primary L2 Miss) and updated the stride prefetcher; updating the prefetcher again at this point may adversely affect its access-pattern capture and reduce prefetch accuracy.
The update filter provided by the present invention can effectively identify and discard updates to the stride prefetcher from pre-execution-triggered Secondary L2 Misses, thereby effectively improving prefetch accuracy.
The update filter can be implemented by adding a one-bit filter bit to each Miss Status Handling Register (MSHR) of the L2 Cache.
When an L2 Cache miss occurs, the processor compares the miss address against the addresses in the MSHRs in a fully associative manner. If an MSHR with the same address exists, the miss is a Secondary L2 Miss; otherwise it is a Primary L2 Miss.
When a Primary L2 Miss occurs, the processor allocates an idle MSHR for it and initializes the corresponding filter bit: if the Primary L2 Miss was triggered by a prefetch request from the stride prefetcher, the filter bit is initialized to 0; otherwise it is initialized to 1. Later, when a Secondary L2 Miss occurs, the update filter reads the filter bit in the MSHR corresponding to the missed line and checks the cause of the Secondary L2 Miss: if the filter bit is 1, the Secondary L2 Miss was triggered by pre-execution, so its update to the stride prefetcher is filtered out; if the filter bit is 0, the Secondary L2 Miss was triggered by the stride prefetcher, and it is allowed to update the stride prefetcher and issue prefetch requests early, while the filter bit of the corresponding MSHR is set to 1 to prevent subsequent Secondary L2 Misses on that line from updating the stride prefetcher again. When the main-memory access for the missed L2 Cache line completes, the corresponding filter bit is released along with the MSHR.
Compared with stride prefetching and pre-execution alone, the pre-execution guided data prefetching proposed by the technical solution of the present invention improves prefetching in three respects: prefetch coverage, prefetch timeliness, and prefetch accuracy.
First, stride prefetching can continuously prefetch regular access patterns, and pre-execution can prefetch irregular access patterns once an L2 Cache miss puts the processor into pre-execution mode; combining the strengths of the two in capturing access patterns increases prefetch coverage.
Second, guided by the real memory-access information obtained during pre-execution, stride prefetching can also issue prefetch requests even earlier for the addresses that both pre-execution and stride prefetching could generate, improving prefetch timeliness.
Third, by effectively removing harmful updates to the stride prefetcher during the pre-execution guidance process, the update filter improves prefetch accuracy.
FIG. 2 illustrates the performance advantage of the present invention over stride prefetching and pre-execution, as follows: case (a) in FIG. 2 shows a memory-access instruction sequence executing on an in-order processor, where memory-access instructions A, B, C, D, and E all miss in the L2 Cache and the pipeline stalls waiting for each miss to be handled. The line addresses of instructions A, B, D, and E are consecutive: L, L+1, L+2, and L+3 respectively. The line address of instruction C is S.
The effect of stride prefetching is shown in case (b) of FIG. 2: when instructions A and B miss on consecutive L2 Cache lines, i.e., lines L and L+1, the stride prefetcher captures the stride access pattern and issues prefetch requests for lines L+2 and L+3. Instructions D and E therefore hit in the L2 Cache. However, since line S does not match the stride access pattern, instruction C still misses in the L2 Cache.
The effect of pre-execution is shown in case (c) of FIG. 2: while instruction A's main-memory access is in progress, the processor pre-executes instructions B, C, and D (denoted b', c', and d' in the figure) and initiates main-memory accesses to addresses L+1, S, and L+2 in advance. When the processor returns to normal execution, instructions B and C hit in the L2 Cache. Normal execution reuses the valid computation results produced during pre-execution, shortening the interval between instructions C and D; consequently, when instruction D executes, the prefetch for address L+2 has not yet completed and D still misses in the L2 Cache. In addition, pre-execution did not reach instruction E, so E also still misses in the L2 Cache.
The effect of the technical solution of the present invention is shown in case (d) of FIG. 2: while instruction A's main-memory access is in progress, the processor pre-executes instructions B and C (denoted b' and c' in the figure) and initiates main-memory accesses to addresses L+1 and S in advance. When the pre-execution of instruction B (b') misses in the L2 Cache, the stride prefetcher captures the stride access pattern (both lines L and L+1 have missed) and issues prefetch requests for lines L+2 and L+3. This not only prefetches for instruction E (line L+3) but also initiates the prefetch for instruction D earlier (line L+2, earlier than D's pre-execution, denoted d' in the figure). When the processor returns to normal execution, instructions B, C, D, and E all hit in the L2 Cache, giving the present invention a better result than either stride prefetching or pre-execution. It can be seen that the present invention improves both prefetch coverage and timeliness.
FIG. 3 illustrates the different effects of the two kinds of Secondary L2 Miss on updates to the stride prefetcher, as follows:
Memory-access instruction ① misses in the L2 Cache and causes the processor to switch to pre-execution mode. In pre-execution mode, memory-access instructions ② through ⑥ all miss in the L2 Cache.
The pre-execution proceeds as follows.
When instructions ① and ② are both observed to miss (lines L and L+1), the stride prefetcher captures the stride access pattern and issues prefetches for lines L+2 and L+3. Next, instruction ③ (line L) and instruction ⑤ (line L+1) both suffer Secondary L2 Misses, both triggered by pre-execution. Similarly, instruction ④ (line L+2) and instruction ⑥ (line L+3) also suffer Secondary L2 Misses, but triggered by the stride prefetcher.
The impact of this miss sequence on the stride prefetcher's accuracy is analyzed as follows.
If all Secondary L2 Misses update the stride prefetcher, then when the prefetcher observes miss ③ it stops prefetching subsequent data because the pattern match fails, then resets the prefetch pattern and returns to the training phase. The execution of instructions ④ and ⑤ makes consecutive lines L+2 and L+1 miss, at which point the new prefetch pattern finishes training and triggers prefetches of lines L and L-1 (useless prefetches in this example). If, instead, the pre-execution-triggered Secondary L2 Misses (misses ③ and ⑤) are barred from updating the stride prefetcher and only the prefetcher-triggered Secondary L2 Misses (misses ④ and ⑥) update it, then not only are the damage to the prefetch pattern from miss ③ and the subsequent useless prefetches of lines L and L-1 avoided, but when the prefetcher observes miss ④ the pattern match succeeds and it goes on to issue a prefetch request for line L+4, correctly prefetching for instruction ⑦. Thus, when instruction ⑦ executes, its memory-access latency can be fully eliminated or partially hidden.
According to the above analysis, using the information of prefetcher-triggered Secondary L2 Misses to update and train the stride prefetcher helps improve prefetch accuracy, fetching more of the data the program needs from main memory in advance. Using the information of pre-execution-triggered Secondary L2 Misses to update and train the prefetcher, however, may disrupt the prefetcher's capture of the correct access pattern. To let the stride prefetcher realize its full effect in a pre-execution processor, the present invention designs an update filter that can, at run time, effectively identify and filter out pre-execution-triggered Secondary L2 Misses, avoiding their destructive updates and training and improving prefetch accuracy.
FIG. 4 is the state-transition diagram of a processor employing the pre-execution guided data prefetching mechanism; the processor's execution proceeds as follows:
At the initial moment the processor is in normal execution mode. The stride prefetcher monitors the L2 Cache miss access sequence and automatically triggers prefetch requests when a stride access pattern is captured. When an L2 Cache access miss is detected, the processor backs up the current register state and then immediately switches to pre-execution mode.
In pre-execution mode, the processor continues executing the instructions following the L2-Cache-missing instruction but does not update the architectural state. Pre-executing data-independent instructions not only performs accurate data prefetching but also produces valid computation results. Meanwhile, the real memory-access information gathered during pre-execution is used to guide the prefetching process of the stride prefetcher. The update filter divides the captured real memory-access information into useful and harmful information: the useful information guides the stride prefetcher to issue prefetch requests early, while the harmful information is filtered out directly. After the L2-Cache-missing instruction that triggered pre-execution completes its main-memory access, the processor flushes the pipeline, restores the backed-up register state, and switches to merge-results mode.
In merge-results mode, the processor re-executes starting from the memory-access instruction that triggered pre-execution and commits the corresponding results. Valid pre-execution results saved in the IRB are merged directly into the architectural state; pre-executed instructions with invalid results are re-issued into the pipeline for execution. During this period, if another L2 Cache access miss is detected, the processor switches to pre-execution mode again. After all pre-executed instructions have committed, the processor returns to normal execution mode. FIG. 5 shows the structure of the stride prefetcher with an update filter, described in detail as follows: the update filter screens and filters the L2 Cache miss information, updating the stride prefetcher only with useful information; useless information is filtered out directly.
The hardware of the stride prefetcher consists mainly of a Stream Table and other related logic. The Stream Table is an 8-entry fully associative structure using a pseudo least-recently-used (LRU) replacement algorithm; each entry contains the following six fields:
Tag field: the upper 20 bits of the physical address, identifying the stream's prefetch storage region; using 20 bits means each stream's prefetch region is 4 KB (the embodiment of the present invention sets the region size to 4 KB; other embodiments may use other region sizes);
Valid field: indicates whether the stream is valid;
Direction field: indicates the stream's prefetch direction, 0 for backward prefetching and 1 for forward prefetching;
Stride field: indicates the stride of the access addresses in the stream;
Last Index field: records the index of the most recently missed L2 Cache line;
State field: indicates the stream's current phase, either the training phase or the prefetching phase. The remaining logic comprises three parts: stream-hit judgment logic, pattern-matching judgment logic, and prefetch-address generation logic. The stream-hit judgment logic uses the Tag of the L2 Cache access address to search the Stream Table associatively; if an entry's Tag field matches, a stream hit has occurred. On a stream hit, the pattern-matching judgment logic, according to the Direction field of the hit entry, compares the value of the Stride field with the difference between the Index of the L2 Cache access address and the Last Index; if they are equal, the pattern match succeeds, otherwise it fails. When a stream hit occurs and the pattern match succeeds, the prefetch-address generation logic computes the prefetch address from the Tag and Index of the L2 Cache access address, the prefetch distance (set to 2 in the embodiment of the present invention), and the values of the Direction and Stride fields of the hit entry.
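The Stream Table lookup just described can be sketched behaviorally. The field names (Tag, Direction, Stride, Last Index) and the 4 KB-region address split follow the description; the 64-byte line size and all identifiers are assumptions made for illustration, not the patent's RTL.

```python
REGION_BITS = 12   # each stream covers a 4 KB prefetch region
LINE_BITS = 6      # assumed 64-byte L2 cache line


def split_address(paddr):
    """Split a physical address into the stream Tag and the line Index."""
    tag = paddr >> REGION_BITS
    index = (paddr & ((1 << REGION_BITS) - 1)) >> LINE_BITS
    return tag, index


class StreamEntry:
    def __init__(self, tag, direction, stride, last_index):
        self.tag = tag
        self.direction = direction    # 1 = forward, 0 = backward
        self.stride = stride
        self.last_index = last_index


def prefetch_index(stream_table, paddr, distance=2):
    """Return the line index to prefetch, or None on a stream miss or a
    pattern mismatch."""
    tag, index = split_address(paddr)
    for entry in stream_table:                  # associative search on Tag
        if entry.tag != tag:
            continue                            # no hit in this entry
        # Pattern match: compare the stride with the index difference
        # taken in the stream's direction.
        step = (index - entry.last_index) if entry.direction \
               else (entry.last_index - index)
        if step != entry.stride:
            return None                         # stream hit, pattern mismatch
        sign = 1 if entry.direction else -1
        # Prefetch address: advance the current line by distance * stride.
        return index + sign * distance * entry.stride
    return None                                 # no stream hit
```

For instance, a forward stream with stride 1 whose last miss was line index 2 matches a new miss on index 3 and yields line index 5 as the next prefetch (two strides ahead).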
FIG. 6 shows the workflow of the update filter, which in detail is as follows: when an L2 Cache miss occurs, the processor compares the miss address against the addresses in the MSHRs in a fully associative manner. If an MSHR with the same address exists, the miss is a Secondary L2 Miss; otherwise it is a Primary L2 Miss. When a Primary L2 Miss occurs, the processor allocates an idle MSHR for it and initializes the corresponding filter bit: if the Primary L2 Miss was triggered by a prefetch request from the stride prefetcher, the filter bit is initialized to 0; otherwise it is initialized to 1. Later, when a Secondary L2 Miss occurs, the update filter reads and checks the filter bit in the MSHR corresponding to the missed line: if the filter bit is 1, the Secondary L2 Miss was triggered by pre-execution, so its update to the stride prefetcher is filtered out; if the filter bit is 0, the Secondary L2 Miss was triggered by the stride prefetcher, and it is allowed to update the stride prefetcher and issue prefetch requests early, while the filter bit of the corresponding MSHR is set to 1 to prevent subsequent Secondary L2 Misses on that line from updating the stride prefetcher again.
FIG. 7 shows the structure of a processor employing the pre-execution guided data prefetching mechanism; the newly added structures are described in detail as follows:
The Checkpoint backs up the architectural register file when the processor enters pre-execution mode;
The INV bits identify invalid registers in pre-execution mode;
The Store Cache saves the store data of Store instructions in pre-execution mode and forwards it to subsequent Load instructions for access and use;
The Instruction and Result Buffer (IRB) saves the computation results of pre-executed instructions and their validity states; the Stride Prefetcher prefetches the regular access patterns among L2 Cache accesses;
The Update Filter screens the L2 Cache miss information, filtering out harmful information and updating the stride prefetcher only with useful information.
Embodiment 2: a pre-execution guided data prefetching system
The data prefetching system of this embodiment mainly comprises:
a stride prefetcher for monitoring the L2 cache miss access sequence and automatically triggering prefetch requests when a stride access pattern is captured;
a processor for backing up the current register state and switching to pre-execution mode when an L2 cache access miss is detected; in pre-execution mode, for continuing to execute the instructions following the instruction that caused the L2 cache miss, performing accurate prefetching for irregular access patterns, saving the obtained pre-execution results and their validity states in the instruction and result buffer, and extracting useful information from the captured real memory-access information to guide the stride prefetcher to issue prefetch requests early; and, after the L2-cache-missing instruction that triggered pre-execution completes its main-memory access, for flushing the pipeline, restoring the backed-up register state, and resuming execution from the memory-access instruction that triggered pre-execution.
When monitoring the L2 cache miss access sequence and capturing the stride access pattern, the stride prefetcher prefetches forward or backward and divides streams using a storage-region partitioning method.
The stride prefetcher issues its first prefetch request when it observes that two consecutive L2 cache misses within the same stream match the stride access pattern.
The stride prefetcher stores the prefetched data in the L2 cache.
The system may further comprise:
an update filter for, when a Secondary L2 Miss occurs, reading the filter bit in the miss-status handling register corresponding to the Secondary-L2-missed line and determining the cause of the Secondary L2 Miss; for filtering out updates to the stride prefetcher for Secondary L2 Misses triggered by pre-execution; and, for Secondary L2 Misses triggered by the stride prefetcher, for updating the stride prefetcher and the filter bit;
wherein the processor, when a Primary L2 Miss occurs, allocates an idle miss-status handling register for it and initializes the register's filter bit; an L2 cache miss for which the requested line has already had a main-memory access initiated, but not yet completed, by the stride prefetcher or a pre-executed instruction is called the Secondary L2 Miss, and the remaining L2 cache misses are the Primary L2 Misses.
The processor does not update the architectural state while continuing to execute the instructions following the instruction that caused the L2 cache miss.
When resuming execution from the memory-access instruction that triggered pre-execution, the processor starts from that instruction, merges the pre-execution results saved in the instruction and result buffer into the architectural state, and re-issues into the pipeline the pre-executed instructions whose computation results are invalid, executing them and committing the corresponding results.
Those skilled in the art will appreciate that the components of the system embodiments and the steps of the method embodiments of the present invention described above can be implemented with general-purpose computing devices; they may be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, or they may be separately fabricated as individual integrated-circuit modules, or multiple modules or steps among them may be fabricated as a single integrated-circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
Although the embodiments of the present invention are disclosed above, the content described is only an implementation adopted to facilitate understanding of the invention and is not intended to limit it. Any person skilled in the art to which the present invention pertains may make modifications and variations in form and detail without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the present invention shall still be as defined by the appended claims.
Industrial Applicability
Compared with the prior art, the technical solution provided by the present invention uses pre-execution and stride prefetching separately to handle different memory-access patterns, uses the memory-access information gathered during pre-execution to guide the prefetching process of the Stride Prefetcher, and uses an Update Filter to optimize the pre-execution guidance process. In normal mode, the processor uses the stride prefetcher to prefetch regular access patterns; when an L2 Cache miss occurs, the processor enters pre-execution mode and executes subsequent instructions in advance to perform accurate prefetching for irregular access patterns, using the real memory-access information captured ahead of time during pre-execution to guide the prefetching process of the stride prefetcher, thereby effectively improving the processor's memory-latency tolerance.

Claims

Claims
1. A pre-execution guided data prefetching method, the method comprising:
monitoring, by a stride prefetcher, an L2 cache miss access sequence, and automatically triggering prefetch requests when a stride access pattern is captured;
when an L2 cache access miss is detected, backing up, by a processor, a current register state and switching to a pre-execution mode;
in the pre-execution mode, continuing, by the processor, to execute instructions following the instruction that caused the L2 cache miss, performing accurate prefetching for irregular access patterns, saving obtained pre-execution results and their validity states in an instruction and result buffer, and extracting required information from captured real memory-access information to guide the stride prefetcher to issue the prefetch requests;
after the L2-cache-missing instruction that triggered pre-execution completes its main-memory access, flushing, by the processor, a pipeline, restoring the backed-up register state, and resuming execution from the memory-access instruction that triggered pre-execution.
2. The method according to claim 1, wherein the step of the stride prefetcher monitoring the L2 cache miss access sequence and capturing the stride access pattern comprises:
prefetching, by the stride prefetcher, forward or backward, and dividing streams using a storage-region partitioning method.
3. The method according to claim 2, wherein:
the stride prefetcher issues the first prefetch request when it observes that two consecutive L2 cache misses within the same stream match the stride access pattern.
4. The method according to claim 1, wherein:
the stride prefetcher stores prefetched data in the L2 cache.
5. The method according to claim 1, wherein:
when a Primary L2 Miss occurs, the processor allocates an idle miss-status handling register for the Primary L2 Miss and initializes a filter bit of the miss-status handling register;
when a Secondary L2 Miss occurs, an update filter reads the filter bit in the miss-status handling register corresponding to the missed line and determines the cause of the Secondary L2 Miss; for a Secondary L2 Miss triggered by pre-execution, the update to the stride prefetcher is filtered out; for a Secondary L2 Miss triggered by the stride prefetcher, the stride prefetcher and the filter bit are updated;
wherein an L2 cache miss for which the requested line has already had a main-memory access initiated, but not yet completed, by the stride prefetcher or a pre-executed instruction is called the Secondary L2 Miss, and the remaining L2 cache misses are the Primary L2 Misses.
6. The method according to claim 1, wherein:
the processor does not update an architectural state while continuing to execute the instructions following the instruction that caused the L2 cache miss.
7. The method according to claim 1, wherein the step of the processor resuming execution from the memory-access instruction that triggered pre-execution comprises:
starting from the memory-access instruction that triggered pre-execution, merging, by the processor, the pre-execution results saved in the instruction and result buffer into the architectural state, and re-issuing into the pipeline the pre-executed instructions whose computation results are invalid, executing them and committing the execution results.
8. A pre-execution guided data prefetching system, comprising:
a stride prefetcher configured to monitor an L2 cache miss access sequence and automatically trigger prefetch requests when a stride access pattern is captured;
a processor configured to: back up a current register state and switch to a pre-execution mode when an L2 cache access miss is detected; in the pre-execution mode, continue executing instructions following the instruction that caused the L2 cache miss, prefetch irregular access patterns, save obtained pre-execution results and their validity states in an instruction and result buffer, and extract required information from captured real memory-access information to guide the stride prefetcher to issue the prefetch requests; and, after the L2-cache-missing instruction that triggered pre-execution completes its main-memory access, flush a pipeline, restore the backed-up register state, and resume execution from the memory-access instruction that triggered pre-execution.
9. The system according to claim 8, wherein:
the stride prefetcher is further configured to prefetch forward or backward when monitoring the L2 cache miss access sequence and capturing the stride access pattern, and to divide streams using a storage-region partitioning method.
10. The system according to claim 9, wherein:
the stride prefetcher is further configured to issue the first prefetch request when two consecutive L2 cache misses within the same stream are observed to match the stride access pattern.
11. The system according to claim 8, wherein:
the stride prefetcher is further configured to store prefetched data in the L2 cache.
12. The system according to claim 8, further comprising:
an update filter configured to: when a Secondary L2 Miss occurs, read the filter bit in the miss-status handling register corresponding to the Secondary-L2-missed line and determine the cause of the Secondary L2 Miss; filter out updates to the stride prefetcher for Secondary L2 Misses triggered by pre-execution; and, for Secondary L2 Misses triggered by the stride prefetcher, update the stride prefetcher and the filter bit;
wherein the processor is further configured to allocate an idle miss-status handling register for a Primary L2 Miss and initialize a filter bit of the miss-status handling register when the Primary L2 Miss occurs;
wherein an L2 cache miss for which the requested line has already had a main-memory access initiated, but not yet completed, by the stride prefetcher or a pre-executed instruction is called the Secondary L2 Miss, and the remaining L2 cache misses are the Primary L2 Misses.
13. The system according to claim 8, wherein:
the processor is further configured not to update an architectural state while continuing to execute the instructions following the instruction that caused the L2 cache miss.
14. The system according to claim 8, wherein:
the processor is further configured, when resuming execution from the memory-access instruction that triggered pre-execution, to start from the memory-access instruction that triggered pre-execution, merge the pre-execution results saved in the instruction and result buffer into the architectural state, and re-issue into the pipeline the pre-executed instructions whose computation results are invalid, executing them and committing the execution results.
PCT/CN2011/080813 2011-04-18 2011-10-14 Pre-execution guided data prefetching method and system WO2012142820A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110096900.7 2011-04-18
CN2011100969007A CN102156633A (zh) 2011-04-18 2011-04-18 Pre-execution guided data prefetching method and system

Publications (1)

Publication Number Publication Date
WO2012142820A1 true WO2012142820A1 (zh) 2012-10-26

Family

ID=44438141

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/080813 WO2012142820A1 (zh) 2011-04-18 2011-10-14 预执行指导的数据预取方法及系统

Country Status (2)

Country Link
CN (1) CN102156633A (zh)
WO (1) WO2012142820A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10609174B2 (en) 2017-04-11 2020-03-31 Microsoft Technology Licensing, Llc Parallel prefetching log/meta stream sub-portions to recreate partition states in a distributed computing system

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156633A (zh) * 2011-04-18 2011-08-17 北京北大众志微系统科技有限责任公司 Pre-execution guided data prefetching method and system
CN102385622B (zh) * 2011-10-25 2013-03-13 曙光信息产业(北京)有限公司 Read-ahead method for the stride access pattern of a file system
CN102521158B (zh) * 2011-12-13 2014-09-24 北京北大众志微系统科技有限责任公司 Method and apparatus for implementing data prefetching
CN103019657B (zh) * 2012-12-31 2015-09-16 东南大学 Reconfigurable system supporting data prefetching and reuse
CN104750696B (zh) * 2013-12-26 2018-07-20 华为技术有限公司 Data prefetching method and apparatus
CN106776371B (zh) * 2015-12-14 2019-11-26 上海兆芯集成电路有限公司 Stride reference prefetcher, processor, and method of prefetching data into a processor
CN109799897B (zh) * 2019-01-29 2019-11-26 吉林大学 Control method and apparatus for reducing GPU L2 cache energy consumption
CN114721726B (zh) * 2022-06-10 2022-08-12 成都登临科技有限公司 Method for multiple thread groups to fetch instructions in parallel, processor, and electronic device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100407165C (zh) * 2003-07-31 2008-07-30 飞思卡尔半导体公司 Prefetch control in a data processing system
CN101467135A (zh) * 2006-06-07 2009-06-24 先进微装置公司 Apparatus and method for prefetching data
CN102156633A (zh) * 2011-04-18 2011-08-17 北京北大众志微系统科技有限责任公司 Pre-execution guided data prefetching method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7343602B2 (en) * 2000-04-19 2008-03-11 Hewlett-Packard Development Company, L.P. Software controlled pre-execution in a multithreaded processor

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100407165C (zh) * 2003-07-31 2008-07-30 飞思卡尔半导体公司 Prefetch control in a data processing system
CN101467135A (zh) * 2006-06-07 2009-06-24 先进微装置公司 Apparatus and method for prefetching data
CN102156633A (zh) * 2011-04-18 2011-08-17 北京北大众志微系统科技有限责任公司 Pre-execution guided data prefetching method and system

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10609174B2 (en) 2017-04-11 2020-03-31 Microsoft Technology Licensing, Llc Parallel prefetching log/meta stream sub-portions to recreate partition states in a distributed computing system

Also Published As

Publication number Publication date
CN102156633A (zh) 2011-08-17

Similar Documents

Publication Publication Date Title
WO2012142820A1 (zh) 预执行指导的数据预取方法及系统
US7085955B2 (en) Checkpointing with a write back controller
US7925839B1 (en) System and method for performing memory operations in a computing system
US8041900B2 (en) Method and apparatus for improving transactional memory commit latency
TWI263169B (en) Method and data processing system having an external instruction set and an internal instruction set
JP4520790B2 (ja) 情報処理装置およびソフトウェアプリフェッチ制御方法
JP2916420B2 (ja) チェックポイント処理加速装置およびデータ処理方法
US9354918B2 (en) Migrating local cache state with a virtual machine
US11756618B1 (en) System and method for atomic persistence in storage class memory
US7958318B2 (en) Coherency maintaining device and coherency maintaining method
Lupon et al. FASTM: A log-based hardware transactional memory with fast abort recovery
US8601240B2 (en) Selectively defering load instructions after encountering a store instruction with an unknown destination address during speculative execution
JPH0670779B2 (ja) フェッチ方法
WO2008125534A1 (en) Checkpointed tag prefetching
TWI722438B (zh) 用於進行中操作的指令排序的裝置及方法
CN107003897B (zh) 监控事务处理资源的利用率
CN111274584A (zh) 一种基于缓存回滚以防御处理器瞬态攻击的装置
CN111259384B (zh) 一种基于缓存随机无效的处理器瞬态攻击防御方法
US20160070701A1 (en) Indexing accelerator with memory-level parallelism support
Duan et al. SCsafe: Logging sequential consistency violations continuously and precisely
Li et al. A fast restart mechanism for checkpoint/recovery protocols in networked environments
Qian et al. Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model
JP2001249846A (ja) キャッシュメモリ装置及びデータ処理システム
JP3320562B2 (ja) キャッシュメモリを有する電子計算機
US8065670B2 (en) Method and apparatus for enabling optimistic program execution

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11863855

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11863855

Country of ref document: EP

Kind code of ref document: A1