WO2023035654A1 - Offset prefetching method, apparatus for performing offset prefetching, computing device and medium - Google Patents

Offset prefetching method, apparatus for performing offset prefetching, computing device and medium

Info

Publication number
WO2023035654A1
Authority
WO
WIPO (PCT)
Prior art keywords
offset
prefetch
value
prefetcher
value table
Prior art date
Application number
PCT/CN2022/093310
Other languages
English (en)
French (fr)
Inventor
胡世文
Original Assignee
海光信息技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 海光信息技术股份有限公司
Publication of WO2023035654A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003: Arrangements for executing specific machine instructions
    • G06F 9/3004: Arrangements for executing specific machine instructions to perform operations on memory
    • G06F 9/30047: Prefetch instructions; cache control instructions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0806: Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 12/0811: Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Embodiments of the present disclosure relate to an offset prefetching method, an apparatus for performing offset prefetching, a computing device, and a medium.
  • In a central processing unit (CPU) architecture, program instructions and data are generally stored in a memory such as a dynamic random access memory (DRAM).
  • Because the operating frequency of the CPU core is far higher than that of the memory, the CPU may need to wait hundreds of clock cycles to obtain program instructions and data directly from memory, leaving the CPU unable to continue executing instructions and idling, which causes performance loss. Therefore, modern high-performance CPUs are equipped with multi-level cache architectures to store recently accessed data.
  • The data prefetcher is also used to identify the patterns of the CPU's data accesses, so that data likely to be accessed is prefetched into the cache in advance, allowing the CPU to read it quickly from the cache.
  • Some embodiments of the present disclosure provide an offset prefetching method, an apparatus for performing offset prefetching, a computing device, and a medium, which reduce the time required for the offset prefetcher to generate an offset prefetch value, thereby improving the efficiency of the offset prefetcher so that it can respond quickly to the CPU's demand for data prefetching and further improve the operating efficiency of the CPU system.
  • In some embodiments, an offset prefetching method includes: using an offset prefetcher to select, from a preset offset value table, K offset prefetch values used to generate prefetch requests, where the preset offset value table includes N preset offset values, the K offset prefetch values are those most recently selected in time by the offset prefetcher from the preset offset value table, and N and K are positive integers with N greater than K; recording the K offset prefetch values to form a recent offset value table containing the K offset prefetch values; and using the offset prefetcher to select a first offset prefetch value from the recent offset value table, so as to perform data prefetching based on the first offset prefetch value.
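The two-table idea above can be sketched in a few lines; the class name, K = 4, and the concrete offset values below are illustrative stand-ins, not values from the disclosure:

```python
from collections import deque

class RecentOffsetTable:
    """Keeps the K offset prefetch values most recently selected
    by the offset prefetcher from the preset offset value table."""

    def __init__(self, k):
        self.k = k
        self.entries = deque(maxlen=k)  # oldest entry is evicted automatically

    def record(self, offset):
        """Record an offset prefetch value chosen in a training phase."""
        self.entries.append(offset)

    def values(self):
        return list(self.entries)

# Preset table with N offsets; the recent table holds only K < N of them.
PRESET_OFFSETS = [1, 2, 3, 4, 6, 8, 12, 16, 24, 32]  # N = 10 (illustrative)
recent = RecentOffsetTable(k=4)
for chosen in (2, 8, 1, 16):   # offsets selected in the last 4 training phases
    recent.record(chosen)
print(recent.values())  # -> [2, 8, 1, 16]
```

Because later training phases search only these K entries instead of all N preset values, the per-phase search cost shrinks proportionally.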
  • the process of selecting the first offset prefetch value is represented as the first training phase
  • In some embodiments, the method further includes: for a second training phase in which a second offset prefetch value is selected, determining, based on a rule, whether to use the preset offset value table or the recent offset value table to select the second offset prefetch value, where the second offset prefetch value is used for data prefetching.
  • In some embodiments, the rule includes: determining whether the training score of the first offset prefetch value is greater than a score threshold, and if it is, determining to use the recent offset value table to select the second offset prefetch value.
  • In some embodiments, the rule includes: using the preset offset value table and the recent offset value table in an alternating manner to select offset prefetch values, where the alternating manner indicates, for example, that the recent offset value table is used in the first training phase and the preset offset value table is used in the second training phase.
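Both selection rules reduce to a small decision helper. In this minimal sketch the function name, the phase-parity convention, and the threshold value are assumptions for illustration only:

```python
def choose_table(phase_index, prev_score, score_threshold=None):
    """Decide which table the next training phase searches.

    Two example rules from the description (parameters are illustrative):
      - threshold rule: reuse the recent table only when the previous
        offset's training score exceeded a score threshold;
      - alternating rule: odd phases use the recent table, even phases
        fall back to the full preset table.
    """
    if score_threshold is not None:          # threshold rule
        return "recent" if prev_score > score_threshold else "preset"
    return "recent" if phase_index % 2 == 1 else "preset"  # alternating rule

assert choose_table(1, prev_score=25, score_threshold=20) == "recent"
assert choose_table(2, prev_score=10, score_threshold=20) == "preset"
assert choose_table(3, prev_score=0) == "recent"   # alternating, odd phase
assert choose_table(4, prev_score=0) == "preset"   # alternating, even phase
```

The threshold rule keeps the fast recent table only while it is producing offsets that score well; the alternating rule guarantees the full preset table is still consulted periodically so new offsets can enter the recent table.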
  • In some embodiments, the method further includes: when it is determined to use the preset offset value table to select the second offset prefetch value, using the offset prefetcher to select the second offset prefetch value from the preset offset value table and performing data prefetching based on it; or, when it is determined to use the recent offset value table, using the offset prefetcher to select the second offset prefetch value from the recent offset value table and performing data prefetching based on it.
  • the method further includes: updating the recent offset value table with the second offset prefetch value.
  • In some embodiments, updating the recent offset value table includes: determining whether the second offset prefetch value already exists in the recent offset value table; and, when it does not, updating the recent offset value table according to one of a first-in-first-out algorithm, a least-recently-used algorithm, or a pseudo-least-recently-used algorithm.
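The membership check and eviction policies can be sketched as follows, with an OrderedDict standing in for the hardware table; FIFO and LRU are shown, and pseudo-LRU is omitted for brevity (all names are illustrative):

```python
from collections import OrderedDict

def update_recent_table(table, new_offset, k, policy="fifo"):
    """Insert new_offset into the recent offset value table (size K)
    only if it is not already present, evicting per the chosen policy."""
    if new_offset in table:
        if policy == "lru":
            table.move_to_end(new_offset)  # refresh recency on a hit
        return
    if len(table) >= k:
        table.popitem(last=False)  # evict oldest (FIFO) / least recent (LRU)
    table[new_offset] = True

recent = OrderedDict.fromkeys([2, 8, 1, 16])
update_recent_table(recent, 16, k=4)        # already present: no change (FIFO)
update_recent_table(recent, 24, k=4)        # evicts 2, inserts 24
print(list(recent))  # -> [8, 1, 16, 24]
```

The duplicate check matters: if the newly selected offset is already tracked, the table contents do not change, so stable workloads keep a stable recent table.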
  • the offset prefetcher is an optimal offset prefetcher
  • the K offset prefetch values are K optimal offset prefetch values
  • In some embodiments, performing data prefetching based on the first offset prefetch value includes: determining the prefetch request address formed from the first offset prefetch value and the requested address; determining whether the data corresponding to the prefetch request address exists in the level-L cache; and, when it does not, triggering a prefetch request corresponding to the prefetch request address, where L is an integer greater than 1.
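A sketch of this check, assuming offsets are expressed in cache-line units and the cache is modeled as a set of line addresses (both are modeling assumptions, not stated in the text):

```python
def maybe_issue_prefetch(first_offset, request_addr, l2_cache, line_size=64):
    """Form the prefetch request address from the first offset prefetch
    value and the requested address, and trigger a prefetch request only
    when the corresponding data is absent from the L2 cache (L = 2 here)."""
    prefetch_addr = request_addr + first_offset * line_size  # offset in lines
    if prefetch_addr in l2_cache:
        return None            # data already cached: drop the request
    return prefetch_addr       # send this address to the next level

l2 = {0x1000, 0x1040}
assert maybe_issue_prefetch(1, 0x1000, l2) is None       # 0x1040 is cached
assert maybe_issue_prefetch(2, 0x1000, l2) == 0x1080     # miss: prefetch
```

Dropping requests that would hit the cache keeps prefetch traffic from wasting bandwidth on data already resident.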
  • In some embodiments, an apparatus for performing offset prefetching includes an offset prefetch unit configured to: use the offset prefetcher to select, from a preset offset value table, K offset prefetch values used to generate prefetch requests, where the preset offset value table includes N preset offset values, the K offset prefetch values are those most recently selected in time by the offset prefetcher from the preset offset value table, and N and K are positive integers with N greater than K; record the K offset prefetch values to form a recent offset value table containing the K offset prefetch values; and use the offset prefetcher to select a first offset prefetch value from the recent offset value table, so as to perform data prefetching based on the first offset prefetch value.
  • the process of selecting the first offset prefetch value is represented as the first training phase
  • In some embodiments, the offset prefetch unit is further configured to: for a second training phase in which a second offset prefetch value is selected, determine, based on a rule, whether to use the preset offset value table or the recent offset value table to select the second offset prefetch value, where the second offset prefetch value is used for data prefetching.
  • In some embodiments, the rule includes: determining whether the training score of the first offset prefetch value is greater than a score threshold, and if it is, determining to use the recent offset value table to select the second offset prefetch value.
  • In some embodiments, the rule includes: using the preset offset value table and the recent offset value table in an alternating manner to select offset prefetch values, where the alternating manner indicates, for example, that the recent offset value table is used in the first training phase and the preset offset value table is used in the second training phase.
  • In some embodiments, the offset prefetch unit is further configured to: when it is determined to use the preset offset value table, use the offset prefetcher to select the second offset prefetch value from the preset offset value table and perform data prefetching based on it; or, when it is determined to use the recent offset value table, use the offset prefetcher to select the second offset prefetch value from the recent offset value table and perform data prefetching based on it.
  • the offset prefetching unit is further configured to: update the recent offset value table with the second offset prefetch value.
  • In some embodiments, in order to update the recent offset value table, the offset prefetch unit is further configured to: determine whether the second offset prefetch value exists in the recent offset value table; and, if it does not, update the recent offset value table according to one of a first-in-first-out algorithm, a least-recently-used algorithm, or a pseudo-least-recently-used algorithm.
  • the offset prefetcher is an optimal offset prefetcher
  • the K offset prefetch values are K optimal offset prefetch values
  • In some embodiments, in order to perform data prefetching based on the first offset prefetch value, the offset prefetch unit is further configured to: determine the prefetch request address formed from the first offset prefetch value and the requested address; determine whether the data corresponding to the prefetch request address exists in the level-L cache; and, when it does not, trigger a prefetch request corresponding to the prefetch request address, where L is an integer greater than 1.
  • In some embodiments, a computing device includes: a processor; and a memory storing computer-readable code which, when executed by the processor, performs the offset prefetching method described above.
  • a non-transitory computer-readable storage medium on which instructions are stored, and when executed by a processor, the instructions cause the processor to execute the offset prefetching method as described above.
  • The offset prefetching method, the apparatus for performing offset prefetching, the computing device, and the medium construct the recent offset value table by recording the K offset prefetch values most recently selected in time by the offset prefetcher based on the preset offset value table, so that the offset prefetcher can use the constructed recent offset value table to generate offset prefetch values. This reduces the time the offset prefetcher needs to generate an offset prefetch value, thereby improving the efficiency of the offset prefetcher and further improving the operating efficiency of the CPU system.
  • FIG. 1 shows a schematic diagram of a multi-level cache architecture including a prefetcher
  • FIG. 2 shows a schematic flowchart of an offset prefetching method according to an embodiment of the present disclosure
  • FIG. 3 shows an overall schematic diagram of using an offset prefetcher to select an offset prefetch value from a preset offset value table according to an embodiment of the present disclosure
  • FIG. 4 shows a schematic block diagram of using an offset prefetcher to select an offset prefetch value from a preset offset value table according to an embodiment of the present disclosure
  • FIG. 5 shows another schematic flowchart of selecting an offset prefetch value from a preset offset value table by using an offset prefetcher according to an embodiment of the present disclosure
  • FIG. 6 shows an overall schematic diagram of using an offset prefetcher to alternately perform data prefetching based on a preset offset value table and a recent offset value table according to an embodiment of the present disclosure
  • FIG. 7 shows an execution flowchart of alternately performing data prefetching by using an offset prefetcher based on a preset offset value table and a recent offset value table according to an embodiment of the present disclosure
  • FIG. 8 shows another execution flow chart showing data prefetching performed by using an offset prefetcher based on a preset offset value table and a recent offset value table according to an embodiment of the present disclosure
  • FIG. 9 shows a schematic block diagram of an apparatus for performing offset prefetching according to an embodiment of the present disclosure.
  • FIG. 10 shows a schematic block diagram of a computing device according to an embodiment of the disclosure
  • FIG. 11 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the disclosure.
  • The cache refers to a high-speed memory that exchanges data with the CPU ahead of main memory; providing a cache allows the computer system to achieve higher performance.
  • When the CPU needs to read data, it first looks in the cache. If the data is in the cache, it is read immediately and sent to the CPU for processing. If it is not, the data is read from the relatively slow memory and sent to the CPU, and at the same time the data block containing it is transferred into the cache, so that subsequent reads of the same block come from the cache without accessing memory.
  • This reading mechanism gives the CPU a high cache hit rate: the data the CPU reads next is very likely already in the cache, and only a small portion must be read from memory. This greatly reduces the time the CPU spends reading memory directly, so the CPU rarely has to wait for data.
  • the order in which the CPU reads data is cache first and then memory.
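The cache-first read order with block fill can be modeled as follows; the 4-word block size and the dictionary-based cache/memory are illustrative choices, not values from the text:

```python
def cpu_read(addr, cache, memory, block_of):
    """Cache-first read: a hit returns immediately; on a miss the data is
    read from memory and its whole block is installed in the cache so
    later reads of the same block hit."""
    if addr in cache:
        return cache[addr], "hit"
    block = block_of(addr)                 # block (line) containing addr
    for a in block:                        # fill the cache with the block
        cache[a] = memory[a]
    return cache[addr], "miss"

memory = {a: a * 10 for a in range(16)}
cache = {}
block_of = lambda a: range(a - a % 4, a - a % 4 + 4)  # 4-word blocks (toy)
v1, r1 = cpu_read(5, cache, memory, block_of)  # miss: fills block 4..7
v2, r2 = cpu_read(6, cache, memory, block_of)  # hit: same block
assert (v1, r1) == (50, "miss")
assert (v2, r2) == (60, "hit")
```

Installing the whole block on a miss is what makes the second read a hit, which is exactly the locality the prefetcher later exploits.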
  • FIG. 1 shows a schematic diagram of a multi-level cache architecture including a prefetcher.
  • The CPU first accesses the level-1 (L1) cache, which has the fastest access speed but the smallest capacity; the level-2 (L2) cache is slower than the L1 cache but has a larger capacity.
  • The last-level cache (LLC) is the highest-level cache; in the architecture shown, the LLC is the L3 cache.
  • The CPU first searches for data in the L1 cache; if it is not found, the search proceeds downward level by level to memory, and the accessed data is then returned to the CPU level by level.
  • A prefetcher is also provided in the multi-level cache architecture to discover the patterns of the CPU's data accesses, so that the data and program instructions about to be accessed are prefetched into the cache.
  • If the prefetched content is program instructions, it is called an instruction prefetcher; if the prefetched content is data, it is called a data prefetcher.
  • The data prefetcher can be further subdivided into the L1 prefetcher (prefetching data into the L1 cache), the L2 prefetcher (prefetching data into the L2 cache), the LLC prefetcher (prefetching data into the last-level cache, LLC), and so on.
  • In FIG. 1, the L2 cache has an L2 prefetcher that prefetches data into the L2 cache by generating prefetch requests.
  • The L2 cache also verifies the prefetch request as if it were a normal (non-prefetch) data request: if the data corresponding to the prefetch request already exists in the L2 cache, the L2 cache discards the request, i.e. does not prefetch from the next-level cache/memory; if the data does not exist in the L2 cache, the prefetch request is sent to the lower-level cache/memory to prefetch the data into the L2 cache.
  • Data prefetching is one of the key technologies to improve CPU operating efficiency. Since the cache can only store recently accessed data, when the CPU needs to read data that has never been accessed or is replaced out of the cache due to cache size limitations, the CPU still needs to wait for tens or even hundreds of clock cycles to read data from memory, which will cause a performance penalty.
  • The data prefetcher prefetches the data to be used into the cache in advance by analyzing past data-access patterns, thereby reducing the clock cycles the CPU spends waiting for data and improving the overall performance of the CPU.
  • The offset prefetcher is a widely used data prefetcher. Based on the overall data-access pattern, it finds the offset prefetch value used to form the prefetch address, and generates prefetch requests based on the found offset prefetch value.
  • the prefetch address of the prefetch request is equal to the offset prefetch value plus the address of the program that triggers the prefetch.
  • the process by which the offset prefetcher finds the offset prefetch value may be referred to as a training process.
  • L1 prefetchers often use virtual addresses for training and send virtual-address prefetch requests.
  • The current offset prefetcher generally tries to find the most suitable offset prefetch value from a set of preset values, and this set is usually large. This reduces the efficiency of obtaining the offset prefetch value and cannot meet the high-speed operation requirements of the CPU.
  • Therefore, some embodiments of the present disclosure provide an offset prefetching method that improves the efficiency of the offset prefetcher by reducing the time it needs to obtain an offset prefetch value, so as to respond quickly to the CPU's demand for data prefetching and further improve the operating efficiency of the CPU system.
  • FIG. 2 shows a schematic flowchart of an offset prefetching method according to an embodiment of the present disclosure, and the offset prefetching method according to some embodiments of the present disclosure will be described below with reference to FIG. 2 .
  • The method also applies to processors such as graphics processing units (GPUs) and other processor types: as long as they have a cache architecture, an offset prefetcher can be implemented, without limitation here.
  • all terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
  • an offset prefetcher is used to select K offset prefetch values for generating a prefetch request from a preset offset value table.
  • The offset prefetcher may be one that prefetches data into any level of cache in a multi-level cache architecture, such as the L2 prefetcher shown in FIG. 1, or an L1 prefetcher or LLC prefetcher; there is no limitation here. For prefetchers at different levels, the process of selecting offset prefetch values is similar. Herein, the case where the offset prefetcher is an L2 prefetcher is described as a specific example.
  • the preset offset value table may include preset N offset values.
  • The K offset prefetch values obtained by the offset prefetcher are the offset prefetch values most recently selected in time by the offset prefetcher from the preset offset value table, and N is greater than K.
  • The process by which the offset prefetcher obtains an offset prefetch value can be called a training phase; for example, the process of generating the first offset prefetch value can be called the first training phase, and similarly the process of generating the K-th offset prefetch value is called the K-th training phase.
  • the training process and the prefetching process are carried out continuously and in parallel, and the end of one training phase is the beginning of the next training phase.
  • At the end of a training phase, an offset prefetch value is selected. For example, after the first offset prefetch value is generated, it is used to generate prefetch requests for the next period of time; synchronously, the offset prefetcher carries out the second training phase and selects the second offset prefetch value at its end, which is then used to generate prefetch requests for the following period, until a new offset prefetch value is selected in the third training phase. The offset prefetch values selected in different phases may be the same or different. The specific process by which the offset prefetcher selects offset prefetch values for generating prefetch requests from the preset offset value table is described in detail below.
  • In step S102, K offset prefetch values are recorded to form a recent offset value table including the K offset prefetch values.
  • the offset prefetcher is used to select a first offset prefetch value from the recent offset value table for data prefetching based on the first offset prefetch value.
  • The offset prefetcher forms a recent offset value table from the K offset prefetch values most recently selected from the preset offset value table, and outputs the offset prefetch value for the next phase based on the recent offset value table. Since K is smaller than N, i.e. the recent offset value table holds fewer offset values than the preset offset value table, the time for the offset prefetcher to select an offset prefetch value from the recent offset value table is shortened. This improves the efficiency of the offset prefetcher, which can then respond quickly to the CPU's demand for data prefetching and further improve the operating efficiency of the CPU system.
  • Data prefetching is realized by providing a simple, easy-to-implement recent offset value table, which effectively reduces the time consumed by the training phase while preserving the accuracy of data prefetching, thereby further improving the overall performance of the corresponding CPU.
  • the proposed scheme avoids the impact on the architecture of the existing prefetcher and CPU, and is convenient for wide application and realization.
  • FIG. 3 shows an overall schematic diagram of using an offset prefetcher to select an offset prefetch value from a preset offset value table according to an embodiment of the present disclosure.
  • the training process and the prefetching process are carried out continuously and in parallel.
  • In training phase 1, an offset prefetch value D1 is generated for data prefetching. While data prefetching using D1 proceeds, the offset prefetcher synchronously enters training phase 2 and generates a new offset prefetch value D2 for data prefetching in the next period of time, and so on.
  • Each training phase includes multiple training rounds, such as round 1, round 2, up to the maximum number of rounds (Round Max, RMax).
  • In each round, the offset prefetcher tests each offset value included in the table being used; if the test passes, the score of the tested offset value is increased, for example by 1. The optimal offset value is then selected according to the scores of the offset values: the offset value with the highest score is determined as the optimal offset value, i.e. as the offset prefetch value D1 for performing data prefetching.
  • The generated optimal offset value can also be set to zero, which means that no prefetch requests will be sent in this training phase.
  • the offset prefetcher shown in FIG. 3 that generates an optimal offset value (for example, the offset value with the highest score) by testing scores may also be referred to as an optimal offset prefetcher.
  • The offset prefetcher mentioned in step S101 may be the above optimal offset prefetcher that selects the optimal offset value based on the score; in this case, the above K offset prefetch values can be expressed as K optimal offset prefetch values.
  • the offset prefetchers involved in the present disclosure can also be other types of prefetchers, which is not limited here, and the method according to the embodiments of the present disclosure is also applicable to other offset prefetcher designs .
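The score-based selection described above, including the zero-offset fallback, reduces to a small helper. A minimal sketch, assuming SMin = 1 (the value used in a later example; it is system-configurable):

```python
def select_best_offset(scores, smin=1):
    """Pick the offset with the highest accumulated training score; if
    even the best score is not above SMin, return 0, meaning 'send no
    prefetch requests in the next phase'."""
    best_offset = max(scores, key=scores.get)
    if scores[best_offset] <= smin:
        return 0                       # too weak a pattern: prefetch nothing
    return best_offset

assert select_best_offset({1: 12, 2: 30, 4: 7}) == 2
assert select_best_offset({1: 1, 2: 0, 4: 1}) == 0   # all scores <= SMin
```

Returning 0 when every score is weak prevents low-confidence prefetches from polluting the cache with data that is unlikely to be used.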
  • FIG. 4 shows a schematic block diagram of using an offset prefetcher to select an offset prefetch value from a preset offset value table according to an embodiment of the present disclosure.
  • the offset prefetcher is an L2 prefetcher, that is, it receives a request from the L1 cache for prefetch training, and then generates a prefetch request to prefetch data to the L2 cache.
  • The L2 cache receives a data request from the L1 cache, which includes address X. Since the prefetcher shown in FIG. 4 is an L2 prefetcher, the address X is a physical address.
  • The physical address X used for training can be any address from the L1 cache, i.e. every address the L2 cache receives, or only an L2 cache miss address (denoted L2Miss), i.e. a physical address X whose corresponding data does not exist in the L2 cache. Determining whether the data corresponding to address X exists in the L2 cache is shown as "X?" in FIG. 4.
  • Herein, the case where the address used for training is an L2 cache miss address (i.e., L2Miss) is taken as an example, but it is not limited thereto.
  • Each offset value in the offset value table, which contains several offset values, is tested one by one; the table can be, for example, the above-mentioned preset offset value table or the above-mentioned recent offset value table.
  • For the offset value d currently being tested, it is subtracted from the test address X, and it is judged whether X - d hits the recent request table (Recent Request Table, RRT). If it hits, the score of the offset value d is increased by 1.
  • Corresponding counters are used to accumulate the scores of each offset value in the training phase.
  • The RRT may be a single-column table used to save a hash of physical addresses recently returned by L3 responses (for example, the hash value H2(Y-D) may be generated using a hash function H2), and the table is indexed using another hash of the physical address (for example, the hash value H1(Y-D) may be generated using a hash function H1).
  • The physical address Y returned by the L3 cache can also be filled into a queue first, so that recently returned physical addresses are not pushed into the RRT immediately but wait in the queue; in this way, addresses that were accessed recently, but not extremely recently, are stored in the RRT.
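A sketch of such an RRT with a delay queue; the table size, queue depth, and the concrete hash functions H1/H2 below are stand-ins, since the disclosure does not fix them:

```python
from collections import deque

class RecentRequestTable:
    """Single-column table: slot H1(addr) holds the tag hash H2(addr) of
    a recently completed physical address. H1/H2 here are illustrative."""

    def __init__(self, size=256, delay=4):
        self.slots = [None] * size
        self.size = size
        self.queue = deque()   # completed addresses wait here before insertion
        self.delay = delay

    def _h1(self, addr):
        return (addr >> 6) % self.size            # index hash (illustrative)

    def _h2(self, addr):
        return ((addr >> 6) * 2654435761) & 0xFF  # tag hash (illustrative)

    def on_fill(self, addr):
        """Called when the L3 response for addr returns; insertion is
        delayed so the RRT holds recent-but-not-too-recent addresses."""
        self.queue.append(addr)
        if len(self.queue) > self.delay:
            old = self.queue.popleft()
            self.slots[self._h1(old)] = self._h2(old)

    def hit(self, addr):
        return self.slots[self._h1(addr)] == self._h2(addr)
```

Storing a second hash as a tag, instead of the full address, keeps the table small while making accidental index collisions unlikely to report false hits.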
  • the optimal offset value D will be generated.
  • For the obtained address X+D, the L2 cache first judges whether the data corresponding to the address X+D already exists in the L2 cache; if not, a prefetch request to the L3 cache is generated based on the address X+D to prefetch the data into the L2 cache.
  • When the recent offset value table is used, the test time is significantly reduced because fewer offset values need to be tested; moreover, because the offset values it contains are the offset prefetch values most recently selected in time, the accuracy of the selected offset prefetch value is preserved, saving training time while ensuring prefetch accuracy.
  • FIG. 5 shows another flowchart of selecting an offset prefetch value from an offset value table by using an offset prefetcher.
  • the offset prefetcher shown in FIG. 5 can be the L2 prefetcher in FIG. 1 , which can be trained based on the L2Miss address and prefetch data to the L2 cache.
  • ScoreMax (SMax): the maximum score value. When the score of an offset value reaches SMax, the current training phase ends immediately and that offset value is selected as the best offset value.
  • ScoreMin (SMin): the minimum score value. When the training phase ends (for example, after 100 rounds of testing), if the score of the determined optimal offset value D is less than SMin, D can be set equal to 0 and no prefetch requests will be sent.
  • Round Max (RMax): the maximum number of test rounds.
  • Offsets: the multiple preset offset values from which the best offset value is selected.
  • NO: the number of offsets (Number of Offsets).
  • N: the total number of offset values included in the preset offset value table.
  • K: the total number of offset values included in the recent offset value table.
  • The constants listed above may be values configured by the system; for example, the maximum number of test rounds RMax may be set equal to 100 or another appropriate value, which is not limited here.
  • Round (R): records the current number of test rounds.
  • Offsets[P]: the offset value currently being tested.
  • Scores[P]: the score corresponding to the offset value currently being tested.
  • BestOffset (BO): the offset value corresponding to the best score BS.
  • the process shown therein may correspond to a training phase.
  • the L2Miss address X is used for testing.
  • In S502, it is judged whether Offsets[P] hits the RRT; if it does, S503 is performed: the score Scores[P] corresponding to the offset value Offsets[P] is increased, for example by 1, i.e. Scores[P]++. If Offsets[P] does not hit the RRT, the flow goes to S507.
  • if R is not less than RMax, the current round is the last round, that is, the current training stage is complete.
  • S5011: judge whether the obtained BS is greater than the preset minimum score value SMin.
  • SMin can be set equal to 1. If BS is not greater than SMin, proceed to S5012: D is set equal to 0 and no prefetch request is sent; that is, data prefetching based on an offset value with a low score is prohibited, to avoid unnecessary data occupying cache space.
  • if BS is greater than SMin, go to S5013: determine the offset value BO as the best offset value D selected in the current training stage, and assign the score BS of the offset value BO to the score S of D.
  • the optimal offset value D can be generated.
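The per-phase flow above (S501 through S5013) can be sketched as follows. This is a minimal sketch: the constant values SMAX, SMIN, and RMAX, and the `rrt_hit` callback standing in for the RRT lookup, are illustrative assumptions rather than values fixed by the disclosure.

```python
# Sketch of one training phase of the offset prefetcher in FIG. 5.
SMAX = 31   # ScoreMax: end the phase early when a score reaches this
SMIN = 1    # ScoreMin: discard the winner if its score is not above this
RMAX = 100  # Round Max: maximum number of test rounds

def train_phase(offsets, rrt_hit):
    """Return the best offset D (0 = no prefetching) and its score.

    offsets : the offset value table (preset or recent) being tested
    rrt_hit : callback emulating the RRT lookup for one tested offset
    """
    scores = [0] * len(offsets)
    for r in range(RMAX):                       # rounds 1..RMax
        for p, off in enumerate(offsets):       # test each offset once
            if rrt_hit(off, r):
                scores[p] += 1                  # Scores[P]++
                if scores[p] >= SMAX:           # early termination at SMax
                    return off, scores[p]
    bs = max(scores)                            # best score BS
    bo = offsets[scores.index(bs)]              # BestOffset BO
    if bs <= SMIN:                              # low-confidence winner:
        return 0, bs                            # D = 0, no prefetches sent
    return bo, bs
```

For example, if only offset 2 ever hits the RRT, the phase terminates early once its score reaches SMAX; if nothing ever hits, the phase returns D = 0 and no prefetch requests are generated.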
  • the optimal offset value D will be used for data prefetch in the next period of time, and the prefetch address is equal to the offset value D plus the physical address X that triggers the prefetch.
  • performing data prefetching based on the first offset prefetch value includes: determining the prefetch request address formed from the offset prefetch value and the requested address; determining whether the data corresponding to the prefetch request address exists in the level-L cache; and, when the data corresponding to the prefetch request address does not exist in the level-L cache, triggering a prefetch request corresponding to the prefetch request address, where L is an integer greater than 1.
  • the L2 cache can also verify a prefetch request based on the offset value D: if the data corresponding to the prefetch request already exists in the L2 cache, the L2 cache discards the request, that is, the next-level cache/memory is not accessed. If the data corresponding to the prefetch request does not exist in the L2 cache, the prefetch request is sent to the lower-level cache/memory to prefetch the data into the L2 cache.
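The check above — compose the prefetch address from the trigger address X and the trained offset D, drop the request if the data is already cached, otherwise forward it — can be sketched as follows. Modeling the cache as a set of block addresses and the function names are assumptions of this sketch.

```python
# Sketch of issuing one offset prefetch from the L2 prefetcher.
def issue_prefetch(x, d, l2_cache, send_to_next_level):
    """Prefetch address = trigger address X plus best offset D."""
    if d == 0:                      # no valid offset was trained this phase
        return None
    addr = x + d                    # compose the prefetch request address
    if addr in l2_cache:            # data already present: drop the request
        return None
    send_to_next_level(addr)        # fetch from the lower level into L2
    l2_cache.add(addr)
    return addr
```

Note the D == 0 guard: it implements the rule above that a training phase ending with a below-threshold score sends no prefetch requests at all.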
  • the L2 prefetcher will follow the process shown in FIG. 5 for the next training stage to generate a new optimal offset value (for example, denoted D2).
  • the new optimal offset value D2 will also be used for data prefetching in a subsequent period of time until the next new optimal offset value D3 is generated, and so on.
  • since the L2 prefetcher is trained on physical addresses, the system also needs to check whether the address X that triggers the prefetch request and the address X+D are in the same memory page; if they are not in the same page, no prefetch request based on X+D is sent, because the requested address would exceed the permitted address-access range.
  • the implementation of data prefetching by the L2 prefetcher is described in detail in conjunction with FIG. 3 to FIG. 5. It can be understood that the process is similar whether the L2 prefetcher performs data prefetching based on the preset offset value table or on the recent offset value table.
  • the offset prefetching method may further include: updating the recent offset value table with the second offset prefetch value.
  • updating the recent offset value table includes: determining whether the second offset prefetch value exists in the recent offset value table; and, when the second offset prefetch value does not exist in the recent offset value table, updating the recent offset value table according to one of the first-in-first-out algorithm, the least-recently-used algorithm, or the pseudo-least-recently-used algorithm.
  • the table ROT can also be updated based on the offset value D2: if the offset value D2 is not in the ROT, D2 can be inserted into the ROT.
  • a certain offset value in the ROT may be replaced according to a replacement policy.
  • the replacement policy may be a first-in-first-out rule, or may be based on other replacement rules, such as a Least Recently Used (LRU) rule or a Pseudo Least Recently Used (PLRU) rule.
  • the LRU information corresponding to the offset value D can also be updated; this LRU information records usage of the offset value, for example, the number of times the offset value has been used.
  • in order to balance the time required for the prefetcher's training phase against prefetch accuracy, the table to use can be selected according to rules, and training can be performed based on the selected table.
  • the process of selecting the first offset prefetch value D1 in step S101 can be regarded as the first training phase; thus, for the second training phase that selects the second offset prefetch value D2, whether to use the preset offset value table or the recent offset value table to select D2 may be determined based on a predetermined rule.
  • the offset prefetching method may further include: when it is determined that the preset offset value table is used to select the second offset prefetch value, using the offset prefetcher to select the second offset prefetch value from the preset offset value table and performing data prefetching based on it; or, when it is determined that the recent offset value table is used to select the second offset prefetch value, using the offset prefetcher to select the second offset prefetch value from the recent offset value table and performing data prefetching based on it.
  • the above rule may refer to alternately using the preset offset value table and the recent offset value table to select the second offset prefetch value.
  • this alternating manner means that the preset offset value table and the recent offset value table are used alternately, one after the other: if the first offset prefetch value D1 was selected using the recent offset value table in the first training stage, the preset offset value table is used to select the second offset prefetch value D2, then the recent offset value table is used again to select the third offset prefetch value D3, and so on.
  • FIG. 6 shows an overall schematic diagram of using an offset prefetcher to alternately perform data prefetching based on a preset offset value table and a recent offset value table according to an embodiment of the present disclosure.
  • at least K training phases have been performed based on the preset offset value table, i.e., a recent offset value table including the predetermined number K of offset prefetch values has been formed.
  • FIG. 6 schematically shows the process of alternately using the preset offset value table and the recent offset value table.
  • the training phase based on the preset offset value table is called the full training phase, and the training phase based on the recent offset value table is called the fast training phase.
  • the full training phase is performed on the preset offset value table: for example, RMax rounds are executed in sequence, and each round tests the N offset values in the preset offset value table one by one, so that the score of each offset value is accumulated and the optimal offset value D1 is determined based on the scores.
  • the fast training phase is then entered, that is, testing based on the recent offset value table: each round tests the K offset values in the recent offset value table one by one to accumulate the score of each offset value, and the optimal offset value D2 is then determined based on the scores.
  • the next step is to enter the full training phase again; that is, the full training phase alternates with the fast training phase.
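The alternation between full (POT-based) and fast (ROT-based) phases described above can be sketched with a flag that flips after every phase, as in FIG. 6 and FIG. 7. The `train_phase` and `update_rot` callbacks stand in for the training machinery described earlier and are assumptions of this sketch.

```python
# Sketch of alternating full and fast training phases: the FT flag picks
# the preset table (POT) or the recent table (ROT) and flips each phase.
def run_phases(pot, rot, train_phase, update_rot, n_phases):
    """Return (table_name, best_offset) for n_phases alternating phases."""
    ft = True                                   # start with full training
    results = []
    for _ in range(n_phases):
        table = pot if ft or not rot else rot   # fall back to POT while
        d = train_phase(table)                  #   the ROT is still empty
        update_rot(rot, d)                      # refresh the ROT with D
        results.append(("POT" if table is pot else "ROT", d))
        ft = not ft                             # flip FT for the next phase
    return results
```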
  • FIG. 7 shows an execution flow chart of using an offset prefetcher to alternately perform data prefetching based on a preset offset value table and a recent offset value table according to an embodiment of the present disclosure.
  • the variable FT representing the full training phase (Full Training, FT) can be set to true as its initial value.
  • the prefetcher performs training according to the selected table, and generates an optimal offset value D.
  • the ROT is updated: for example, if the generated optimal offset D is not in the ROT, the offset value D is inserted into the ROT. If the number of offset values in the ROT has already reached K, an offset value in the ROT may be replaced according to a replacement policy (for example, LRU).
  • FT is inverted: for example, if the current FT is true (for example, expressed as "0"), it is inverted to false (for example, expressed as "1") in this step; then return to S701 and re-determine the table to use for training.
  • the above rule may also be to determine whether the training score S1 of the first offset prefetch value D1 is greater than a score threshold; if the training score is greater than the score threshold, it is determined that the recent offset value table is used to select the second offset prefetch value.
  • in this case the tables do not alternate one by one; instead, the choice is judged from the score of the generated offset prefetch value. If the training score S1 of D1 is greater than the score threshold, the offset value produced from the recent offset value table is relatively accurate and meets the prefetching requirements, so training can continue based on the recent offset value table.
  • the score threshold can be preset by the system.
  • FIG. 8 shows another execution flow diagram illustrating data prefetching by using an offset prefetcher based on a preset offset value table and a recent offset value table according to an embodiment of the present disclosure.
  • steps S801 - S808 are similar to the processes of steps S701 - S708 shown in FIG. 7 , and will not be described again.
  • step S809 is added: after updating the ROT, it is judged whether FT is false and a specific condition is met, wherein, for example, the specific condition can be whether the training score based on D1 is greater than the score threshold, that is, BS > ST.
  • if FT is false, the current training stage is the fast training stage.
  • the purpose of the specific condition is to judge whether to continue fast training, for example by judging whether BS is greater than the score threshold ST. If both conditions are met, flow can return to S801; then, in the judgment of S802, because FT is false, S803 is entered, that is, the ROT is selected for training.
  • in S809, a maximum number of consecutive fast training phases can also be set; when that number is reached, S806 is entered forcibly, so that FT is flipped back to true.
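The decision made after a fast phase in the FIG. 8 variant — keep training on the ROT while the best score stays above the threshold, but cap the number of consecutive fast phases — can be sketched as a single predicate. The values of ST and MAX_FAST are illustrative assumptions.

```python
# Sketch of the FIG. 8 continuation test applied after a fast (ROT-based)
# training phase.
ST = 10        # score threshold: below this, fall back to full training
MAX_FAST = 3   # assumed cap on consecutive fast-training phases

def stay_in_fast(bs, fast_streak):
    """After a fast phase: keep using the ROT while the best score BS is
    still good and the consecutive-fast-training cap is not reached."""
    return bs > ST and fast_streak < MAX_FAST
```

When the predicate is false, FT is flipped back to true and the next phase trains on the full preset table, re-checking whether a better offset exists outside the ROT.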
  • the most recently generated optimal offsets are often the optimal offsets that will be used next. This makes it possible to guarantee accuracy when selecting an offset value from the recent offset value table, even though that table contains only a small number of offset values.
  • the methods provided by some embodiments of the present disclosure are based on this insight into CPU program behavior and data regularity. Using the offset prefetching method according to some embodiments of the present disclosure, a recent offset value table with fewer offset values can be formed, so that offset prefetcher training can be performed based on both the preset offset value table and the recent offset value table.
  • the efficiency of the offset prefetcher can be improved by reducing the time it needs to generate an offset prefetch value, which allows quick response to the CPU's demand for data prefetching and further improves the operating efficiency of the CPU system.
  • Fig. 9 shows a schematic block diagram of an apparatus for performing offset prefetching according to an embodiment of the present disclosure.
  • an apparatus 1000 for performing offset prefetching may include an offset prefetch unit 1010 .
  • the offset prefetching unit 1010 in the apparatus 1000 for performing offset prefetching may be configured to perform the following steps: use the offset prefetcher to select, from the preset offset value table, K offset prefetch values for generating prefetch requests, wherein the preset offset value table includes N preset offset values and the K offset prefetch values are those most recently selected in time by the offset prefetcher from the preset offset value table, where N and K are positive integers and N is greater than K; record the K offset prefetch values to form a recent offset value table including the K offset prefetch values; and use the offset prefetcher to select a first offset prefetch value from the recent offset value table for data prefetching based on the first offset prefetch value.
  • the offset prefetcher may be one that prefetches data into any level of cache in a multi-level cache architecture, such as the L2 prefetcher shown in FIG. 1, or it may be an L1 prefetcher or an LLC prefetcher, etc.; there is no limitation here.
  • the process of selecting an offset prefetch value is similar.
  • the preset offset value table may include preset N offset values.
  • the K offset prefetch values obtained with the offset prefetcher are the offset prefetch values most recently selected in time by the offset prefetcher from the preset offset value table, and N is greater than K.
  • the process by which the offset prefetcher obtains an offset prefetch value can be called a training phase: for example, the process of generating the first offset prefetch value can be called the first training phase; similarly, the process of generating the K-th offset prefetch value is called the K-th training phase.
  • the training process and the prefetching process are carried out continuously and in parallel, and the end of one training phase is the beginning of the next training phase.
  • an offset prefetch value is selected at the end of each training phase. For example, after the first offset prefetch value is generated, it is used to generate prefetch requests for the next period of time; in parallel, the offset prefetcher carries out the second training phase and selects the second offset prefetch value at its end, which is then used to generate prefetch requests for a period of time thereafter, until a new offset prefetch value is selected in the third training phase. The offset prefetch values selected in the various stages may be the same or different.
  • the offset prefetcher forms a recent offset value table from the K offset prefetch values most recently selected from the preset offset value table, and outputs the offset prefetch value for the next stage based on that recent offset value table. Since K is smaller than N, that is, the recent offset value table holds fewer offset values than the preset offset value table, the time the offset prefetcher needs to select an offset prefetch value from the recent offset value table is shortened.
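The saving claimed above can be checked with the example values used elsewhere in the disclosure (N = 52, K = 4, RMax = 100), assuming one trigger address is consumed per tested offset:

```python
# Worked check of the training-time saving: a full phase over the preset
# table needs N * RMax trigger addresses, a fast phase over the recent
# table only K * RMax.
N, K, RMAX = 52, 4, 100

full_phase_triggers = N * RMAX   # 5200 L2-miss addresses per full phase
fast_phase_triggers = K * RMAX   # 400 per fast phase
speedup = full_phase_triggers / fast_phase_triggers   # 13x fewer triggers
```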
  • the improved efficiency of the prefetcher allows quick response to the CPU's demand for data prefetching and further improves the operating efficiency of the CPU system.
  • data prefetching is realized by providing a simple, easy-to-implement recent offset value table, which effectively reduces the time consumed by the training phase while ensuring the accuracy of data prefetching, thereby further improving the overall performance of the corresponding CPU.
  • the proposed scheme avoids the impact on the architecture of the existing prefetcher and CPU, and is convenient for wide application and realization.
  • the process of selecting the first offset prefetch value may be represented as a first training phase.
  • the offset prefetching unit 1010 may also be configured to: for the second training phase that selects the second offset prefetch value, determine based on rules whether the preset offset value table or the recent offset value table is used to select the second offset prefetch value, wherein the second offset prefetch value is used for data prefetching.
  • the training process and the prefetching process are carried out continuously and in parallel.
  • the first offset prefetching value will be generated for data prefetching.
  • the offset prefetcher enters the second training phase synchronously, and generates a second offset prefetch value for data prefetch in the next period of time, and so on.
  • the above rules include: using the preset offset value table and the recent offset value table alternately to select the second offset prefetch value, wherein the alternate manner means that, when the recent offset value table was used in the first training stage, the preset offset value table is used in the second training stage to select the second offset prefetch value.
  • the above rules include: determining whether the training score of the first offset prefetch value is greater than a score threshold, and, when it is, using the recent offset value table to select the second offset prefetch value.
  • the recent offset value table and the preset offset value table need not alternate; instead, the choice is judged from the score of the generated offset prefetch value. For example, if the training score S1 of D1 is greater than the score threshold, the offset value generated from the recent offset value table is relatively accurate and meets the prefetch requirements, so training can continue based on the recent offset value table.
  • the score threshold (Score Threshold, ST) can be preset by the system.
  • the offset prefetch unit 1010 may also be configured to: when it is determined that the preset offset value table is used, select the second offset prefetch value from the preset offset value table with the offset prefetcher and perform data prefetching based on it; or, when it is determined that the recent offset value table is used, select the second offset prefetch value from the recent offset value table with the offset prefetcher and perform data prefetching based on it.
  • the offset prefetching unit 1010 may also be configured to: use the second offset prefetch value to update the recent offset value table.
  • in order to update the recent offset value table, the offset prefetch unit 1010 is further configured to: determine whether the second offset prefetch value exists in the recent offset value table; and, when it does not, update the recent offset value table according to one of the first-in-first-out algorithm, the least-recently-used algorithm, or the pseudo-least-recently-used algorithm.
  • the offset prefetcher is an optimal offset prefetcher
  • the K offset prefetch values are K optimal offset prefetch values
  • in order to perform data prefetching based on the first offset prefetch value, the offset prefetch unit 1010 is further configured to: determine the prefetch request address composed of the first offset prefetch value and the requested address; determine whether the data corresponding to the prefetch request address exists in the level-L cache; and, when the data corresponding to the prefetch request address does not exist in the level-L cache, trigger a prefetch request corresponding to the prefetch request address, where L is an integer greater than 1.
  • FIG. 10 shows a schematic block diagram of a computing device according to an embodiment of the disclosure.
  • the computing device 2000 may include a processor 2010 and a memory 2020 .
  • computer-readable codes are stored in the memory 2020, and when the computer-readable codes are executed by the processor 2010, the above-mentioned offset prefetching method can be executed.
  • the processor 2010 can perform various actions and processes according to programs stored in the memory 2020 .
  • the processor 2010 may be an integrated circuit with signal processing capabilities.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc., and may be an X86 architecture or an ARM architecture, or the like.
  • the processor here may refer to a CPU with a multi-level cache architecture, in which an offset prefetcher is provided for the multi-level cache architecture; the offset prefetcher is configured to execute the offset prefetching method of some embodiments, which improves the efficiency of the offset prefetcher by reducing the time required to find the offset prefetch value, so that it can quickly respond to the demand for data prefetching and further improve the operating efficiency of the CPU.
  • the memory 2020 stores computer-executable instruction codes, which are used to implement the offset prefetch method according to the embodiments of the present disclosure when executed by the processor 2010 .
  • Memory 2020 can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. It should be noted that the memory described herein may be any suitable type of memory.
  • a processor such as a CPU can implement fast data prefetching, wherein the offset prefetcher quickly generates offset prefetch values for data prefetch requests based on the maintained preset offset value table and recent offset value table, so that the CPU can process data and run program instructions more efficiently, thereby improving the system performance of the CPU.
  • FIG. 11 shows a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the disclosure.
  • Computer readable storage media include, but are not limited to, for example, volatile memory and/or nonvolatile memory.
  • a computer program product or computer program comprising computer readable instructions stored in a computer readable storage medium.
  • a processor of a computer device may read the computer-readable instructions from a computer-readable storage medium, and the processor executes the computer-readable instructions, so that the computer device executes the offset prefetching method described in the foregoing embodiments.
  • with the offset prefetching method, the apparatus for performing offset prefetching, the computing device, and the medium, the offset prefetch values most recently selected in time by the offset prefetcher based on the preset offset value table are recorded to construct the recent offset value table, so that the offset prefetcher can use the constructed recent offset value table to generate offset prefetch values. This reduces the time the offset prefetcher needs to generate an offset prefetch value, thereby improving the efficiency of the offset prefetcher and further improving the operating efficiency of the CPU system.


Abstract

The present disclosure provides an offset prefetching method, an apparatus for performing offset prefetching, a computing device, and a medium. The offset prefetching method includes: using an offset prefetcher to select, from a preset offset value table, K offset prefetch values for generating prefetch requests, wherein the preset offset value table includes N preset offset values and the K offset prefetch values are the offset prefetch values most recently selected in time by the offset prefetcher from the preset offset value table, N and K being positive integers with N greater than K; recording the K offset prefetch values to form a recent offset value table including the K offset prefetch values; and using the offset prefetcher to select a first offset prefetch value from the recent offset value table for data prefetching based on the first offset prefetch value.

Description

Offset prefetching method, apparatus for performing offset prefetching, computing device, and medium

This application claims priority to Chinese patent application No. 202111054692.4, filed on September 9, 2021, the entirety of which is incorporated herein by reference as part of this application.

Technical Field

Embodiments of the present disclosure relate to an offset prefetching method, an apparatus for performing offset prefetching, a computing device, and a medium.

Background

In a central processing unit (CPU) architecture, program instructions and data are generally stored in memory such as dynamic random access memory (DRAM). The operating frequency of a CPU core is usually far higher than that of the memory; therefore, fetching program instructions and data directly from memory forces the CPU to wait hundreds of CPU clock cycles, causing the CPU to idle because it cannot continue executing the related instructions, and resulting in performance loss. Modern high-performance CPUs are therefore equipped with a multi-level cache architecture to store recently accessed data. In addition, for the multi-level cache architecture, a data prefetcher is used to identify the patterns in which the CPU accesses data, so that data likely to be accessed is prefetched into the cache in advance, allowing the CPU to read data quickly from the cache.
Summary

Some embodiments of the present disclosure provide an offset prefetching method, an apparatus for performing offset prefetching, a computing device, and a medium, which improve the efficiency of an offset prefetcher by reducing the time it needs to generate an offset prefetch value, so that the CPU's demand for data prefetching can be met quickly and the operating efficiency of the CPU system further improved.

According to one aspect of the present disclosure, an offset prefetching method is provided, including: using an offset prefetcher to select, from a preset offset value table, K offset prefetch values for generating prefetch requests, wherein the preset offset value table includes N preset offset values and the K offset prefetch values are the offset prefetch values most recently selected in time by the offset prefetcher from the preset offset value table, N and K being positive integers with N greater than K; recording the K offset prefetch values to form a recent offset value table including the K offset prefetch values; and using the offset prefetcher to select a first offset prefetch value from the recent offset value table for data prefetching based on the first offset prefetch value.

According to some embodiments of the present disclosure, the process of selecting the first offset prefetch value is denoted the first training phase, and the method further includes: for a second training phase that selects a second offset prefetch value, determining based on rules whether the preset offset value table or the recent offset value table is used to select the second offset prefetch value, wherein the second offset prefetch value is used for data prefetching.

According to some embodiments of the present disclosure, the rules include: determining whether the training score of the first offset prefetch value is greater than a score threshold, and, when the training score is greater than the score threshold, determining that the recent offset value table is used to select the second offset prefetch value.

According to some embodiments of the present disclosure, the rules include: using the preset offset value table and the recent offset value table alternately to select the second offset prefetch value, wherein the alternate manner means that, when the recent offset value table was used in the first training phase, the preset offset value table is used in the second training phase to select the second offset prefetch value.

According to some embodiments of the present disclosure, the method further includes: when it is determined that the preset offset value table is used to select the second offset prefetch value, using the offset prefetcher to select the second offset prefetch value from the preset offset value table and performing data prefetching based on the second offset prefetch value; or, when it is determined that the recent offset value table is used to select the second offset prefetch value, using the offset prefetcher to select the second offset prefetch value from the recent offset value table and performing data prefetching based on the second offset prefetch value.

According to some embodiments of the present disclosure, the method further includes: updating the recent offset value table with the second offset prefetch value.

According to some embodiments of the present disclosure, updating the recent offset value table includes: determining whether the second offset prefetch value exists in the recent offset value table; and, when the second offset prefetch value does not exist in the recent offset value table, updating the recent offset value table according to one of the first-in-first-out algorithm, the least-recently-used algorithm, or the pseudo-least-recently-used algorithm.

According to some embodiments of the present disclosure, the offset prefetcher is a best-offset prefetcher, and the K offset prefetch values are K best offset prefetch values.

According to some embodiments of the present disclosure, when the offset prefetcher is an offset prefetcher for the level-L cache, performing data prefetching based on the first offset prefetch value includes: determining the prefetch request address composed of the first offset prefetch value and the requested address; determining whether the data corresponding to the prefetch request address exists in the level-L cache; and, when the data corresponding to the prefetch request address does not exist in the level-L cache, triggering a prefetch request corresponding to the prefetch request address, where L is an integer greater than 1.
According to another aspect of the present disclosure, an apparatus for performing offset prefetching is provided. The apparatus includes an offset prefetch unit configured to: use an offset prefetcher to select, from a preset offset value table, K offset prefetch values for generating prefetch requests, wherein the preset offset value table includes N preset offset values and the K offset prefetch values are the offset prefetch values most recently selected in time by the offset prefetcher from the preset offset value table, N and K being positive integers with N greater than K; record the K offset prefetch values to form a recent offset value table including the K offset prefetch values; and use the offset prefetcher to select a first offset prefetch value from the recent offset value table for data prefetching based on the first offset prefetch value.

According to some embodiments of the present disclosure, the process of selecting the first offset prefetch value is denoted the first training phase, and the offset prefetch unit is further configured to: for a second training phase that selects a second offset prefetch value, determine based on rules whether the preset offset value table or the recent offset value table is used to select the second offset prefetch value, wherein the second offset prefetch value is used for data prefetching.

According to some embodiments of the present disclosure, the rules include: determining whether the training score of the first offset prefetch value is greater than a score threshold, and, when the training score is greater than the score threshold, determining that the recent offset value table is used to select the second offset prefetch value.

According to some embodiments of the present disclosure, the rules include: using the preset offset value table and the recent offset value table alternately to select the second offset prefetch value, wherein the alternate manner means that, when the recent offset value table was used in the first training phase, the preset offset value table is used in the second training phase to select the second offset prefetch value.

According to some embodiments of the present disclosure, the offset prefetch unit is further configured to: when it is determined that the preset offset value table is used to select the second offset prefetch value, select the second offset prefetch value from the preset offset value table with the offset prefetcher and perform data prefetching based on the second offset prefetch value; or, when it is determined that the recent offset value table is used to select the second offset prefetch value, select the second offset prefetch value from the recent offset value table with the offset prefetcher and perform data prefetching based on the second offset prefetch value.

According to some embodiments of the present disclosure, the offset prefetch unit is further configured to: update the recent offset value table with the second offset prefetch value.

According to some embodiments of the present disclosure, in order to update the recent offset value table, the offset prefetch unit is further configured to: determine whether the second offset prefetch value exists in the recent offset value table; and, when the second offset prefetch value does not exist in the recent offset value table, update the recent offset value table according to one of the first-in-first-out algorithm, the least-recently-used algorithm, or the pseudo-least-recently-used algorithm.

According to some embodiments of the present disclosure, the offset prefetcher is a best-offset prefetcher, and the K offset prefetch values are K best offset prefetch values.

According to some embodiments of the present disclosure, when the offset prefetcher is an offset prefetcher for the level-L cache, in order to perform data prefetching based on the first offset prefetch value, the offset prefetch unit is further configured to: determine the prefetch request address composed of the first offset prefetch value and the requested address; determine whether the data corresponding to the prefetch request address exists in the level-L cache; and, when the data corresponding to the prefetch request address does not exist in the level-L cache, trigger a prefetch request corresponding to the prefetch request address, where L is an integer greater than 1.

According to yet another aspect of the present disclosure, a computing device is provided, including: a processor; and a memory, wherein computer-readable code is stored in the memory and, when run by the processor, performs the offset prefetching method described above.

According to yet another aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, having instructions stored thereon which, when executed by a processor, cause the processor to perform the offset prefetching method described above.

With the offset prefetching method, the apparatus for performing offset prefetching, the computing device, and the medium provided by embodiments of the present disclosure, the K offset prefetch values most recently selected in time by the offset prefetcher based on the preset offset value table are recorded to construct a recent offset value table, so that the offset prefetcher can use the constructed recent offset value table to generate offset prefetch values, reducing the time the offset prefetcher needs to generate an offset prefetch value, thereby improving the efficiency of the offset prefetcher and further improving the operating efficiency of the CPU system.
Brief Description of the Drawings

To explain the technical solutions of the embodiments of the present disclosure or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present disclosure; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a schematic diagram of a multi-level cache architecture including prefetchers;

FIG. 2 is a schematic flowchart of an offset prefetching method according to an embodiment of the present disclosure;

FIG. 3 is an overall schematic diagram of selecting an offset prefetch value from a preset offset value table with an offset prefetcher according to an embodiment of the present disclosure;

FIG. 4 is a schematic block diagram of selecting an offset prefetch value from a preset offset value table with an offset prefetcher according to an embodiment of the present disclosure;

FIG. 5 is another schematic flowchart of selecting an offset prefetch value from a preset offset value table with an offset prefetcher according to an embodiment of the present disclosure;

FIG. 6 is an overall schematic diagram of alternately performing data prefetching with an offset prefetcher based on a preset offset value table and a recent offset value table according to an embodiment of the present disclosure;

FIG. 7 is an execution flowchart of alternately performing data prefetching with an offset prefetcher based on a preset offset value table and a recent offset value table according to an embodiment of the present disclosure;

FIG. 8 is another execution flowchart of performing data prefetching with an offset prefetcher based on a preset offset value table and a recent offset value table according to an embodiment of the present disclosure;

FIG. 9 is a schematic block diagram of an apparatus for performing offset prefetching according to an embodiment of the present disclosure;

FIG. 10 is a schematic block diagram of a computing device according to an embodiment of the present disclosure;

FIG. 11 is a schematic diagram of a non-transitory computer-readable storage medium according to an embodiment of the present disclosure.
Detailed Description

The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present disclosure.

Furthermore, as used in the present disclosure and the claims, unless the context clearly indicates an exception, the words "a", "an", "one", and/or "the" do not specifically denote the singular and may also include the plural. The words "first", "second", and similar terms used in the present disclosure do not denote any order, quantity, or importance, but are used only to distinguish different components. Likewise, words such as "include" or "comprise" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connect" or "connected" are not limited to physical or mechanical connections but may include electrical connections, whether direct or indirect.
Because the operating frequency of a CPU core is far higher than that of DRAM memory, fetching data and program instructions directly from memory requires the CPU to wait hundreds of clock cycles. To avoid the latency of direct memory access, a cache is usually provided. A cache is a high-speed memory with fast data access that exchanges data with the CPU ahead of the main memory; the cache enables the computer system to achieve higher performance.

When the CPU needs to read data, it first looks in the cache. If the data is in the cache, it is read immediately and sent to the CPU for processing. If the data is not in the cache, it is read from the relatively slower memory and sent to the CPU, while the data block containing that data is loaded into the cache, so that subsequent reads of the whole block are served from the cache without accessing the memory again. This read mechanism gives CPU cache reads a high hit rate: the data the CPU will read next is, with high probability, already in the cache, and only a small portion needs to be read from memory. This greatly saves the time of direct memory reads and means the CPU rarely needs to wait when reading data. In short, the CPU reads from the cache first and from memory second.

FIG. 1 shows a schematic diagram of a multi-level cache architecture including prefetchers. As shown in FIG. 1, in a multi-level cache architecture the CPU first accesses the level-1 (L1) cache, which has the fastest access but the smallest capacity; the level-2 (L2) cache is slower than L1 but larger; there may also be a third or higher level of cache. Generally, the last-level cache (LLC) has the largest capacity but the slowest access. For example, in the three-level cache architecture shown in FIG. 1, the LLC is the L3 cache: the CPU first looks for data in the L1 cache and, if it is not found, searches downward level by level until the memory, after which the accessed data is returned level by level to the CPU.

In addition, prefetchers are provided in the multi-level cache architecture to discover the patterns in which the CPU accesses data and thereby prefetch soon-to-be-accessed data and program instructions into the cache. If the prefetched content is program instructions, it is called an instruction prefetcher; if the content is data, a data prefetcher. By target cache level, data prefetchers can be further divided into L1 prefetchers (prefetching data into the L1 cache), L2 prefetchers (into the L2 cache), LLC prefetchers (into the last-level cache), and so on. The L2 cache shown in FIG. 1 has an L2 prefetcher that prefetches data into the L2 cache by generating prefetch requests. The L2 cache can also verify a prefetch request, treating it like a non-prefetch data request: if the data corresponding to the prefetch request already exists in the L2 cache, the L2 cache discards the request, i.e., it does not prefetch from the next-level cache/memory; if the data does not exist in the L2 cache, the prefetch request is sent to the lower-level cache/memory to prefetch the data into the L2 cache.

Data prefetching is one of the key techniques for improving CPU efficiency. Because the cache can only hold recently accessed data, when the CPU needs data that has never been accessed or that has been evicted due to cache-size limits, it still has to wait tens or even hundreds of clock cycles to read it from memory, causing performance loss. A data prefetcher analyzes past data-access patterns to prefetch soon-to-be-used data into the cache in advance, reducing the cycles the CPU spends waiting for data and improving overall CPU performance.

The offset prefetcher is a widely used class of data prefetcher. Based on the overall data-access pattern, it discovers the offset prefetch value used to compose prefetch addresses and generates prefetch requests from the discovered value. The prefetch address of a request equals the offset prefetch value plus the program address that triggered the prefetch. The process by which the offset prefetcher discovers the offset prefetch value can be called training. Whereas L2 or LLC prefetchers can only train with physical addresses (i.e., trigger prefetches and obtain offset prefetch values based on physical addresses), an L1 prefetcher usually trains with virtual addresses and sends prefetch requests composed of virtual addresses.

However, current offset prefetchers generally try to find the most suitable offset prefetch value from a preset group of values. To cover a reasonably complete range of offsets, this group is usually large, which lowers the efficiency of obtaining an offset prefetch value and cannot meet the demands of high-speed CPU operation.

To solve the above technical problem, some embodiments of the present disclosure provide an offset prefetching method that improves the efficiency of the offset prefetcher by reducing the time needed to obtain an offset prefetch value, so that the CPU's demand for data prefetching can be met quickly and the operating efficiency of the CPU system further improved.

Specifically, FIG. 2 shows a schematic flowchart of an offset prefetching method according to an embodiment of the present disclosure; the method according to some embodiments of the present disclosure is described below with reference to FIG. 2. It can be understood that, besides CPUs, the application scenarios involved herein also cover other types of processors, such as graphics processing units (GPUs) and other processors, as long as they have a cache architecture and can implement an offset prefetcher; no limitation is made here. Unless otherwise defined, all terms used herein have the same meanings as commonly understood by a person of ordinary skill in the art to which the present disclosure belongs.
As shown in FIG. 2, first, in step S101, K offset prefetch values for generating prefetch requests are selected from the preset offset value table using the offset prefetcher. According to some embodiments of the present disclosure, this offset prefetcher may be one that prefetches data into any level of cache in a multi-level cache architecture, for example the L2 prefetcher shown in FIG. 1, or an L1 prefetcher or LLC prefetcher, etc.; no limitation is made here. The process of selecting an offset prefetch value is similar for prefetchers at different levels; herein, an L2 prefetcher is taken as the concrete example.

According to some embodiments of the present disclosure, the preset offset value table may include N preset offset values. As noted above, to cover a reasonably complete range of offsets, the preset offset value table generally contains a fairly large number of offset values, for example N = 52; the offset prefetcher selects the currently most suitable of these 52 values as the offset prefetch value, which is used to generate prefetch requests and thereby prefetch data into, for example, the L2 cache.

According to some embodiments of the present disclosure, the K offset prefetch values obtained with the offset prefetcher are those most recently selected in time by the offset prefetcher from the preset offset value table, and N is greater than K. The process by which the offset prefetcher obtains one offset prefetch value can be called one training phase: for example, the process of producing the first offset prefetch value can be called the first training phase, and similarly the process of producing the K-th offset prefetch value is called the K-th training phase. In the offset prefetcher, training and prefetching proceed continuously and in parallel; the end of one training phase is the start of the next. Each training phase ends by selecting one offset prefetch value. For example, after the first offset prefetch value is produced, it is used to generate prefetch requests for the following period of time; in parallel, the offset prefetcher runs the second training phase and selects the second offset prefetch value at its end, which is then used to generate prefetch requests for a period thereafter, until the third training phase selects a new value. The values selected in the various phases may be the same or different. The specific process by which the offset prefetcher selects an offset prefetch value from the preset offset value table for generating prefetch requests is described in detail below.

Next, as shown in FIG. 2, in step S102, the K offset prefetch values are recorded to form a recent offset value table including the K offset prefetch values. Then, in step S103, the offset prefetcher selects a first offset prefetch value from the recent offset value table for data prefetching based on the first offset prefetch value.

In some embodiments of the present disclosure, the K offset prefetch values most recently selected by the offset prefetcher from the preset offset value table form the recent offset value table, and the offset prefetch value for the next stage is produced based on this table. Since K is smaller than N, that is, the recent offset value table holds fewer offset values than the preset offset value table, the time the offset prefetcher needs to select an offset prefetch value from the recent offset value table is shortened. When K is much smaller than N this advantage is even clearer: as an example, K may equal 4 or 8, and compared with the time needed to select an offset prefetch value from N = 52 values, the time needed to select from K = 4 values is markedly lower. Thus the efficiency of the offset prefetcher is improved by reducing the time it needs to generate an offset prefetch value, enabling fast response to the CPU's demand for data prefetching and further improving the operating efficiency of the CPU system.

Furthermore, a given running phase of most programs often uses only a few offset prefetch values. Therefore, the K most recently produced offset prefetch values are usually broadly applicable within a given running phase. By forming the K offset prefetch values most recently produced in time into a Recent Offsets Table (ROT) and training on it to select the next stage's offset prefetch value, the prefetch results are usually fairly accurate; and because the ROT contains far fewer offset values than the Preset Offsets Table (POT), a training phase using the ROT is much faster than one using the POT.

In the offset prefetching method according to some embodiments of the present disclosure, data prefetching is realized through a simple, easy-to-implement recent offset value table, which effectively reduces the time consumed by the training phase while ensuring prefetch accuracy, thereby further improving the overall performance of the corresponding CPU. Moreover, the proposed scheme avoids impact on the architecture of existing prefetchers and CPUs, facilitating wide application.
以下将描述利用偏移预取器从预置偏移值表格中选择偏移预取值的具体实现过程,可以理解的是,利用偏移预取器从预置偏移值表格中选择近期偏移值表格的过程是类似的。
图3示出了根据本公开实施例的利用偏移预取器从预置偏移值表格中选择偏移预取值的整体示意图。如图3所示，在偏移预取器中，训练过程与预取过程是持续、并行地进行的，例如，训练阶段1的结束将产生偏移预取值D1，以用于进行数据预取，在利用D1进行数据预取的过程中，偏移预取器同步进入训练阶段2，并产生新的偏移预取值D2，以用于接下来一段时间的数据预取，以此类推。
在每个训练阶段中,将包括多个训练轮次,例如轮次1、轮次2直至最大数目轮次(Round Max,RMax)。RMax的具体数值是可以设置的,例如,RMax=100。在每个训练轮次中,偏移预取器都将对所使用的表格中包括的每个偏移值进行测试,若测试通过,则该被测偏移值的分数增加,例如分数加1,以根据每个偏移值各自的分数来从中选择出最优的偏移值,例如分数最高的偏移值。例如,在所使用的表格为预置偏移值表格的情况下,其中将包括N个预置偏移值,例如,N=52。
在训练阶段1结束时，也就是完成第RMax轮次的测试，或者某一偏移值的分数值大于或等于预先设置的最大分数值之后，将分数最高的偏移值或者该具有最大分数值的相应偏移值确定为最佳偏移值，即，作为进行数据预取的偏移预取值D1。此外，如果测试得到的偏移值的分数过小，例如小于或等于预先设置的最小分数值，则还可以将产生的最佳偏移值设置为零，也就意味着此训练阶段将不会发送预取请求。换句话说，如果测试产生的偏移预取值的分数较低，则表示其预取准确性很可能较低，该数据被用到的可能性也较低，由此不产生预取请求，避免不必要的数据占用缓存空间。由此，图3中示出的通过测试分数来产生最佳偏移值(例如，分数最高的偏移值)的偏移预取器也可以称为最佳偏移预取器。在根据本公开的实施例中，例如步骤S101中提及的偏移预取器可以是以上基于分数选择最佳偏移值的最佳偏移预取器，在此种情况下，上述K个偏移预取值可以表示为K个最佳偏移预取值。然而，可以理解的是，本公开中涉及的偏移预取器还可以是其他类型的预取器，在此不作限制，根据本公开实施例的方法对其他的偏移预取器设计同样适用。
图4示出了根据本公开实施例的利用偏移预取器从预置偏移值表格中选择偏移预取值的示意性框图。在图4的示例中,偏移预取器为L2预取器,即,接收来自L1缓存的请求以进行预取训练,然后产生将数据预取至L2缓存的预取请求。
在缓存架构中，如果请求的数据在缓存中，则表示为命中(Hit)，如果未存在于缓存中，则表示为未命中(Miss)。如图4所示，首先，L2缓存接收来自L1缓存的数据请求，其中包括地址X，由于图4中示出的为L2预取器，则地址X为物理地址。用于进行训练的物理地址X可以是来自L1缓存的任意地址，即，L2缓存访问的所有地址，也可以仅是L2缓存未命中地址(例如，表示为L2Miss)，即，该物理地址X对应的数据未存在于L2缓存中，判断地址X对应的数据是否存在于L2缓存的过程在图4中示出为“X?”。在本文中，将以进行训练的地址为L2缓存未命中地址(即，L2Miss)作为示例进行描述，但是并不对其进行限制。
接着，如图4所示，对于偏移值表格中的各个偏移值逐个地进行测试，该偏移值表格中包括若干个偏移值，例如可以是上述预置偏移值表格，也可以是上述近期偏移值表格。对于当前进行测试的偏移值d，将测试地址X减去该偏移值，并判断X-d是否命中最近请求表(Recent Request table，RRT)，如果命中则偏移值d的分数加1，例如可以通过设置相应的计数器来累计各个偏移值在训练阶段的分数。其中，RRT可以是一个单列的表格，用于保存最近返回的来自L3应答的物理地址的哈希值(例如，可以用哈希函数H2生成该哈希值H2(Y-D))。此外，该表格使用该物理地址的另一个哈希值(例如，可以用哈希函数H1生成该哈希值H1(Y-D))进行索引。此外，在被存储在RRT表格之前，L3缓存返回的物理地址Y还可以先填充在队列中，这是为了使得最近返回的物理地址不会立即被推送到RRT中，而是需要在队列中等待。通过这种方式，可以使得RRT中存储近期而又不是特别近被访问的地址。
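上述RRT及其前置队列的工作方式可以用如下Python草图示意(仅为说明原理的假设性实现，其中哈希函数H1/H2的具体形式、表格大小与队列深度均为示例假设，并非本公开限定的设计)：

```python
# 示意性的 RRT（Recent Request Table）实现草图，仅用于说明原理。
# 哈希函数 _h1/_h2、表格大小 num_entries、队列深度 delay 均为假设值。

class RecentRequestTable:
    def __init__(self, num_entries=256, delay=4):
        self.entries = [None] * num_entries      # 单列表格，保存地址的哈希值 H2
        self.num_entries = num_entries
        self.delay_queue = []                    # L3 应答地址先进入队列，延迟入表
        self.delay = delay

    def _h1(self, addr):                         # 索引哈希（假设的简单哈希）
        return (addr ^ (addr >> 8)) % self.num_entries

    def _h2(self, addr):                         # 存储哈希（假设的简单哈希）
        return (addr >> 2) & 0xFFFF

    def push(self, addr):
        """L3 返回的地址 Y 先进入队列，稍后才真正写入 RRT。"""
        self.delay_queue.append(addr)
        if len(self.delay_queue) > self.delay:
            old = self.delay_queue.pop(0)
            self.entries[self._h1(old)] = self._h2(old)

    def hit(self, addr):
        """测试 X - d 是否命中 RRT。"""
        return self.entries[self._h1(addr)] == self._h2(addr)
```

通过延迟队列入表，RRT中保存的是“近期而又不是特别近”被访问过的地址，与上文描述一致。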
如上所述的，在经过RMax轮次或者某一偏移值的分数达到预设的最大分数的情况下，将产生最佳偏移值D。以预置偏移值表格中包括N=52个偏移值为例，在进行了RMax=100个轮次的测试的情况下，将需要N*RMax=5200个L2Miss地址来完成测试过程，并从中确定出分数最高的偏移值作为最佳偏移值D。
对于得到的地址X+D，将首先去往L2缓存进行判断：地址X+D对应的数据是否已经存在于L2缓存中，如果不存在，则基于地址X+D生成去往L3缓存的预取请求，以将数据预取至L2缓存。
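作为示意，以下Python草图展示了基于地址X+D发出预取请求前对L2缓存的校验逻辑(其中用集合模拟L2缓存中已有的缓存行地址，issue回调为假设的请求发送接口，均为说明性假设)：

```python
# 示意草图：基于最佳偏移值 D 生成预取请求前，先检查 X+D 是否已在 L2 缓存中。

def maybe_issue_prefetch(x, d, l2_cache, issue):
    """x: 触发预取的地址; d: 最佳偏移值; l2_cache: 模拟 L2 内容的集合;
    issue: 发送预取请求的回调。返回是否实际发出了预取请求。"""
    target = x + d
    if target in l2_cache:
        return False           # 数据已在 L2 中，丢弃该预取请求
    issue(target)              # 否则向 L3 缓存发送预取请求，将数据预取至 L2
    return True
```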
由此可以看出，在偏移值表格中包括的偏移值数量较多的情况下，测试时间也明显增加，这会造成训练时间过长，对程序行为改变响应不及时。但是在预置偏移值的数目较少的情况下，也会由于实际的最佳偏移值不在预置偏移值表格中，造成只能选出次优或者不佳的偏移值，例如选出的最佳偏移值的分数较低。相比较地，在偏移值表格为根据本公开一些实施例生成的近期偏移值表格的情况下，测试时间由于待测偏移值数目降低而将明显减少，并且，由于其中包括的偏移值为在时间上最新选出的偏移预取值，保证了从其中选出的偏移预取值的准确性，在节省训练时间的同时保证预取准确性。
为了更详细地介绍偏移预取器基于偏移值表格选择最佳偏移值的过程，图5示出了根据本公开实施例的利用偏移预取器从预置偏移值表格中选择偏移预取值的另一流程示意图。作为示例，图5中示出的偏移预取器可以是图1中的L2预取器，其可以基于L2Miss地址进行训练，并将数据预取至L2缓存。
首先,对在图5中涉及的常量进行描述。
ScoreMax(SMax):最大分数值,当一个偏移值的分数值达到SMax,则直接结束当前训练阶段,并选择该偏移值为最佳偏移值;
ScoreMin(SMin):最小分数值,当训练阶段结束后,例如,完成了100轮测试,如果确定的最佳偏移值D的分数小于SMin,则可以将D设置为等于0,并且不会发送预取请求;
Round Max(RMax):最大测试轮数;
Offsets:预置的多个偏移值，最佳偏移值将从这其中选出。
Number of Offsets(NO):偏移值表格中包括的偏移值的总数目，例如，在训练阶段使用的表格为预置偏移值表格的情况下，NO=N，例如52；在训练阶段使用的表格为近期偏移值表格的情况下，NO=K，例如4或8。
可以理解的是,以上列出的常量可以是由系统配置的值,例如,最大测试轮次RMax可以设置为等于100或者其他合适的值,在此不作限制。
接着,对在图5中涉及的变量进行描述,这些变量用于存储训练阶段的相关信息。
D:选出的最佳偏移值;
S:D对应的训练分数值;
Round(R):记录当前测试轮数;
Scores:当前测试的偏移值的分数;
P:Offsets及Scores的指针,其中,Offsets[P]表示当前测试的偏移值,Scores[P]表示该当前测试的偏移值所对应的分数;
BestScore(BS):目前最高的测试分数;
BestOffset(BO):BS所对应的偏移值。
如图5示出,其中示出的流程可以对应于一个训练阶段。训练阶段的开始将对变量进行初始化,例如,设置为R=1,BS=0以及P=1。
从R=1轮次开始，对于表格中的偏移值，以L2Miss地址X进行测试。首先，在S501，得到测试地址Z=X-Offsets[P]，然后，在S502，判断Z是否命中RRT，如果命中RRT，则进行S503，该偏移值Offsets[P]的对应分数Scores[P]增加，例如加1，即，Scores[P]++。如果Z未命中RRT，则进入S507。
接着,在S504,判断Scores[P]是否大于目前最高的测试分数BS,如果确定大于BS,则进入S505,即,将当前偏移值Offsets[P]的分数Scores[P]确定为新的BS,即,BS=Scores[P],并且,将该当前偏移值Offsets[P]确定为新的当前最佳偏移值,即,BO=Offsets[P]。否则,将进入S507。
在S506,判断BS是否大于等于预先设置的最大分数值SMax,如果是,则表明该偏移值Offsets[P]已经完成测试,也就是说其符合分数要求,可以直接结束当前测试阶段,并将该偏移值Offsets[P]确定为最佳偏移值,即进入S5013。
如果BS小于预先设置的最大分数值SMax，则进入S507，判断指针P是否对应于偏移值的总数目NO，如果否，则表示当前偏移值不是表格中的最后一个偏移值，这意味着当前轮次(例如，R=1)的测试并未结束，则进入S508，对指针P加1，然后返回S501，针对表格中的下一个偏移值进行测试。否则，即，P=NO，则表示当前偏移值为表格中的最后一个偏移值，意味着完成了当前轮次的测试，即，进入S509，判断当前轮次R是否小于最大轮次RMax，如果是，则进入S5010，对当前轮次R加1并将指针P重置为1，以返回S501，进入新一轮次的测试。
接着,如图5所示,如果R不小于RMax,则表示当前轮次为最大轮次, 即完成当前训练阶段。在S5011,判断得到的BS是否大于预先设置的最小分数值SMin,作为示例,SMin可以设置为等于1。如果BS不大于SMin,则进入S5012,可以将D设置为等于0,并且不会发送预取请求,即禁止基于分数值较低的偏移值进行数据预取,避免不必要的数据占用缓存空间。
如果BS大于SMin,则进入S5013,将偏移值BO确定为当前训练阶段选出的最佳偏移值D,并将该偏移值BO的分数BS赋值给D的分数S。
由此,经过以上步骤S501-S5013,能够产生最佳偏移值D。该最佳偏移值D将在接下来一段时间内用于数据预取,预取地址等于该偏移值D加上触发该预取的物理地址X。
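上述S501-S5013的训练阶段流程可以概括为如下Python草图(非权威实现：rrt_hit抽象了“Z=X-Offsets[P]是否命中RRT”的判断，miss_addrs模拟依次到来的L2Miss地址流，SMax/SMin/RMax的取值仅为示例假设)：

```python
# 图5训练阶段的一个可运行的 Python 草图（示意性实现，非本公开的权威定义）。

def train_phase(offsets, miss_addrs, rrt_hit, smax=31, smin=1, rmax=100):
    """返回 (最佳偏移值 D, 其分数 S)；分数过低时返回 (0, 0) 表示不发送预取请求。"""
    scores = [0] * len(offsets)
    best_score, best_offset = 0, 0
    addr_iter = iter(miss_addrs)
    exhausted = False
    for r in range(1, rmax + 1):                 # 轮次 R = 1..RMax
        for p, off in enumerate(offsets):        # 指针 P 遍历表格中的各偏移值
            x = next(addr_iter, None)
            if x is None:                        # 测试地址流耗尽，提前结束
                exhausted = True
                break
            if rrt_hit(x - off):                 # S502: Z = X - Offsets[P] 命中 RRT
                scores[p] += 1                   # S503: 对应分数加 1
                if scores[p] > best_score:       # S504/S505: 更新 BS 与 BO
                    best_score, best_offset = scores[p], off
                if best_score >= smax:           # S506: 达到最大分数，直接结束
                    return best_offset, best_score
        if exhausted:
            break
    if best_score <= smin:                       # S5011/S5012: 分数过低
        return 0, 0                              # D=0，不发送预取请求
    return best_offset, best_score               # S5013: 确定最佳偏移值 D 及分数 S
```

例如，当地址流使得偏移值4总是命中RRT时，该函数将在分数达到SMax时提前返回偏移值4。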
根据本公开的一些实施例，在偏移预取器为用于第L级缓存的偏移预取器的情况下，基于第一偏移预取值进行数据预取包括：确定由第一偏移预取值与请求的地址组成的预取请求地址；确定与预取请求地址对应的数据是否存在于第L级缓存中；以及在确定与预取请求地址对应的数据不存在于第L级缓存中的情况下，触发与预取请求地址对应的预取请求，其中，L为大于1的整数。以偏移预取器为用于第2级缓存的偏移预取器(即L2预取器)为例，在产生最佳偏移值D1之后，L2缓存还可以对基于该偏移值D1的预取请求进行校验，即，如果基于偏移值D1的预取请求所对应的数据已经存在于L2缓存中，则L2缓存将丢弃该预取请求，即不对下一级缓存/内存进行预取；如果预取请求所对应的数据并未存在于L2缓存中，则将该预取请求发送至下级缓存/内存以将数据预取至L2缓存。在基于当前最佳偏移值D1进行预取的过程中，L2预取器将按照图5所示的过程进行下一训练阶段，以产生新的最佳偏移值(例如表示为D2)。该新的最佳偏移值D2也将用于后续一段时间内的数据预取，直到产生下一个新的最佳偏移值D3，以此类推。
此外，可以理解的是，由于L2预取器是基于物理地址进行训练，由此，系统还需要检验触发预取请求的地址X与X+D是否在同一内存页面中，如果不在同一页面中，则不发送基于X+D的预取请求，以避免请求的地址超出该物理页面的访问范围。
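地址X与X+D是否位于同一内存页面的检验可以示意如下(这里假设页大小为4KB，即页内偏移为低12位；实际页大小取决于系统配置，属于示例假设)：

```python
# 示意草图：检查触发地址 X 与预取地址 X+D 是否位于同一物理页。
PAGE_SHIFT = 12  # 假设 4KB 页

def same_page(x, d, page_shift=PAGE_SHIFT):
    # 去掉页内偏移后比较页号，页号相同则在同一页中
    return (x >> page_shift) == ((x + d) >> page_shift)
```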
以上以L2预取器为例,结合图3-图5详细描述了L2预取器进行数据预取的实现过程,可以理解的是,L2预取器基于预置偏移值表格或者近期偏移 值表格进行数据预取的过程是类似的。
根据本公开的一些实施例,偏移预取方法还可以包括:利用第二偏移预取值来更新近期偏移值表格。其中,更新近期偏移值表格包括:确定第二偏移预取值是否存在于近期偏移值表格之中;以及在确定第二偏移预取值不存在于近期偏移值表格之中的情况下,按照先入先出算法、最近最少使用算法或伪最近最少使用算法中的其中一种来更新近期偏移值表格。
为了维护近期偏移值表格，在确定得到第二偏移预取值D2之后，还可以基于该偏移值D2对表格ROT进行更新，如果偏移值D2不在ROT中，则可以将D2插入到ROT中。此外，如果ROT中包括的偏移值数目已达到K个，则可以根据替换策略替换掉ROT中的某个偏移值。替换策略可以是先入先出规则，或者也可以是基于其他的替换规则，例如最近最少使用(Least Recently Used，LRU)规则或者伪最近最少使用(Pseudo Least Recently Used，PLRU)规则等。作为示例，在替换策略为LRU规则的情况下，对于包括偏移值D的ROT，还可以更新与偏移值D对应的LRU信息，LRU信息用于执行针对偏移值的更新，例如，可以是偏移值的使用次数等。
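作为示意，以下草图给出了按LRU规则维护ROT的一种可能实现(使用Python的OrderedDict仅为说明，容量K与替换策略均可按实际设计调整，属于示例假设)：

```python
# ROT（近期偏移值表格）维护草图：容量为 K；偏移值已存在时更新其 LRU 位置，
# 不存在且表已满时替换最久未使用的表项。
from collections import OrderedDict

class RecentOffsetsTable:
    def __init__(self, k=4):
        self.k = k
        self.table = OrderedDict()     # 键为偏移值，键的顺序即 LRU 信息

    def update(self, d):
        """用新产生的最佳偏移值 D 更新 ROT。"""
        if d in self.table:
            self.table.move_to_end(d)  # D 已存在：仅更新 LRU 信息
        else:
            if len(self.table) >= self.k:
                self.table.popitem(last=False)  # 替换最久未使用的偏移值
            self.table[d] = True

    def offsets(self):
        return list(self.table)        # 按从旧到新的顺序返回偏移值
```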
在根据本公开的一些实施例中,为了平衡预取器训练阶段所需的时间以及预取准确性,能够根据规则来选择表格,并基于选择的表格进行训练。
根据本公开的一些实施例,可以将步骤S101中选择第一偏移预取值D1的过程表示为第一训练阶段,由此,对于选择第二偏移预取值D2的第二训练阶段,可以基于预定规则来确定是采用预置偏移值表格还是采用近期偏移值表格来选择该第二偏移预取值D2。
根据本公开的一些实施例,偏移预取方法还可以包括:在确定采用预置偏移值表格来选择第二偏移预取值的情况下,利用偏移预取器来从预置偏移值表格中选择第二偏移预取值,并基于第二偏移预取值来进行数据预取;或者在确定采用近期偏移值表格来选择第二偏移预取值的情况下,利用偏移预取器来从近期偏移值表格中选择第二偏移预取值,并基于第二偏移预取值来进行数据预取。基于预置偏移值表格或者近期偏移值表格来选择第二偏移预取值的过程可以参照以上结合图3-图5进行的描述。
根据本公开的一些实施例，上述规则可以是指按照交替的方式来采用预置偏移值表格和近期偏移值表格来选择第二偏移预取值。该交替方式可以是指预置偏移值表格和近期偏移值表格逐个地交替被使用，也就是说，在第一训练阶段采用近期偏移值表格选择第一偏移预取值D1的情况下，则在第二训练阶段采用预置偏移值表格来选择第二偏移预取值D2，而后，将在第三训练阶段再次采用近期偏移值表格选择第三偏移预取值D3，依次类推。
图6示出了根据本公开实施例的利用偏移预取器基于预置偏移值表格和近期偏移值表格交替进行数据预取的整体示意图。在图6所示出的场景中,已经基于预置偏移值表格至少进行了K次训练阶段,即,已经形成了包括预定数目K个偏移预取值的近期偏移值表格。在图6中示意性地示出了交替使用预置偏移值表格和近期偏移值表格的过程。
作为示例,在图6中将基于预置偏移值表格进行的训练阶段称为全训练阶段,并将基于近期偏移值表格进行的训练阶段称为快训练阶段,在图6中,首先基于预置偏移值表格进行全训练阶段,例如,依次进行RMax个轮次,并且每个轮次逐个对预置偏移值表格中的N个偏移值进行测试,以得到各个偏移值的累计分数,并基于分数确定最佳偏移值D1。接着,进入快训练阶段,即基于近期偏移值表格来进行测试,并在每个轮次逐个对近期偏移值表格中的K个偏移值进行测试,以得到各个偏移值的累计分数,然后基于分数确定最佳偏移值D2。下一步将再次进入全训练阶段,即,全训练阶段与快训练阶段交替进行。
图7示出了根据本公开实施例的利用偏移预取器基于预置偏移值表格和近期偏移值表格交替进行数据预取的执行流程图。如图7所示，首先进行初始化，例如，可以将表示全训练阶段(Full Training，FT)的变量FT设置为真，作为FT的初始值。在S701，判断近期偏移值表格ROT中包括的偏移值数目是否大于等于K，如果否，则表示还未形成近期偏移值表格ROT，则可以直接将FT设置为等于真(即，S707)。如果ROT>=K，则进入S702，确定FT是否为真，如果FT为真则指示选择预置偏移值表格POT进行训练(S708)。否则，即，FT为假，则进入S703，选择近期偏移值表格ROT进行训练。
在S704,预取器按照选择的表格进行训练,并产生最佳偏移值D。在S705,对ROT进行更新,例如,如果产生的最佳偏移D不在ROT中,则将该偏移值D插入ROT中。若ROT中包括的偏移值数目已达到K个,则可以 根据替换策略(例如,LRU)替换掉ROT中的某个偏移值。此外,如果D已经存在于ROT中,根据替换策略,也有可能更新与偏移值D对应的LRU信息,LRU信息用于执行针对偏移值的更新。
接着，在S706，对FT进行翻转，例如，在当前FT为真(例如表示为“0”)的情况下，在此步骤FT翻转为假(例如表示为“1”)，然后返回至S701，重新确定用于进行训练的表格。
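图7中全训练与快训练交替的整体流程可以用如下Python草图示意(train为抽象的训练函数，pot/rot分别表示两张表格，ROT在此用简化的FIFO替换，表格接口与参数均为说明性假设)：

```python
# 图7流程的简化草图：ROT 尚未形成（表项数小于 K）时始终选择 POT 全训练；
# 形成后全训练与快训练逐个交替，每个阶段结束时用产生的最佳偏移值 D 更新 ROT。

def alternating_training(pot, rot, k, train, phases):
    """依次进行 phases 个训练阶段，返回每个阶段实际使用的表格名称序列。"""
    ft = True                                  # 初始化：FT 为真
    used = []
    for _ in range(phases):
        if len(rot) < k or ft:                 # S701/S707/S708: 选择 POT
            table, name = pot, "POT"
        else:                                  # S703: FT 为假，选择 ROT
            table, name = rot, "ROT"
        d = train(table)                       # S704: 按所选表格训练出最佳偏移值 D
        if d not in rot:                       # S705: 更新 ROT
            if len(rot) >= k:
                rot.pop(0)                     # 简化的 FIFO 替换
            rot.append(d)
        ft = not ft                            # S706: 翻转 FT
        used.append(name)
    return used
```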
除了以上基于图6和图7描述的交替方法,在根据本公开的另一些实施例中,上述规则也可以是确定第一偏移预取值D1的训练分数S1是否大于分数阈值,在训练分数大于分数阈值的情况下,确定采用近期偏移值表格来选择第二偏移预取值。在这些实施例中,上述表格并非逐个交替进行的,而是基于产生的偏移预取值的分数来进行判断,如果D1的训练分数S1大于分数阈值,这说明基于近期偏移值表格产生的偏移值比较准确,符合预取要求,由此,可以继续基于该近期偏移值表格来进行训练。关于分数阈值可以由系统预先设置,作为示例,例如可以将分数阈值(Score Threshold,ST)设置为最高分数值SMax乘以一个系数,例如ST=SMax*0.8,或者可以将ST设置为其他数值。
图8示出了根据本公开实施例的利用偏移预取器基于预置偏移值表格和近期偏移值表格进行数据预取的另一执行流程图。在图8所示的执行流程中，步骤S801-S808与图7中示出的步骤S701-S708的过程类似，不再重复描述。对于图8中的执行流程，增加了步骤S809：在更新了ROT之后，判断FT是否为假并且是否满足特定条件，其中，特定条件例如可以是D1的训练分数S1是否大于分数阈值，即，BS>ST。如果FT为假，则意味着当前训练阶段为快训练阶段，特定条件的目的是判断是否继续使用快训练，例如判断BS是否大于分数阈值ST。如果满足这两个条件，则可以返回至S801，此时在S802的判断中由于FT为假则进入S803，即选择ROT进行训练。此外，在S809中，还可以设置连续进行快训练的最多次数，在达到该次数的情况下则强制进入S806，以使得FT翻转为真。
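图8中决定下一训练阶段是否继续快训练的判定逻辑可以示意如下(阈值st与连续快训练次数上限max_fast的取值均为示例假设)：

```python
# 图8判定逻辑草图：快训练结束后，若最佳分数 BS 大于阈值 ST 且连续快训练
# 次数未超过上限，则继续快训练；否则 FT 翻转，回到全训练。

def decide_next(cur_fast, best_score, st, fast_count, max_fast):
    """cur_fast: 当前阶段是否为快训练; fast_count: 连续快训练计数。
    返回 (下一阶段是否为快训练, 更新后的连续快训练计数)。"""
    if cur_fast:
        if best_score > st and fast_count < max_fast:
            return True, fast_count + 1        # S809: 继续使用 ROT 快训练
        return False, 0                        # S806: 强制回到全训练
    return True, 1                             # 全训练结束后按正常流程翻转为快训练
```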
在根据本公开一些实施例中，由于大多数程序的某一个运行阶段往往仅有很少的几个最佳偏移值，因此最近产生的最佳偏移值往往是最近使用过的这些最佳偏移值之一，这使得即使近期偏移值表格中包括的偏移值数目较少，从中选出的偏移值也能够保证准确性。本公开一些实施例提供的方法正是基于对于CPU程序运行以及数据规律的这一洞见。利用根据本公开一些实施例的偏移预取方法，能够形成具有较少偏移值的近期偏移值表格，以基于预置偏移值表格和近期偏移值表格进行偏移预取器的训练阶段，由此，通过降低偏移预取器生成偏移预取值所需的时间来提高偏移预取器的效率，能够快速地响应CPU对于数据预取的需求，进一步提升CPU系统的运行效率。
根据本公开的另一方面,还提供了一种执行偏移预取的装置。图9示出了根据本公开实施例的执行偏移预取的装置的示意性框图。
如图9所示,执行偏移预取的装置1000可以包括偏移预取单元1010。
根据本公开的一些实施例,执行偏移预取的装置1000中的偏移预取单元1010可以配置成执行以下步骤:利用偏移预取器从预置偏移值表格中选择用于生成预取请求的K个偏移预取值,其中,预置偏移值表格包括预先设置的N个偏移值,K个偏移预取值为偏移预取器从预置偏移值表格中在时间上最新选择的偏移预取值,其中,N和K为正整数,N大于K;记录K个偏移预取值,用于形成包括K个偏移预取值的近期偏移值表格;以及利用偏移预取器从近期偏移值表格中选择第一偏移预取值,用于基于第一偏移预取值进行数据预取。
根据本公开的一些实施例,该偏移预取器可以是将数据预取至多级缓存架构中的任意一级缓存的偏移预取器,例如可以是图1中示出的L2预取器,或者也可以是诸如L1预取器或LLC预取器等,在此不作限制,针对不同级别的预取器,其选择偏移预取值的过程是类似的。
根据本公开的一些实施例,预置偏移值表格可以包括预先设置的N个偏移值。如上所述,为了覆盖较完整的偏移值范围,预置偏移值表格中包括的偏移值的数目一般较多,例如,N=52个偏移值,偏移预取器从这52个偏移值中选择当前最合适的偏移值作为偏移预取值,该选择的偏移预取值用于生成预取请求从而将数据预取至诸如L2缓存。
根据本公开的一些实施例,利用偏移预取器得到的K个偏移预取值为偏移预取器从预置偏移值表格中在时间上最新选择的偏移预取值,并且N大于K。可以将偏移预取器得到一个偏移预取值的过程称为一个训练阶段,例如, 可以将产生第一个偏移预取值的过程称为第一个训练阶段,类似地,将产生第K个偏移预取值的过程称为第K个训练阶段。在偏移预取器中,训练过程与预取过程是持续、并行地进行的,一次训练阶段的结束就是下一次训练阶段的开始。一个训练阶段的结束将会选择一个偏移预取值,例如,在产生第一偏移预取值之后,该第一偏移预取值被用来生成接下来一段时间的预取请求,同步地,偏移预取器将进行第二训练阶段,并在第二训练阶段结束选择出第二偏移预取值,此后一段时间内,该第二偏移预取值将被用来生成预取请求,直至第三训练阶段选出新的偏移预取值。其中,各个阶段选出的偏移预取值可以相同,也可能不相同。
在根据本公开的一些实施例中,利用偏移预取器从预置偏移值表格中最近选择的K个偏移预取值来组成近期偏移值表格,并基于该近期偏移值表格来产出下一阶段的偏移预取值。由于K小于N,也就是说,近期偏移值表格中偏移值的数目小于预置偏移值表格中偏移值的数目,使得偏移预取器从近期偏移值表格选择偏移预取值的时间将被缩短。在K明显小于N的情况下,这一时间优势将更加明显,作为示例,K可以诸如等于4或8,相比于从N=52个偏移值中选择偏移预取值所需的时间,从K=4个偏移值中选择偏移预取值所需的时间将明显降低,由此能够实现通过降低偏移预取器生成偏移预取值所需的时间来提高偏移预取器的效率,从而能够快速地响应CPU对于数据预取的需求,进一步提升CPU系统的运行效率。
进一步地，大多数程序的某一个运行阶段往往仅用到很少的几个偏移预取值。因此，最新产生的K个偏移预取值往往针对某一个运行阶段普遍适用。由此，通过将在时间上最新产生的K个偏移预取值形成为近期偏移值表格(Recent Offsets Table，ROT)并从中训练选出下一阶段的偏移预取值，往往预取结果也会比较准确，并且由于ROT中包括的偏移值的数目远远小于预置偏移值表格(Preset Offsets Table，POT)中包括的偏移值的数目，由此使得利用ROT进行的训练阶段将会比利用POT进行的训练阶段快很多。
在根据本公开一些实施例的偏移预取的装置中,通过提供简便、易实现的近期偏移值表格来实现数据预取,能够在保证数据预取的准确性的同时有效降低训练阶段的时间消耗,从而进一步提升相应CPU的整体性能。并且,提出的方案避免对现有预取器及CPU的架构产生影响,便于广泛应用实现。
在根据本公开的一些实施例中,为便于描述,可以将选择第一偏移预取值的过程表示为第一训练阶段。偏移预取单元1010还可以配置成:对于选择第二偏移预取值的第二训练阶段,基于规则确定是采用预置偏移值表格还是采用近期偏移值表格来选择第二偏移预取值,其中,第二偏移预取值用于进行数据预取。
在偏移预取器中，训练过程与预取过程是持续、并行地进行的，例如，在第一训练阶段的结束将产生第一偏移预取值，以用于进行数据预取，在利用第一偏移预取值进行数据预取的过程中，偏移预取器同步进入第二训练阶段，并产生第二偏移预取值，以用于接下来一段时间的数据预取，以此类推。数据预取器通过多个轮次进行训练得到偏移预取值的过程可以参照上文结合附图的描述。
根据本公开的一些实施例,上述规则包括:按照交替方式来采用预置偏移值表格和近期偏移值表格来选择第二偏移预取值,其中,交替方式表示在第一训练阶段采用近期偏移值表格的情况下,在第二训练阶段采用预置偏移值表格来选择第二偏移预取值。
该交替方式可以是指预置偏移值表格和近期偏移值表格逐个地交替被使用，也就是说，在第一训练阶段采用近期偏移值表格选择第一偏移预取值D1的情况下，则在第二训练阶段采用预置偏移值表格来选择第二偏移预取值D2，而后，将在第三训练阶段再次采用近期偏移值表格选择第三偏移预取值D3，依次类推。
根据本公开的一些实施例,上述规则包括:确定第一偏移预取值的训练分数是否大于分数阈值,在训练分数大于分数阈值的情况下,确定采用近期偏移值表格来选择第二偏移预取值。在这些实施例中,近期偏移值表格和预置偏移值表格并非交替进行的,而是基于产生的偏移预取值的分数来进行判断,例如,如果D1的训练分数S1大于分数阈值,这说明基于近期偏移值表格产生的偏移值比较准确,符合预取要求,由此,可以继续基于该近期偏移值表格来进行训练。关于分数阈值可以由系统预先设置,作为示例,例如可以将分数阈值(Score Threshold,ST)设置为最高分数值SMax乘以一个系数,例如ST=SMax*0.8,或者可以将ST设置为其他数值。
根据本公开的一些实施例,偏移预取单元1010还可以配置成:在确定采 用预置偏移值表格来选择第二偏移预取值的情况下,利用偏移预取器来从预置偏移值表格中选择第二偏移预取值,并基于第二偏移预取值来进行数据预取;或者在确定采用近期偏移值表格来选择第二偏移预取值的情况下,利用偏移预取器来从近期偏移值表格中选择第二偏移预取值,并基于第二偏移预取值来进行数据预取。
根据本公开的一些实施例,偏移预取单元1010还可以配置成:利用第二偏移预取值来更新近期偏移值表格。根据本公开的一些实施例,为更新近期偏移值表格,偏移预取单元1010还配置成:确定第二偏移预取值是否存在于近期偏移值表格之中;以及在确定第二偏移预取值不存在于近期偏移值表格之中的情况下,按照先入先出算法、最近最少使用算法或伪最近最少使用算法中的其中一种来更新近期偏移值表格。
根据本公开的一些实施例,偏移预取器为最佳偏移预取器,K个偏移预取值为K个最佳偏移预取值。
根据本公开的一些实施例,在偏移预取器为用于第L级缓存的偏移预取器的情况下,为基于第一偏移预取值进行数据预取,偏移预取单元1010还配置为:确定由第一偏移预取值与请求的地址组成的预取请求地址;确定与预取请求地址对应的数据是否存在于第L级缓存中;以及在确定与预取请求地址对应的数据不存在于第L级缓存中的情况下,触发与预取请求地址对应的预取请求,其中,L为大于1的整数。
关于执行偏移预取的装置1000执行的步骤的具体实现过程可以参照以上结合附图描述的根据本公开的偏移预取方法,在此不再重复描述。
根据本公开的又一方面,还提供了一种计算设备。图10示出了根据本公开实施例的计算设备的示意性框图。
如图10所示,计算设备2000可以包括处理器2010以及存储器2020。根据本公开实施例,存储器2020中存储有计算机可读代码,该计算机可读代码当由处理器2010运行时,可以执行如上所述的偏移预取方法。
处理器2010可以根据存储在存储器2020中的程序执行各种动作和处理。具体地,处理器2010可以是一种集成电路,具有信号处理能力。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等,可以是X86架构或者是ARM架构等。例如,此处的处理器可以是指具有多级缓存架构的 CPU,并且,针对其中的多级缓存架构还设置有偏移预取器,该偏移预取器配置成能够执行根据本公开的一些实施例的偏移预取方法,以通过降低发现偏移预取值所需的时间来提高偏移预取器的效率,从而能够快速响应数据预取需求,进一步提升CPU的运行效率。
存储器2020存储有计算机可执行指令代码,该指令代码在被处理器2010执行时用于实现根据本公开实施例的偏移预取方法。存储器2020可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。应注意,本文描述的存储器可以是任何适合类型的存储器。作为示例,通过执行存储器2020中的计算机可执行指令代码,诸如CPU的处理器能够实现快速地数据预取,其中的偏移预取器能够基于保持的预置偏移值表格以及近期偏移值表格来快速地产生用于生成数据预取请求的偏移预取值,以使得CPU能够更高效地处理数据、运行程序指令,从而提高CPU的系统性能。
根据本公开的又一方面,还提供了一种非暂时性计算机可读存储介质。图11示出了根据本公开实施例的非暂时性计算机可读存储介质的示意图。
如图11所示,计算机可读存储介质3020上存储有指令,指令例如是计算机可读指令3010。当计算机可读指令3010由处理器运行时,可以执行参照以上附图描述的偏移预取方法。计算机可读存储介质包括但不限于例如易失性存储器和/或非易失性存储器。
根据本公开的又一方面,还提供了一种计算机程序产品或计算机程序,该计算机程序产品或者计算机程序包括计算机可读指令,该计算机可读指令存储在计算机可读存储介质中。计算机设备的处理器可以从计算机可读存储介质读取该计算机可读指令,处理器执行该计算机可读指令,使得该计算机设备执行上述各个实施例中描述的偏移预取方法。
利用本公开实施例提供的偏移预取方法、执行偏移预取的装置、计算设备和介质,通过记录偏移预取器基于预置偏移值表格在时间上最新选择的偏移预取值来构建近期偏移值表格,以使得偏移预取器能够利用构建的近期偏移值表格来生成偏移预取值,从而降低偏移预取器生成偏移预取值所需的时间,由此实现提高偏移预取器的效率以及进一步提升CPU系统的运行效率。
本领域技术人员能够理解,本公开所披露的内容可以出现多种变型和改进。例如,以上所描述的各种设备或组件可以通过硬件实现,也可以通过软 件、固件、或者三者中的一些或全部的组合实现。
此外,虽然本公开对根据本公开的实施例的系统中的某些单元做出了各种引用,然而,任何数量的不同单元可以被使用并运行在客户端和/或服务器上。单元仅是说明性的,并且系统和方法的不同方面可以使用不同单元。
本公开中使用了流程图用来说明根据本公开的实施例的方法的步骤。应当理解的是,前面或后面的步骤不一定按照顺序来精确的进行。相反,可以按照倒序或同时处理各种步骤。同时,也可以将其他操作添加到这些过程中。
本领域普通技术人员可以理解上述方法中的全部或部分的步骤可通过计算机程序来指令相关硬件完成,程序可以存储于计算机可读存储介质中,如只读存储器、磁盘或光盘等。可选地,上述实施例的全部或部分步骤也可以使用一个或多个集成电路来实现。相应地,上述实施例中的各模块/单元可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。本公开并不限制于任何特定形式的硬件和软件的结合。
除非另有定义,这里使用的所有术语具有与本公开所属领域的普通技术人员共同理解的相同含义。还应当理解,诸如在通常字典里定义的那些术语应当被解释为具有与它们在相关技术的上下文中的含义相一致的含义,而不应用理想化或极度形式化的意义来解释,除非这里明确地这样定义。
以上是对本公开的说明,而不应被认为是对其的限制。尽管描述了本公开的若干示例性实施例,但本领域技术人员将容易地理解,在不背离本公开的新颖教学和优点的前提下可以对示例性实施例进行许多修改。因此,所有这些修改都意图包含在权利要求书所限定的本公开范围内。应当理解,上面是对本公开的说明,而不应被认为是限于所公开的特定实施例,并且对所公开的实施例以及其他实施例的修改意图包含在所附权利要求书的范围内。本公开由权利要求书及其等效物限定。

Claims (20)

  1. 一种偏移预取方法,包括:
    利用偏移预取器从预置偏移值表格中选择用于生成预取请求的K个偏移预取值,其中,所述预置偏移值表格包括预先设置的N个偏移值,所述K个偏移预取值为所述偏移预取器从所述预置偏移值表格中在时间上最新选择的偏移预取值,其中,N和K为正整数,N大于K;
    记录所述K个偏移预取值,用于形成包括所述K个偏移预取值的近期偏移值表格;以及
    利用所述偏移预取器从所述近期偏移值表格中选择第一偏移预取值,用于基于所述第一偏移预取值进行数据预取。
  2. 根据权利要求1所述的方法,其中,将选择所述第一偏移预取值的过程表示为第一训练阶段,所述方法还包括:
    对于选择第二偏移预取值的第二训练阶段,基于规则确定是采用所述预置偏移值表格还是采用所述近期偏移值表格来选择所述第二偏移预取值,其中,所述第二偏移预取值用于进行数据预取。
  3. 根据权利要求2所述的方法,其中,所述规则包括:
    确定所述第一偏移预取值的训练分数是否大于分数阈值,在所述训练分数大于所述分数阈值的情况下,确定采用所述近期偏移值表格来选择所述第二偏移预取值。
  4. 根据权利要求2或3所述的方法,其中,所述规则包括:
    按照交替方式来采用所述预置偏移值表格和所述近期偏移值表格来选择所述第二偏移预取值,其中,所述交替方式表示在所述第一训练阶段采用所述近期偏移值表格的情况下,在所述第二训练阶段采用所述预置偏移值表格来选择所述第二偏移预取值。
  5. 根据权利要求2-4中的任一项所述的方法,其中,所述方法还包括:
    在确定采用所述预置偏移值表格来选择所述第二偏移预取值的情况下,利用所述偏移预取器来从所述预置偏移值表格中选择所述第二偏移预取值,并基于所述第二偏移预取值来进行数据预取;或者
    在确定采用所述近期偏移值表格来选择所述第二偏移预取值的情况下, 利用所述偏移预取器来从所述近期偏移值表格中选择所述第二偏移预取值,并基于所述第二偏移预取值来进行数据预取。
  6. 根据权利要求2-5中的任一项所述的方法,其中,所述方法还包括:
    利用所述第二偏移预取值来更新所述近期偏移值表格。
  7. 根据权利要求6所述的方法,其中,所述更新所述近期偏移值表格包括:
    确定所述第二偏移预取值是否存在于所述近期偏移值表格之中;以及
    在确定所述第二偏移预取值不存在于所述近期偏移值表格之中的情况下,按照先入先出算法、最近最少使用算法或伪最近最少使用算法中的其中一种来更新所述近期偏移值表格。
  8. 根据权利要求1-7中的任一项所述的方法,其中,所述偏移预取器为最佳偏移预取器,所述K个偏移预取值为K个最佳偏移预取值。
  9. 根据权利要求1所述的方法,其中,在所述偏移预取器为用于第L级缓存的偏移预取器的情况下,所述基于所述第一偏移预取值进行数据预取包括:
    确定由所述第一偏移预取值与请求的地址组成的预取请求地址;
    确定与所述预取请求地址对应的数据是否存在于所述第L级缓存中;以及
    在确定与所述预取请求地址对应的数据不存在于所述第L级缓存中的情况下,触发与所述预取请求地址对应的预取请求,
    其中,L为大于1的整数。
  10. 一种执行偏移预取的装置,其中,所述执行偏移预取的装置包括偏移预取单元,所述偏移预取单元配置成:
    利用偏移预取器从预置偏移值表格中选择用于生成预取请求的K个偏移预取值,其中,所述预置偏移值表格包括预先设置的N个偏移值,所述K个偏移预取值为所述偏移预取器从所述预置偏移值表格中在时间上最新选择的偏移预取值,其中,N和K为正整数,N大于K;
    记录所述K个偏移预取值,用于形成包括所述K个偏移预取值的近期偏移值表格;以及
    利用所述偏移预取器从所述近期偏移值表格中选择第一偏移预取值,用 于基于所述第一偏移预取值进行数据预取。
  11. 根据权利要求10所述的装置,其中,将选择所述第一偏移预取值的过程表示为第一训练阶段,所述偏移预取单元还配置成:
    对于选择第二偏移预取值的第二训练阶段,基于规则确定是采用所述预置偏移值表格还是采用所述近期偏移值表格来选择所述第二偏移预取值,其中,所述第二偏移预取值用于进行数据预取。
  12. 根据权利要求11所述的装置,其中,所述规则包括:
    确定所述第一偏移预取值的训练分数是否大于分数阈值,在所述训练分数大于所述分数阈值的情况下,确定采用所述近期偏移值表格来选择所述第二偏移预取值。
  13. 根据权利要求11或12所述的装置,其中,所述规则包括:
    按照交替方式来采用所述预置偏移值表格和所述近期偏移值表格来选择所述第二偏移预取值,其中,所述交替方式表示在所述第一训练阶段采用所述近期偏移值表格的情况下,在所述第二训练阶段采用所述预置偏移值表格来选择所述第二偏移预取值。
  14. 根据权利要求11-13中的任一项所述的装置,其中,所述偏移预取单元还配置成:
    在确定采用所述预置偏移值表格来选择所述第二偏移预取值的情况下,利用所述偏移预取器来从所述预置偏移值表格中选择所述第二偏移预取值,并基于所述第二偏移预取值来进行数据预取;或者
    在确定采用所述近期偏移值表格来选择所述第二偏移预取值的情况下,利用所述偏移预取器来从所述近期偏移值表格中选择所述第二偏移预取值,并基于所述第二偏移预取值来进行数据预取。
  15. 根据权利要求11所述的装置,其中,所述偏移预取单元还配置成:
    利用所述第二偏移预取值来更新所述近期偏移值表格。
  16. 根据权利要求15所述的装置，其中，为更新所述近期偏移值表格，所述偏移预取单元还配置为：
    确定所述第二偏移预取值是否存在于所述近期偏移值表格之中;以及
    在确定所述第二偏移预取值不存在于所述近期偏移值表格之中的情况下,按照先入先出算法、最近最少使用算法或伪最近最少使用算法中的其中 一种来更新所述近期偏移值表格。
  17. 根据权利要求10-16中的任一项所述的装置,其中,所述偏移预取器为最佳偏移预取器,所述K个偏移预取值为K个最佳偏移预取值。
  18. 根据权利要求10所述的装置，其中，在所述偏移预取器为用于第L级缓存的偏移预取器的情况下，为基于所述第一偏移预取值进行数据预取，所述偏移预取单元还配置为：
    确定由所述第一偏移预取值与请求的地址组成的预取请求地址;
    确定与所述预取请求地址对应的数据是否存在于所述第L级缓存中;以及
    在确定与所述预取请求地址对应的数据不存在于所述第L级缓存中的情况下,触发与所述预取请求地址对应的预取请求,
    其中,L为大于1的整数。
  19. 一种计算设备,包括:
    处理器;和
    存储器,其中,所述存储器中存储有计算机可读代码,所述计算机可读代码在由所述处理器运行时,执行如权利要求1-9中任一项所述的偏移预取方法。
  20. 一种非暂时性计算机可读存储介质,其上存储有指令,所述指令在被处理器执行时,使得所述处理器执行如权利要求1-9中任一项所述的偏移预取方法。
PCT/CN2022/093310 2021-09-09 2022-05-17 偏移预取方法、执行偏移预取的装置、计算设备和介质 WO2023035654A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111054692.4A CN113778520B (zh) 2021-09-09 2021-09-09 偏移预取方法、执行偏移预取的装置、计算设备和介质
CN202111054692.4 2021-09-09

Publications (1)

Publication Number Publication Date
WO2023035654A1 true WO2023035654A1 (zh) 2023-03-16
