CN114756481A - Data prefetching method and device supporting multiple memory access modes - Google Patents

Data prefetching method and device supporting multiple memory access modes

Info

Publication number
CN114756481A
Authority
CN
China
Prior art keywords
offset
recording
access
bit
bit vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210361651.8A
Other languages
Chinese (zh)
Inventor
朱学斌
王永文
郭维
邓全
雷国庆
王俊辉
隋兵才
倪晓强
孙彩霞
黄立波
郑重
郭辉
李金玖
何燕东
彭令峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210361651.8A
Publication of CN114756481A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0862 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/60 Details of cache memory
    • G06F 2212/602 Details relating to cache prefetching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a data prefetching method and device supporting multiple memory access modes. When an instruction whose program counter (PC) value is P1 is detected accessing memory address A in the cache, the method issues a prefetch using the bit vector that records cache-line accesses in the spatial region to which A belongs, and at the same time issues an offset prefetch to the page to which A belongs using the most recently updated optimal offset D, as a supplement to the bit-vector prefetch. The access information is also used to update the bit vector, and the offsets observed recently in each spatial region are evaluated so that the optimal offset D is updated when a preset condition is met. Because prefetches are issued with both the optimal offset and the bit vector of the spatial region, the invention obtains good performance gains for programs with spatially dense memory access characteristics as well as for programs that frequently exhibit a stride access pattern, and effectively improves the cache hit rate.

Description

Data prefetching method and device supporting multiple memory access modes
Technical Field
The invention belongs to the field of microprocessor design, in particular to computer microarchitecture and memory-subsystem component design, and specifically relates to a data prefetching method and device supporting multiple memory access modes.
Background
The big data era has arrived, and for many big data applications long off-chip DRAM accesses are a well-known performance bottleneck. Because of the speed mismatch between the processor and off-chip DRAM, the processor can easily stall for hundreds of cycles on each DRAM access, resulting in a large performance penalty.
Cache hierarchies can improve program performance for data references with good temporal and spatial locality, but in some big data applications the data reference latency remains high because of the high cache miss rate.
Data prefetching is a well-studied technique that improves the cache hit rate: by identifying the specific patterns with which a program accesses memory, the cache lines containing the required data are fetched into an appropriate cache level before the data are actually accessed, which reduces the cache miss rate and the data reference latency. Existing prefetchers typically recognize only one memory access pattern, so they often perform poorly on programs whose access patterns differ and bring only limited performance improvement; combining several hardware prefetchers can recognize multiple access patterns and obtain good performance gains, but it incurs a large hardware overhead. For example, in programs with spatially dense memory access characteristics, recording the cache-line accesses of a spatial region in a bit vector and later issuing prefetches according to that bit vector yields good performance gains; in contrast, in programs that frequently exhibit a stride access pattern, issuing prefetches from such bit vectors may not deliver high performance gains.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems of the prior art, the invention evaluates the optimal offset of recent access behavior while using a bit vector to record the cache-line accesses within a spatial region, and issues prefetches using both the optimal offset and the bit vector of the spatial region. It thereby obtains good performance gains for programs with spatially dense access characteristics as well as for programs that frequently exhibit a stride access pattern, and effectively improves the cache hit rate.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a data prefetching method supporting multiple memory access modes comprises the following steps:
1) detecting instruction accesses to the cache, and jumping to the next step when an instruction whose program counter (PC) value is P1 is detected accessing memory address A in the cache;
2) issuing a prefetch using the bit vector that records the cache-line access information of the spatial region R1 to which memory address A belongs, and at the same time issuing an offset prefetch to the page to which memory address A belongs using the most recently updated optimal offset D, as a supplementary means to the bit-vector prefetch; and, using the access of the instruction with PC value P1 to memory address A in the cache, updating the bit vector that records the cache-line access information of the spatial region, while evaluating the offsets accessed recently in each spatial region so as to update the optimal offset D when a preset condition is met.
Optionally, denoting that memory address A belongs to the K1-th cache line of physical page J, issuing an offset prefetch to the page to which memory address A belongs using the most recently updated optimal offset D in step 2) comprises: judging whether the (K1+D)-th cache line belongs to physical page J, and if so, issuing a prefetch for the (K1+D)-th cache line of physical page J. Denoting that memory address A belongs to the L1-th cache line of spatial region R1, issuing a prefetch using the bit vector that records the cache-line access information of spatial region R1 in step 2) comprises: for each bit set to 1 in that bit vector, issuing a prefetch for the corresponding cache line of spatial region R1, where each bit set to 1 indicates that the corresponding cache line has been accessed recently.
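The prefetch-issue logic of step 2) can be illustrated with a minimal Python sketch. It is not part of the patent; it assumes 64 cache lines per physical page and 32 cache lines per spatial region, matching the 32-bit vectors and 64-line pages used in the embodiment, and all function and variable names are illustrative.

```python
LINES_PER_PAGE = 64      # assumption: a physical page holds 64 cache lines
LINES_PER_REGION = 32    # assumption: a spatial region holds 32 cache lines

def issue_prefetches(addr_line, optimal_offset_d, region_bit_vector):
    """addr_line is the global cache-line index of memory address A.
    Returns the list of cache-line indices to prefetch."""
    prefetches = []
    page_j, k1 = divmod(addr_line, LINES_PER_PAGE)       # A is line K1 of physical page J
    region_r1, l1 = divmod(addr_line, LINES_PER_REGION)  # A is line L1 of spatial region R1

    # Offset prefetch: only issued if line K1+D still lies inside physical page J.
    if 0 <= k1 + optimal_offset_d < LINES_PER_PAGE:
        prefetches.append(page_j * LINES_PER_PAGE + k1 + optimal_offset_d)

    # Bit-vector prefetch: every bit set to 1 marks a line of R1 accessed recently.
    for bit in range(LINES_PER_REGION):
        if (region_bit_vector >> bit) & 1 and bit != l1:   # skip the line being accessed
            prefetches.append(region_r1 * LINES_PER_REGION + bit)
    return prefetches
```

For example, with optimal_offset_d = 3 and a bit vector whose bits 4 to 7 are set, an access to line 0 of a region produces prefetches for lines 3 (offset prefetch) and 4, 5, 6, 7 (bit-vector prefetch).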
Optionally, the updating in step 2) of the bit vector information that records the cache-line access information of the spatial regions, and the evaluation of the offsets accessed recently in each spatial region so as to update the optimal offset D when a preset condition is met, comprise:
2.1) indexing a preset accumulation table with spatial region R1, the accumulation table being used to record, for each spatial region, the trigger access instruction PC that triggered access to the region, the offset within the region, and the bit vector of the region over the most recent period; if a matching entry is found, jumping to the next step; otherwise, jumping to step 2.6);
2.2) for each offset F within the region whose bit is set to 1 in the bit vector of the matching entry, computing the difference F-L1; if the difference F-L1 corresponds to an offset d in a preset offset score list, adding 1 to the score of offset d in the list, the offset score list being used to record the scores of a number of preset offsets;
2.3) setting bit L1 of the bit vector recorded for spatial region R1 in the accumulation table to 1;
2.4) incrementing the execution count counter by 1;
2.5) judging whether the count value of the execution count counter is greater than or equal to a first set value, or whether the offset score list contains an offset whose score is greater than a second set value; if either condition is met, selecting the offset with the highest score in the offset score list as the new optimal offset D, resetting the scores of all offsets in the offset score list, and resetting the execution count counter; then jumping to step 1);
2.6) indexing a preset filter table with spatial region R1, the filter table being used to record, for each spatial region, the trigger access instruction PC that triggered access to the region and the offset within the region; if a matching entry is found, jumping to the next step; otherwise, jumping to step 2.9);
2.7) for the trigger access instruction of the matching entry, whose program counter PC value is P2 and whose offset is L2, computing the difference L2-L1; if the difference L2-L1 corresponds to an offset d in the preset offset score list, adding 1 to the score of offset d in the list;
2.8) evicting the matching entry from the filter table, allocating a new entry for spatial region R1 in the accumulation table, recording in it the program counter PC value P2 and the offset L2 of the evicted trigger access instruction, and setting bits L1 and L2 of the new entry's bit vector to 1 and all other bits to 0; if allocating the new entry for spatial region R1 evicts a valid entry from the accumulation table, recording the evicted valid entry in a preset pattern history table, the pattern history table being used to record the bit vectors transferred from the accumulation table; then jumping to step 2.5);
2.9) judging that this access is the trigger access to spatial region R1, recording the information of the trigger access in the filter table, then indexing the pattern history table with the program counter PC value P1 of the current access instruction and cache line L1, and, if a matching entry is found in the pattern history table, issuing prefetches for the corresponding cache lines of spatial region R1 according to the bits set to 1 in the bit vector of that matching entry; then jumping to step 2.1). An illustrative sketch of steps 2.1) to 2.9) is given below.
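The following condensed, dictionary-backed Python sketch illustrates steps 2.1) to 2.9); it is not the patent's hardware design. The set-associative LRU tables of the embodiment are replaced by unbounded dictionaries (so the capacity eviction of step 2.8) and FIG. 3 is omitted), the loop-back jumps between steps are flattened into a single pass per access, the differences follow the step text as written (F-L1 and L2-L1), the initial optimal offset of 1 is an assumption, and all names are illustrative.

```python
PRESET_OFFSETS = [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 27, 30]

class SpatialOffsetPrefetcher:
    def __init__(self, first_set_value=512, second_set_value=32, region_lines=32):
        self.accumulation = {}       # region -> [trigger_pc, trigger_offset, bit_vector]
        self.filter = {}             # region -> (trigger_pc, trigger_offset)
        self.pattern_history = {}    # (trigger_pc, trigger_offset) -> bit_vector
        self.scores = {d: 0 for d in PRESET_OFFSETS}
        self.counter = 0             # execution count counter
        self.optimal_offset = 1      # assumed initial value
        self.first_set_value = first_set_value
        self.second_set_value = second_set_value
        self.region_lines = region_lines

    def train(self, pc, region_r1, l1):
        """Returns the pattern-history bit vector to prefetch from (0 if none)."""
        if region_r1 in self.accumulation:                   # step 2.1)
            entry = self.accumulation[region_r1]
            for f in range(self.region_lines):               # step 2.2): score F-L1
                if (entry[2] >> f) & 1 and (f - l1) in self.scores:
                    self.scores[f - l1] += 1
            entry[2] |= 1 << l1                               # step 2.3)
            self.counter += 1                                 # step 2.4)
            self._maybe_update_optimal_offset()               # step 2.5)
        elif region_r1 in self.filter:                        # step 2.6)
            p2, l2 = self.filter.pop(region_r1)               # step 2.8): promote entry
            if (l2 - l1) in self.scores:                      # step 2.7): score L2-L1
                self.scores[l2 - l1] += 1
            self.accumulation[region_r1] = [p2, l2, (1 << l1) | (1 << l2)]
            self._maybe_update_optimal_offset()               # step 2.5)
        else:                                                 # step 2.9): trigger access
            self.filter[region_r1] = (pc, l1)
            return self.pattern_history.get((pc, l1), 0)
        return 0

    def _maybe_update_optimal_offset(self):                   # step 2.5)
        if (self.counter >= self.first_set_value
                or max(self.scores.values()) > self.second_set_value):
            self.optimal_offset = max(self.scores, key=self.scores.get)
            self.scores = {d: 0 for d in self.scores}
            self.counter = 0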
Optionally, recording the evicted valid entry in the preset pattern history table in step 2.8) comprises: allocating a new entry in the pattern history table for the evicted valid entry, and recording in the allocated new entry the program counter PC value of the instruction, the offset and the bit vector of the evicted valid entry.
Optionally, recording the information of the trigger access in the filter table in step 2.9) comprises: allocating a new entry for spatial region R1 in the filter table, and recording in the allocated new entry the information of the trigger access, namely the program counter PC value P1 of the trigger access instruction and the cache line L1 within the accessed spatial region.
Optionally, when issuing prefetches in step 2.9) for the cache lines of spatial region R1 according to the bits set to 1 in the bit vector of the matching entry in the pattern history table, if only one bit of that bit vector is set to 1, a prefetch is issued for the one corresponding cache line of spatial region R1; if multiple bits are set to 1, prefetches are issued for the multiple corresponding cache lines of spatial region R1.
Optionally, the fields of each entry of the filter table comprise four fields: a spatial region field, a content field, a valid bit field and an LRU bit field, where the spatial region field records the address of the spatial region, the content field records the program counter PC value of the trigger access instruction corresponding to the spatial region and the offset within the spatial region, the valid bit field indicates the valid state of the entry, and the LRU bit field supports the LRU replacement policy of the table;
the fields of each entry of the accumulation table comprise a spatial region field, a content field, a valid bit field and an LRU bit field, where the spatial region field records the address of the spatial region, the content field records the program counter PC value of the trigger access instruction corresponding to the spatial region, the offset within the spatial region and the bit vector, the valid bit field indicates the valid state of the entry, and the LRU bit field supports the LRU replacement policy of the table;
the fields of each entry of the pattern history table comprise four fields: a content field, a bit vector field, a valid bit field and an LRU bit field, where the content field records the program counter PC value of the trigger access instruction corresponding to the spatial region and the offset within the spatial region, the bit vector field records the corresponding bit vector, the valid bit field indicates the valid state of the entry, and the LRU bit field supports the LRU replacement policy of the table; each entry of the offset score list comprises two fields, an offset and a score, the offsets being the eighteen fixed offsets 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 27 and 30.
In addition, the invention also provides a data prefetching device supporting multiple memory access modes, comprising a control unit, an execution count counter and a plurality of storage tables, wherein the storage tables comprise:
a filter table, indexed by the region number of a spatial region, for recording the trigger access instruction PC that triggered access to the spatial region and the offset within the spatial region;
an accumulation table, indexed by the region number of a spatial region, for recording the trigger access instruction PC that triggered access to the spatial region, the offset within the spatial region and the bit vector of the spatial region over the most recent period;
a pattern history table, indexed by the trigger access instruction PC and the offset within a spatial region, for recording the bit vectors transferred from the accumulation table;
an offset score list, for recording different offsets and their respective scores;
wherein the control unit is connected to the execution count counter and the plurality of storage tables, and is programmed or configured to execute the steps of the above data prefetching method supporting multiple memory access modes.
In addition, the invention also provides a microprocessor, which comprises a cache and a prefetching unit for prefetching data into the cache, wherein the prefetching unit is programmed or configured to execute the steps of the data prefetching method supporting multiple memory access modes.
In addition, the present invention also provides a computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, and the computer program is used for being executed by a microprocessor to implement the steps of the data prefetching method supporting multiple memory access modes.
Compared with the prior art, the invention mainly has the following advantages: the invention records the access information of the cache lines in each spatial region with bit vectors and uses those bit vectors to issue prefetches for the cache lines of a spatial region when a similar spatial region is accessed in the future; while recording this information it also evaluates the optimal offset of recent access behavior, and uses that optimal offset to issue prefetches the next time an access to the cache is detected. Without adding much hardware overhead, the prefetcher can therefore evaluate the optimal offset of recent access behavior while using bit vectors to record the cache-line access information of spatial regions, and can issue prefetches using both the optimal offset and the bit vector of the spatial region, which solves the problem of low prefetch coverage in programs with frequent stride access patterns when prefetches are issued from the bit-vector information alone.
Drawings
FIG. 1 is a schematic diagram of a basic process flow of a method according to an embodiment of the present invention.
FIG. 2 shows the detailed steps for prefetching and recording information according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an embodiment of a process for detecting eviction of a line from a cache.
FIG. 4 is a field structure diagram of a filter table according to an embodiment of the present invention.
FIG. 5 is a field structure diagram of an accumulation table according to an embodiment of the present invention.
FIG. 6 is a field structure diagram of a pattern history table according to an embodiment of the present invention.
Fig. 7 is a field structure diagram of an offset score list in an embodiment of the present invention.
Detailed Description
As shown in FIG. 1, the data prefetching method supporting multiple memory access modes of this embodiment comprises:
1) detecting instruction accesses to the cache, and jumping to the next step when an instruction whose program counter (PC) value is P1 is detected accessing memory address A in the cache;
2) issuing a prefetch using the bit vector that records the cache-line access information of the spatial region R1 to which memory address A belongs, and at the same time issuing an offset prefetch to the page to which memory address A belongs using the most recently updated optimal offset D, as a supplementary means to the bit-vector prefetch; and, using the access of the instruction with PC value P1 to memory address A in the cache, updating the bit vector that records the cache-line access information of the spatial region, while evaluating the offsets accessed recently in each spatial region so as to update the optimal offset D when a preset condition is met.
With the above data prefetching method, the optimal offset of recent access behavior is evaluated while the bit vector records the cache-line accesses within a spatial region, and prefetches are issued using both the optimal offset and the bit vector of the spatial region. Programs with spatially dense access characteristics therefore obtain good performance gains, programs that frequently exhibit a stride access pattern also obtain good performance gains, and the cache hit rate is effectively improved.
In this embodiment, denoting that memory address A belongs to the K1-th cache line of physical page J, issuing an offset prefetch to the page to which memory address A belongs using the most recently updated optimal offset D in step 2) comprises: judging whether the (K1+D)-th cache line belongs to physical page J, and if so, issuing a prefetch for the (K1+D)-th cache line of physical page J. Denoting that memory address A belongs to the L1-th cache line of spatial region R1, issuing a prefetch using the bit vector that records the cache-line access information of spatial region R1 in step 2) comprises: for each bit set to 1 in that bit vector, issuing a prefetch for the corresponding cache line of spatial region R1, where each bit set to 1 indicates that the corresponding cache line has been accessed recently.
In step 2), according to the current access information, the access information of the spatial region is recorded with the bit vector while the offsets accessed recently in each spatial region are evaluated, so that the optimal offset D can be updated when a preset condition is met. Recording the memory access information of a spatial region with a bit vector means computing which cache line of which spatial region the accessed address belongs to: for example, if the currently accessed memory address A belongs to the L1-th cache line of spatial region R1, bit L1 of the bit vector recording the access information of spatial region R1 is set to 1, indicating that the L1-th cache line has been accessed. The offsets are evaluated by computing the difference between the currently accessed cache line and a previously accessed cache line within the spatial region and scoring the resulting offset: for example, if the currently accessed memory address A belongs to the L1-th cache line of spatial region R1 and the bit vector of R1 shows that the L2-th cache line was accessed earlier, the score of offset L1-L2 is increased by 1; when the preset condition is met, the offset with the highest score is taken as the optimal offset D. The preset condition may be, as required, that the number of updates exceeds a set value or that the highest score exceeds a set value.
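As a concrete illustration of this scoring step (a sketch, not the patent's implementation, following the L1-L2 convention of the example above, with illustrative names): if the bit vector of region R1 shows that line 9 was accessed earlier and the current access falls on line 12, the difference 12 - 9 = 3 is in the preset offset list, so the score of offset 3 is incremented.

```python
scores = {d: 0 for d in (1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 27, 30)}

def score_access(bit_vector, l1, scores, region_lines=32):
    """Score every offset l1 - l2 for which line l2 of the region was seen earlier."""
    for l2 in range(region_lines):
        if (bit_vector >> l2) & 1 and (l1 - l2) in scores:
            scores[l1 - l2] += 1

score_access(bit_vector=1 << 9, l1=12, scores=scores)
print(scores[3])   # 1: offset 3 received one point
```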
Referring to FIG. 2, as an alternative embodiment, the updating in step 2) of the bit vector information that records the cache-line access information of the spatial regions, and the evaluation of the offsets accessed recently in each spatial region so as to update the optimal offset D when a preset condition is met, comprise:
2.1) indexing a preset accumulation table with spatial region R1, the accumulation table being used to record, for each spatial region, the trigger access instruction PC that triggered access to the region, the offset within the region, and the bit vector of the region over the most recent period; if a matching entry is found, jumping to the next step; otherwise, jumping to step 2.6);
2.2) for each offset F within the region whose bit is set to 1 in the bit vector of the matching entry, computing the difference F-L1; if the difference F-L1 corresponds to an offset d in a preset offset score list, adding 1 to the score of offset d in the list, the offset score list being used to record the scores (occurrence counts) of a number of preset offsets;
2.3) setting bit L1 of the bit vector recorded for spatial region R1 in the accumulation table to 1;
2.4) incrementing the execution count counter by 1; in this embodiment the execution count counter is named test_round, referred to as the test_round counter for short, and it is initialized to 0 in advance;
2.5) judging whether the count value of the execution count counter is greater than or equal to a first set value, or whether the offset score list contains an offset whose score is greater than a second set value; if either condition is met, selecting the offset with the highest score in the offset score list as the new optimal offset D, resetting the scores of all offsets in the offset score list, and resetting the execution count counter; then jumping to step 1); the first set value and the second set value can be chosen according to actual needs; as a preferred embodiment, the first set value is 512 and the second set value is 32, so if the count value of the execution count counter reaches 512 or the score of some offset exceeds 32, the offset with the highest score at that moment is taken as the optimal offset D;
2.6) indexing a preset filter table with spatial region R1, the filter table being used to record, for each spatial region, the trigger access instruction PC that triggered access to the region and the offset within the region; if a matching entry is found, jumping to the next step; otherwise, jumping to step 2.9);
2.7) for the trigger access instruction of the matching entry, whose program counter PC value is P2 and whose offset is L2, computing the difference L2-L1; if the difference L2-L1 corresponds to an offset d in the preset offset score list, adding 1 to the score of offset d in the list;
2.8) evicting the matching entry from the filter table, allocating a new entry for spatial region R1 in the accumulation table, recording in it the program counter PC value P2 and the offset L2 of the evicted trigger access instruction, and setting bits L1 and L2 of the new entry's bit vector to 1 and all other bits to 0; referring to FIG. 2, if allocating the new entry for spatial region R1 evicts a valid entry from the accumulation table, the evicted valid entry is recorded in a preset pattern history table, which records the bit vectors transferred from the accumulation table; then jumping to step 2.5);
2.9) judging that this access is the trigger access to spatial region R1, recording the information of the trigger access in the filter table, then indexing the pattern history table with the program counter PC value P1 of the current access instruction and cache line L1, and, if a matching entry is found in the pattern history table, issuing prefetches for the corresponding cache lines of spatial region R1 according to the bits set to 1 in the bit vector of that matching entry (for example, if the bit vector of the matching entry is 00001111000000000000000000000000, bits 4, 5, 6 and 7 are 1 and all other bits are 0, so prefetches are issued for cache lines 4, 5, 6 and 7 of spatial region R1; a short sketch of this decoding follows these steps); then jumping to step 2.1).
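The following short sketch (illustrative only) decodes the example bit vector of step 2.9), treating bit 0 as the leftmost character of the written vector:

```python
vector = "00001111000000000000000000000000"           # example from step 2.9)
lines_to_prefetch = [i for i, b in enumerate(vector) if b == "1"]
print(lines_to_prefetch)                               # [4, 5, 6, 7]
```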
With the scheme of steps 2.1) to 2.9), all of the proposed hardware structures are fully utilized from the point of view of hardware resource usage; in particular, the accumulation table not only records the bit vectors of the spatial regions but also generates the optimal offset of the recent memory access pattern, so one table serves two purposes. From the point of view of prefetch coverage, the proposed design has higher prefetch coverage than issuing prefetches with either the spatial-region bit vector or the optimal offset alone, and achieves the highest prefetch coverage in most applications.
Referring to FIG. 3, whenever a cache line is detected being evicted from the cache, the accumulation table is indexed with the spatial region of that cache line; if a matching entry exists in the accumulation table, it is evicted from the accumulation table, and the trigger access information and bit vector it records are transferred to the pattern history table. For example, when cache line E is detected being evicted from the cache, the accumulation table is indexed with the spatial region R3 to which E belongs; if a matching entry exists, it is evicted and its information is transferred to the pattern history table: if the evicted entry records trigger access instruction PC P4, offset L4 and bit vector 00001111111111111111111111111111, a new entry is allocated in the pattern history table for instruction PC P4 and offset L4, and the bit vector 00001111111111111111111111111111 is stored in it. The method then continues to detect whether a cache line is evicted from the cache. Recording the evicted valid entry in the preset pattern history table in step 2.8) of this embodiment comprises: allocating a new entry in the pattern history table for the evicted valid entry, and recording in the allocated new entry the program counter PC value of the instruction, the offset and the bit vector of the evicted valid entry.
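A minimal sketch of this eviction handling, assuming the dictionary-backed tables of the earlier sketch rather than the patent's hardware tables (the helper name is illustrative):

```python
def on_cache_line_evicted(evicted_line, prefetcher, lines_per_region=32):
    """When a cache line leaves the cache, move the matching accumulation-table
    entry (if any) into the pattern history table, keyed by its trigger PC and offset."""
    region = evicted_line // lines_per_region
    entry = prefetcher.accumulation.pop(region, None)
    if entry is not None:
        trigger_pc, trigger_offset, bit_vector = entry
        prefetcher.pattern_history[(trigger_pc, trigger_offset)] = bit_vector
```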
In this embodiment, recording the information of the trigger access in the filter table in step 2.9) comprises: allocating a new entry for spatial region R1 in the filter table, and recording in the allocated new entry the information of the trigger access, namely the program counter PC value P1 of the trigger access instruction and the cache line L1 within the accessed spatial region.
In this embodiment, when issuing prefetches in step 2.9) for the cache lines of spatial region R1 according to the bits set to 1 in the bit vector of the matching entry in the pattern history table, if only one bit of that bit vector is set to 1, a prefetch is issued for the one corresponding cache line of spatial region R1; if multiple bits are set to 1, prefetches are issued for the multiple corresponding cache lines of spatial region R1.
As shown in FIG. 4, the fields of each entry of the filter table in this embodiment comprise a spatial region field, a content field, a valid bit field and an LRU (least recently used) bit field, where the spatial region field records the address of the spatial region, the content field records the program counter PC value of the trigger access instruction corresponding to the spatial region and the offset within the spatial region, the valid bit field indicates the valid state of the entry, and the LRU bit field supports the LRU replacement policy of the table. The filter table in this embodiment is 16-way set associative with 128 entries in total and is updated with an LRU replacement policy; this is not described in further detail here, since updating a table with an LRU replacement policy is a well-known method.
As shown in FIG. 5, the fields of each entry of the accumulation table in this embodiment comprise four fields: a spatial region field, a content field, a valid bit field and an LRU bit field, where the spatial region field records the address of the spatial region, the content field records the program counter PC value of the trigger access instruction corresponding to the spatial region, the offset within the spatial region and the bit vector, the valid bit field indicates the valid state of the entry, and the LRU bit field supports the LRU replacement policy of the table; the accumulation table in this embodiment is 16-way set associative with 128 entries in total and is updated with an LRU (least recently used) replacement policy.
As shown in FIG. 6, the fields of each entry of the pattern history table in this embodiment comprise four fields: a content field, a bit vector field, a valid bit field and an LRU bit field, where the content field records the program counter PC value of the trigger access instruction corresponding to the spatial region and the offset within the spatial region, the bit vector field records the corresponding bit vector, the valid bit field indicates the valid state of the entry, and the LRU bit field supports the LRU replacement policy of the table; the pattern history table in this embodiment is 16-way set associative with 8192 entries in total and is updated with an LRU (least recently used) replacement policy.
As shown in FIG. 7, each entry of the offset score list in this embodiment comprises two fields, an offset and a score. The offset prefetching scheme does not consider prefetches that cross page boundaries, so the selection of offsets in the table depends on the size of a memory page. Considering that the page size in most computing systems is 64 cache lines, the selectable offset range is -63 to +63; as a rule of thumb, offsets that are negative, large, or have prime factors larger than 5 do not provide sufficient benefit, so the eighteen fixed offsets 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 27 and 30 are selected for use.
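The entry layouts of FIGs. 4 to 7 can be summarized in a small structural sketch (illustrative only: field widths and the 16-way LRU organization are modeled as comments, and the type names are invented for the illustration):

```python
from dataclasses import dataclass

@dataclass
class FilterEntry:            # FIG. 4: 128 entries, 16-way set associative, LRU-replaced
    spatial_region: int       # address (region number) of the spatial region
    trigger_pc: int           # PC of the trigger access instruction    ("content")
    trigger_offset: int       # offset of the trigger access in region  ("content")
    valid: bool
    lru: int                  # LRU state used by the replacement policy

@dataclass
class AccumulationEntry:      # FIG. 5: 128 entries, 16-way set associative, LRU-replaced
    spatial_region: int
    trigger_pc: int
    trigger_offset: int
    bit_vector: int           # one bit per cache line of the region accessed recently
    valid: bool
    lru: int

@dataclass
class PatternHistoryEntry:    # FIG. 6: 8192 entries, 16-way set associative, LRU-replaced
    trigger_pc: int           # "content": PC of the trigger access
    trigger_offset: int       # "content": offset of the trigger access
    bit_vector: int
    valid: bool
    lru: int

# FIG. 7: offset score list, one (offset, score) pair per preset offset.
OFFSET_SCORE_LIST = {d: 0 for d in (1, 2, 3, 4, 5, 6, 8, 9, 10, 12,
                                    15, 16, 18, 20, 24, 25, 27, 30)}
```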
In summary, the method of this embodiment solves the problem that prefetch coverage is insufficient when only one prefetching method is used, and that a large hardware overhead is incurred when multiple prefetching methods are combined. At present, most hardware prefetchers either detect only one memory access pattern and issue prefetches for it, or detect several different access patterns and issue prefetches by using several different hardware storage tables; the latter require additional storage tables and therefore bring a large additional hardware storage overhead. The method of this embodiment provides a data prefetching scheme supporting multiple memory access modes: while the bit vectors of the spatial regions are being recorded, the storage table used for those bit vectors is also used to evaluate the optimal offset of the recent access pattern, and prefetches are issued using the optimal offset and the spatial-region bit vectors at the same time, so that an additional prefetching mode is added without adding an additional storage table.
In addition, this embodiment further provides a data prefetching device supporting multiple memory access modes, comprising a control unit, an execution count counter and a plurality of storage tables, wherein the storage tables comprise:
a filter table, indexed by the region number of a spatial region, for recording the trigger access instruction PC that triggered access to the spatial region and the offset within the spatial region;
an accumulation table, indexed by the region number of a spatial region, for recording the trigger access instruction PC that triggered access to the spatial region, the offset within the spatial region and the bit vector of the spatial region over the most recent period;
a pattern history table, indexed by the trigger access instruction PC and the offset within a spatial region, for recording the bit vectors transferred from the accumulation table;
an offset score list, for recording different offsets and their respective scores;
wherein the control unit is connected to the execution count counter and the plurality of storage tables, and is programmed or configured to execute the steps of the above data prefetching method supporting multiple memory access modes.
In addition, the present embodiment also provides a microprocessor, which includes a cache and a prefetch unit for prefetching data into the cache, wherein the prefetch unit is programmed or configured to execute the steps of the data prefetch method supporting multiple memory access modes.
In addition, the present embodiment also provides a computer readable storage medium, in which a computer program is stored, and the computer program is used for being executed by a microprocessor to implement the steps of the data prefetching method supporting multiple memory access modes.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to the above embodiment; all technical solutions falling under the idea of the present invention belong to the scope of protection of the present invention. It should be noted that, for those skilled in the art, modifications and adaptations made without departing from the principles of the present invention should also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A data prefetching method supporting multiple memory access modes is characterized by comprising the following steps:
1) detecting instruction accesses to the cache, and jumping to the next step when an instruction whose program counter (PC) value is P1 is detected accessing memory address A in the cache;
2) issuing a prefetch using the bit vector that records the cache-line access information of the spatial region R1 to which memory address A belongs, and at the same time issuing an offset prefetch to the page to which memory address A belongs using the most recently updated optimal offset D, as a supplementary means to the bit-vector prefetch; and, using the access of the instruction with PC value P1 to memory address A in the cache, updating the bit vector that records the cache-line access information of the spatial region, while evaluating the offsets accessed recently in each spatial region so as to update the optimal offset D when a preset condition is met.
2. The data prefetching method supporting multiple memory access modes according to claim 1, wherein, denoting that memory address A belongs to the K1-th cache line of physical page J, issuing the offset prefetch to the page to which memory address A belongs using the most recently updated optimal offset D in step 2) comprises: judging whether the (K1+D)-th cache line belongs to physical page J, and if so, issuing a prefetch for the (K1+D)-th cache line of physical page J; and, denoting that memory address A belongs to the L1-th cache line of spatial region R1, issuing the prefetch using the bit vector that records the cache-line access information of spatial region R1 in step 2) comprises: for each bit set to 1 in that bit vector, issuing a prefetch for the corresponding cache line of spatial region R1, where each bit set to 1 indicates that the corresponding cache line has been accessed recently.
3. The data prefetching method supporting multiple memory access modes according to claim 2, wherein the updating in step 2) of the bit vector information that records the cache-line access information of the spatial regions, and the evaluation of the offsets accessed recently in each spatial region so as to update the optimal offset D when a preset condition is met, comprise:
2.1) indexing a preset accumulation table with spatial region R1, the accumulation table being used to record, for each spatial region, the trigger access instruction PC that triggered access to the region, the offset within the region, and the bit vector of the region over the most recent period; if a matching entry is found, jumping to the next step; otherwise, jumping to step 2.6);
2.2) for each offset F within the region whose bit is set to 1 in the bit vector of the matching entry, computing the difference F-L1; if the difference F-L1 corresponds to an offset d in a preset offset score list, adding 1 to the score of offset d in the list, the offset score list being used to record the scores of a number of preset offsets;
2.3) setting bit L1 of the bit vector recorded for spatial region R1 in the accumulation table to 1;
2.4) incrementing the execution count counter by 1;
2.5) judging whether the count value of the execution count counter is greater than or equal to a first set value, or whether the offset score list contains an offset whose score is greater than a second set value; if either condition is met, selecting the offset with the highest score in the offset score list as the new optimal offset D, resetting the scores of all offsets in the offset score list, and resetting the execution count counter; then jumping to step 1);
2.6) indexing a preset filter table with spatial region R1, the filter table being used to record, for each spatial region, the trigger access instruction PC that triggered access to the region and the offset within the region; if a matching entry is found, jumping to the next step; otherwise, jumping to step 2.9);
2.7) for the trigger access instruction of the matching entry, whose program counter PC value is P2 and whose offset is L2, computing the difference L2-L1; if the difference L2-L1 corresponds to an offset d in the preset offset score list, adding 1 to the score of offset d in the list;
2.8) evicting the matching entry from the filter table, allocating a new entry for spatial region R1 in the accumulation table, recording in it the program counter PC value P2 and the offset L2 of the evicted trigger access instruction, and setting bits L1 and L2 of the new entry's bit vector to 1 and all other bits to 0; if allocating the new entry for spatial region R1 evicts a valid entry from the accumulation table, recording the evicted valid entry in a preset pattern history table, the pattern history table being used to record the bit vectors transferred from the accumulation table; then jumping to step 2.5);
2.9) judging that this access is the trigger access to spatial region R1, recording the information of the trigger access in the filter table, then indexing the pattern history table with the program counter PC value P1 of the current access instruction and cache line L1, and, if a matching entry is found in the pattern history table, issuing prefetches for the corresponding cache lines of spatial region R1 according to the bits set to 1 in the bit vector of that matching entry; then jumping to step 2.1).
4. The data prefetching method supporting multiple memory access modes according to claim 3, wherein recording the evicted valid entry in the preset pattern history table in step 2.8) comprises: allocating a new entry in the pattern history table for the evicted valid entry, and recording in the allocated new entry the program counter PC value of the instruction, the offset and the bit vector of the evicted valid entry.
5. The data prefetching method supporting multiple memory access modes according to claim 3, wherein recording the information of the trigger access in the filter table in step 2.9) comprises: allocating a new entry for spatial region R1 in the filter table, and recording in the allocated new entry the information of the trigger access, namely the program counter PC value P1 of the trigger access instruction and the cache line L1 within the accessed spatial region.
6. The data prefetching method supporting multiple memory access modes according to claim 3, wherein, when issuing prefetches in step 2.9) for the cache lines of spatial region R1 according to the bits set to 1 in the bit vector of the matching entry in the pattern history table, if only one bit of that bit vector is set to 1, a prefetch is issued for the one corresponding cache line of spatial region R1; if multiple bits are set to 1, prefetches are issued for the multiple corresponding cache lines of spatial region R1.
7. The data prefetching method supporting multiple memory access modes according to claim 3, wherein the fields of each entry of the filter table comprise a spatial region field, a content field, a valid bit field and an LRU bit field, where the spatial region field records the address of the spatial region, the content field records the program counter PC value of the trigger access instruction corresponding to the spatial region and the offset within the spatial region, the valid bit field indicates the valid state of the entry, and the LRU bit field supports the LRU replacement policy of the table;
the fields of each entry of the accumulation table comprise a spatial region field, a content field, a valid bit field and an LRU bit field, where the spatial region field records the address of the spatial region, the content field records the program counter PC value of the trigger access instruction corresponding to the spatial region, the offset within the spatial region and the bit vector, the valid bit field indicates the valid state of the entry, and the LRU bit field supports the LRU replacement policy of the table;
the fields of each entry of the pattern history table comprise four fields: a content field, a bit vector field, a valid bit field and an LRU bit field, where the content field records the program counter PC value of the trigger access instruction corresponding to the spatial region and the offset within the spatial region, the bit vector field records the corresponding bit vector, the valid bit field indicates the valid state of the entry, and the LRU bit field supports the LRU replacement policy of the table; each entry of the offset score list comprises two fields, an offset and a score, the offsets being the eighteen fixed offsets 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24, 25, 27 and 30.
8. A data prefetching device supporting multiple memory access modes, characterized by comprising a control unit, an execution count counter and a plurality of storage tables, wherein the storage tables comprise:
a filter table, indexed by the region number of a spatial region, for recording the trigger access instruction PC that triggered access to the spatial region and the offset within the spatial region;
an accumulation table, indexed by the region number of a spatial region, for recording the trigger access instruction PC that triggered access to the spatial region, the offset within the spatial region and the bit vector of the spatial region over the most recent period;
a pattern history table, indexed by the trigger access instruction PC and the offset within a spatial region, for recording the bit vectors transferred from the accumulation table;
an offset score list, for recording different offsets and their respective scores;
wherein the control unit is connected to the execution count counter and the plurality of storage tables, and is programmed or configured to execute the steps of the data prefetching method supporting multiple memory access modes according to any one of claims 1 to 7.
9. A microprocessor including a cache and a prefetch unit for prefetching data into the cache, wherein the prefetch unit is programmed or configured to perform the steps of the method of prefetching data supporting multiple memory access modes of any of claims 1-7.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is used for being executed by a microprocessor to implement the steps of the data prefetching method supporting multiple memory access modes according to any one of claims 1 to 7.
Application CN202210361651.8A, filed 2022-04-07 (priority date 2022-04-07): Data prefetching method and device supporting multiple memory access modes. Status: Pending. Publication: CN114756481A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210361651.8A CN114756481A (en) 2022-04-07 2022-04-07 Data prefetching method and device supporting multiple memory access modes

Publications (1)

Publication Number Publication Date
CN114756481A true CN114756481A (en) 2022-07-15

Family

ID=82330172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210361651.8A Pending CN114756481A (en) 2022-04-07 2022-04-07 Data prefetching method and device supporting multiple memory access modes

Country Status (1)

Country Link
CN (1) CN114756481A (en)


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination