CN116383100A - Cache data prefetching method and system based on merging bit vector access mode - Google Patents

Cache data prefetching method and system based on merging bit vector access mode

Info

Publication number
CN116383100A
Authority
CN
China
Prior art keywords
offset
access
counter
vector
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211716885.6A
Other languages
Chinese (zh)
Inventor
蒋实知
杨秋松
慈轶为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Advanced Research Institute of CAS
Original Assignee
Shanghai Advanced Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Advanced Research Institute of CAS filed Critical Shanghai Advanced Research Institute of CAS
Priority to CN202211716885.6A priority Critical patent/CN116383100A/en
Publication of CN116383100A publication Critical patent/CN116383100A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The cache data prefetching method and system based on the merged bit vector access mode merge bit vector patterns using counter vectors, which reduces metadata storage overhead, extracts the commonality among bit vectors, effectively improves coverage, lowers the cache miss rate, and improves overall performance. The final prefetch target is determined by the access frequency of each offset; compared with the prior art, which predicts prefetch targets using only index and tag matching, prediction accuracy is greatly improved. The invention abandons indexing by combinations of multiple features, so the index state space no longer explodes and hardware overhead is reduced. The access patterns corresponding to different feature indexes are stored in a dual mode table, which improves performance while keeping overall overhead under control.

Description

Cache data prefetching method and system based on merging bit vector access mode
Technical Field
The present disclosure relates to the field of CPU cache data prefetching, and in particular to a cache data prefetching method and system based on a merged bit vector access mode.
Background
High memory access latency is one of the main bottlenecks limiting CPU performance. A cache data prefetcher predicts the data addresses that CPU computation will require and loads the data from memory into the cache in advance, reducing average access latency and improving overall CPU performance. Coverage, accuracy, and timeliness are the three major factors that measure prefetcher performance.
At present, some advanced prefetching methods extract program access behavior patterns from the sequence of CPU access requests, organize the access patterns into bit vectors, and index the access patterns by multiple program features, greatly improving overall CPU performance. Such prefetchers include SMS, DSPatch, and Bingo. However, these prefetchers require a very large hardware storage space, resulting in huge hardware overhead.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present application is to provide a cache data prefetching method and system based on a merged bit vector access mode, to solve the technical problem that existing prefetchers require a huge hardware storage space and thus incur huge hardware overhead.
To achieve the above and other related objects, a first aspect of the present application provides a cache data prefetching method based on a merged bit vector access mode, including: in the data cache, collecting the target physical addresses and program counters of CPU data load requests with an accumulation table and a filter table, recording the access history within physical regions of a preset size, and organizing each history into bit vector form, where each bit vector records the access history in the interval from the first access in a physical region to the first eviction from the data cache; merging the generated bit vectors into the counter vectors corresponding to the dual mode table; treating access requests recorded in neither the accumulation table nor the filter table as trigger access requests; when a trigger access request is identified, querying the corresponding counter vectors in the dual mode table, using the physical address offset and the program counter as indexes; if counter vectors are obtained, computing two prefetch target vectors by comparing the access frequency of each physical address offset with the prefetch thresholds; arbitrating the two prefetch target vectors from the dual mode table to generate a final prefetch target vector; storing the final prefetch target vector obtained by arbitration into the prefetch buffer; querying whether at least one miss status handling register (MSHR) is available; if an MSHR is available, assembling prefetch target addresses from the offsets closest to the trigger offset and the page address of the current trigger access request, and issuing prefetches to the corresponding cache levels until only one valid MSHR remains; if no MSHR is available, waiting for the next access request with the same page address.
In some embodiments of the first aspect of the present application, the generated bit vectors are respectively merged into the counter vectors corresponding to the dual mode table, the merging including: merging the bit vector access pattern captured by the accumulation table and the filter table with the counter vector having the same index; specifically, each bit of the bit vector is accumulated into the corresponding counter of the counter vector.
In some embodiments of the first aspect of the present application, the merging process includes: cyclically shifting the bit vector pattern toward smaller offsets according to the position of the trigger offset to obtain an anchor vector; accumulating the anchor vector into the counter vector offset by offset, where a counter of the counter vector is incremented by 1 if the corresponding bit of the anchor vector is 1; the counter corresponding to the trigger offset is the merge counter, which is incremented by 1 on every merge; when the merge counter saturates, all counter values, including the merge counter itself, are halved; when it is not saturated, accumulation proceeds normally.
In some embodiments of the first aspect of the present application, computing the two prefetch target vectors by comparing the access frequency of each physical address offset with the prefetch thresholds includes: computing, one by one, the ratio of the counter value of each offset other than the trigger offset to the merge counter value, to obtain the access frequency of that offset; comparing the access frequency with an L1-level expectation threshold, and if the L1-level expectation threshold is exceeded, predicting that the offset is prefetched into the L1 data cache, otherwise proceeding to the next step; comparing the access frequency with an L2-level expectation threshold, and if the L2-level expectation threshold is exceeded, predicting that the offset is prefetched into the L2 data cache, otherwise not predicting the offset to be prefetched.
In some embodiments of the first aspect of the present application, arbitrating the two prefetch target vectors from the dual mode table to generate the final prefetch target vector includes: if the same offset is predicted to be prefetched into the L1 cache in both prefetch target vectors, prefetching it into the L1 cache; otherwise, if both prefetch target vectors contain a prediction but at least one of them targets the L2 cache, prefetching into the L2 cache; otherwise, if the offset mode table predicts no prefetch, not prefetching the corresponding offset; otherwise, if the program count mode table predicts no prefetch, downgrading the prediction of the offset mode table, e.g., from the L1 level to the L2 level. The arbitration then ends.
In some embodiments of the first aspect of the present application, the dual mode table is composed of an offset mode table and a program count mode table; the offset mode table is a mode table indexed by the trigger offset that stores the merged counter vectors; the program count mode table is a mode table indexed by the program counter that stores the merged counter vectors.
To achieve the above and other related objects, a second aspect of the present application provides a cache data prefetching system based on a merged bit vector access mode, including: a filter table, an accumulation table, an offset mode table, a program count mode table, and a prefetch buffer; wherein the offset mode table and the program count mode table constitute the dual mode table.
As described above, the cache data prefetching method and system based on the merged bit vector access mode have the following beneficial effects:
(1) The invention merges bit vector patterns using counter vectors, which reduces metadata storage overhead, extracts the commonality among bit vectors, effectively improves coverage, lowers the cache miss rate, and improves overall performance.
(2) Compared with the prior art, which predicts prefetch targets using only index and tag matching, the invention greatly improves prediction accuracy.
(3) The invention abandons indexing by combinations of multiple features, so the index state space no longer explodes and hardware overhead is reduced; the access patterns corresponding to different feature indexes are stored in the dual mode table, which improves performance while keeping overall overhead under control.
Drawings
FIG. 1 is a flow chart illustrating a method for prefetching cache data based on a merge bit vector access mode according to an embodiment of the present application.
Fig. 2 is a flowchart illustrating a program access mode learning method according to an embodiment of the present application.
FIG. 3 is a diagram showing how bit vectors are respectively merged into the counter vectors corresponding to the dual mode table according to one embodiment of the present application.
Fig. 4 is a schematic diagram of a mode prediction method according to an embodiment of the present application.
FIG. 5 is a schematic diagram of a cache data prefetch system based on a merge bit vector access mode according to an embodiment of the present application.
Detailed Description
Other advantages and effects of the present application will become apparent to those skilled in the art from the present disclosure, when the following description of the embodiments is taken in conjunction with the accompanying drawings. The present application may be embodied or carried out in other specific embodiments, and the details of the present application may be modified or changed from various points of view and applications without departing from the spirit of the present application. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict.
It is noted that in the following description, reference is made to the accompanying drawings, which describe several embodiments of the present application. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present application. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present application is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. Spatially relative terms, such as "upper," "lower," "left," "right," and the like, may be used herein to facilitate the description of one element or feature relative to another element or feature as illustrated in the figures.
In this application, unless expressly stated and limited otherwise, the terms "mounted," "connected," "secured," "held," and the like are to be construed broadly: for example, a connection may be fixed, detachable, or integral; mechanical or electrical; direct, indirect through an intermediate medium, or internal communication between two elements. The specific meanings of these terms in this application will be understood by those of ordinary skill in the art as the case may be.
Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, operations, elements, components, items, categories, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions, or operations is in some way inherently mutually exclusive.
In order to solve the problems in the background art, the present invention provides a cache data prefetching method and system based on a merged bit vector access mode. The aim is to improve how the cache data prefetcher organizes access patterns: counter vectors replace bit vectors, and similar bit vector access patterns are merged, which further improves performance while greatly reducing hardware overhead.
In order to make the objects, technical solutions and advantages of the present invention more apparent, further detailed description of the technical solutions in the embodiments of the present invention will be given by the following examples with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Before explaining the present invention in further detail, the terms involved in the embodiments of the present invention are explained; the following explanations apply throughout the embodiments:
(1) Cache prefetching: a technique used by computer processors to improve execution performance, in which instructions or data are fetched from their original storage in slower memory into faster local memory before they are actually needed. Most modern computer processors have fast local caches in which prefetched data is kept until it is required. The source of a prefetch operation is usually main memory. Because of its design, cache memory is typically much faster to access than main memory, so prefetching data and then accessing it from the cache is usually orders of magnitude faster than accessing it directly from main memory. Prefetching can be performed with non-blocking cache control instructions.
(2) Bit vector: a vector consisting of binary components; a bit vector can store Boolean variables using very little memory.
(3) L1 cache: the first-level cache, integrated into the CPU, used to temporarily store data while the CPU processes it. Because the cached instructions and data operate at the same frequency as the CPU, a larger L1 cache stores more information, reduces the number of data exchanges between the CPU and memory, and improves CPU efficiency.
The embodiments of the present invention provide a cache data prefetching method based on a merged bit vector access mode and a corresponding system. An exemplary implementation scenario of the cache data prefetching method is described below.
Fig. 1 is a flow chart of the cache data prefetching method based on the merged bit vector access mode according to an embodiment of the present invention. The method mainly comprises the following steps:
step S11: in the data cache, collecting target physical addresses and program counts of CPU data loading requests by an accumulation table and a filtering table, and respectively recording access history in a physical area with a preset size and generating the access history into a bit vector form; each bit vector is used for recording access history in a time interval from the first access in a certain physical area to the first data cache rejection; the generated bit vector addresses are respectively incorporated into counter vectors corresponding to the dual mode table.
It should be noted that a bit vector access pattern consists of 64 bits, where each bit indicates whether the 64-byte cache line at the corresponding offset within a 4KB physical page has been accessed.
A counter vector consists of 64 counters, each representing the confidence of the cache line at the corresponding offset within a 4KB physical page.
Further, the confidence is the number of times a record has recurred. The confidence has an upper limit; once a counter reaches the upper limit, the counters of the same counter vector are halved.
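As a minimal illustration, the two structures can be modeled as follows (a sketch; the type names are assumptions, and the 5-bit limit of 31 is taken from the mode table description later in this text):

```cpp
#include <array>
#include <cstdint>

// Bit vector access pattern: one bit per 64-byte cache line in a 4KB page.
using BitVector = std::array<bool, 64>;

// Counter vector: one confidence counter per cache-line offset.
struct CounterVector {
    static constexpr uint8_t kMax = 31;    // upper limit of the confidence
    std::array<uint8_t, 64> confidence{};  // all counters start at zero

    // Saturation-halving rule: once a counter reaches the upper limit,
    // every counter of the vector is halved.
    void halveAll() {
        for (auto& c : confidence) c >>= 1;
    }
};
```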
In this embodiment, the filter table contains a compressed program counter (PC), a physical page address, and a trigger offset, and is indexed by the physical page address.
Specifically, the filter table is an 8-way set-associative cache with 64 entries, containing data obtained directly from the access sequence; each entry contains the following fields: page tag (33 bits), PC tag (5 bits), trigger offset (6 bits), and LRU state (3 bits). The PC tag is obtained by hashing the original PC, and the page tag is obtained by taking the high-order bits of the physical address.
The filter table is indexed by the low-order bits of the physical page address and stores the high-order bits as the page tag. When the page tag matches an entry, the table further checks whether the current offset equals the recorded trigger offset; if the two offsets differ, they are organized into a 64-bit vector, and the page tag, PC, trigger offset, and bit vector are dumped into the accumulation table; otherwise no update is performed. When no entry matches the page tag, a new record is inserted; when replacement is needed, the filter table selects the oldest record according to the classical LRU replacement algorithm.
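A minimal sketch of this lookup-and-update flow follows, assuming an organization of 8 sets by 8 ways for the 8-way, 64-entry table; all identifiers are assumptions, dumpToAccumulation is a stub for the promotion into the accumulation table, whether the source entry is invalidated on promotion is also an assumption, and LRU age maintenance is omitted for brevity:

```cpp
#include <cstdint>

constexpr int kSets = 8, kWays = 8;  // 8-way set-associative, 64 entries

struct FilterEntry {
    bool     valid = false;
    uint64_t pageTag = 0;        // high-order bits of the physical page address
    uint8_t  pcTag = 0;          // hashed PC (5 bits in the text)
    uint8_t  triggerOffset = 0;  // 6 bits in the text
    uint8_t  lruAge = 0;         // 3-bit LRU state
};

FilterEntry filterTable[kSets][kWays];

// Stub for dumping a promoted record into the accumulation table.
void dumpToAccumulation(uint64_t pageTag, uint8_t pcTag,
                        uint8_t triggerOffset, uint64_t bitVector) {}

void filterTableAccess(uint64_t pageAddr, uint8_t pcTag, uint8_t offset) {
    uint64_t set = pageAddr % kSets;  // index with the low page-address bits
    uint64_t tag = pageAddr / kSets;  // store the high bits as the page tag
    for (FilterEntry& e : filterTable[set]) {
        if (e.valid && e.pageTag == tag) {
            if (offset == e.triggerOffset) return;  // same offset: no update
            // Two distinct offsets observed: organize them as a 64-bit
            // vector and dump the record into the accumulation table.
            uint64_t bv = (1ULL << e.triggerOffset) | (1ULL << offset);
            dumpToAccumulation(tag, e.pcTag, e.triggerOffset, bv);
            e.valid = false;
            return;
        }
    }
    // No match: allocate an entry, preferring an invalid slot, otherwise
    // evicting the oldest record per the classical LRU policy.
    FilterEntry* victim = &filterTable[set][0];
    for (FilterEntry& e : filterTable[set]) {
        if (!e.valid) { victim = &e; break; }
        if (e.lruAge > victim->lruAge) victim = &e;
    }
    *victim = FilterEntry{true, tag, pcTag, offset, 0};
}
```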
In this embodiment, the accumulation table contains the compressed program counter (PC), the physical page address, the trigger offset, and a bit vector access pattern composed of multiple past accesses, and is indexed by the physical page address. The trigger offset is the address offset corresponding to the trigger access request.
Specifically, the accumulation table is a 16-way set-associative cache with 32 entries, containing data obtained directly from the access sequence; each entry contains the following fields: page tag (35 bits), PC tag (5 bits), trigger offset (6 bits), bit vector access pattern (64 bits), and LRU state (4 bits), for a total of 114 bits. The PC tag is obtained by hashing the original PC, and the page tag by taking the high-order bits of the physical address. The bit vector access pattern records whether each 64-byte cache line within the 4KB address space corresponding to the page tag has been accessed: when a cache line is accessed, the corresponding bit is set to 1, and the pattern is updated continuously during recording.
The accumulation table is indexed by the low-order bits of the physical page address and stores the high-order bits as the page tag. When a page tag matches, the associated bit vector access pattern is updated; if there is no match, the filter table is queried and updated instead. When replacement is needed, the accumulation table selects the oldest record according to the classical LRU replacement algorithm.
It should be appreciated that the LRU (Least Recently Used) replacement algorithm evicts the entry that has gone unused for the longest time: each entry carries an access field recording the time t elapsed since it was last accessed, and when an entry must be evicted, the entry with the largest t value, i.e., the least recently used one, is selected.
In the present embodiment, the dual mode table is composed of an offset mode table and a program count mode table. The offset mode table is indexed by the trigger offset and stores the merged counter vectors; the program count mode table, also called the PC mode table, is indexed by the program counter (PC) and stores the merged counter vectors.
The offset mode table is a 64-entry direct-mapped table indexed by the trigger offset, containing the merged bit vector access pattern information; each entry consists of one field: a counter vector (320 bits). Each counter vector contains 64 counters of 5 bits each, with a saturation upper limit of 31. Data is updated by saturation halving: when the merge counter (the counter corresponding to the trigger offset) saturates, all counter values are halved. Entries are never replaced, so no replacement algorithm is needed.
The program count mode table is a 32-entry direct-mapped table indexed by the PC, containing the merged bit vector access pattern information; each entry consists of one field: a counter vector (160 bits). Each counter vector contains 32 counters of 5 bits each, each counter covering the access status of two cache lines, with a saturation upper limit of 31. Data is updated by saturation halving: when the merge counter (the counter corresponding to the trigger offset) saturates, all counter values are halved. Entries are never replaced, so no replacement algorithm is needed.
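Both tables can be pictured as plain direct-mapped arrays of counter vectors, with no tags and no replacement; a sketch (names are assumptions, and the 5-bit counters are stored as bytes for simplicity):

```cpp
#include <array>
#include <cstdint>

// Offset mode table: 64 direct-mapped entries indexed by trigger offset,
// each entry a counter vector of 64 five-bit saturating counters.
struct OffsetModeTable {
    std::array<std::array<uint8_t, 64>, 64> vectors{};
    std::array<uint8_t, 64>& lookup(uint8_t triggerOffset) {
        return vectors[triggerOffset & 63];  // direct-mapped, no tags
    }
};

// Program count mode table: 32 direct-mapped entries indexed by hashed PC,
// each entry a counter vector of 32 counters, one per pair of cache lines.
struct PcModeTable {
    std::array<std::array<uint8_t, 32>, 32> vectors{};
    std::array<uint8_t, 32>& lookup(uint32_t pcHash) {
        return vectors[pcHash & 31];         // direct-mapped, no tags
    }
};
```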
In this embodiment, the access pattern learning of step S11 is illustrated with the L1 data cache as an example: in the L1 data cache, the accumulation table and the filter table collect the target physical addresses and program counters of CPU data load requests and record the access history within the different physical 4KB page regions in the form of bit vectors. Each bit vector records the access history in the interval from the first access in a physical page region to the first eviction from the data cache, and the generated bit vectors are respectively merged into the counter vectors corresponding to the dual mode table.
To aid understanding, the program access pattern learning process of fig. 2 is explained further. The flow enclosed by the left dashed box in fig. 2 is the program access pattern learning flow; in each iteration, the embodiment of the invention learns from the load information of the L1 cache (comprising the physical address and the program counter PC) and from the eviction information of the L1 cache. The specific learning steps are as follows:
Start, entering the first branch flow:
A1. input the current access address and PC;
A2. query the accumulation table and the filter table;
A3. judge whether the accumulation table and the filter table lack a corresponding history record;
A4. if a record exists, update the accumulation table and the filter table, record the access address and PC, update the bit vector pattern, and jump to A6;
A5. if no record exists, enter the program access pattern prefetching flow;
A6. end.
Start, entering the second branch flow:
B1. input the address corresponding to the data currently evicted from the cache;
B2. query the accumulation table and the filter table to obtain the bit vector access pattern, trigger offset, and PC corresponding to that address;
B3. query the existing records in the dual mode table by the trigger offset and the PC, respectively;
B4. merge the current bit vector access pattern into the existing records and update the counter vectors;
B5. end.
In this embodiment, the generated bit vectors are respectively merged into the counter vectors corresponding to the dual mode table. The merging combines the bit vector access pattern captured by the accumulation table and the filter table with the counter vector that has the same index; specifically, each bit of the bit vector is accumulated into the corresponding counter of the counter vector.
FIG. 3 is a schematic diagram showing how the bit vectors are respectively merged into the counter vectors corresponding to the dual mode table according to an embodiment of the present invention. The process is as follows:
First, the bit vector pattern is cyclically shifted toward smaller offsets according to the position of the trigger offset, yielding the anchor vector.
Assume the current trigger offset position is t and the bit vector pattern is V, where each bit is denoted v_i (i = 0, 1, ..., 63), so that V = {v_0, v_1, v_2, ..., v_t, ..., v_63}.
The anchor vector obtained after the shift is V' = {v_t, v_{t+1}, ..., v_63, v_0, ..., v_{t-1}}. For records merged into the program count mode table, an OR operation is additionally performed on every two adjacent offset bits, yielding a coarse-grained anchor bit vector of length 32.
Second, the anchor vector is accumulated into the counter vector offset by offset: if a bit of the anchor vector is 1, the corresponding counter of the counter vector is incremented by 1. The counter corresponding to the trigger offset is called the merge counter; since the trigger offset always corresponds to the first counter after anchoring, it is incremented by 1 on every merge.
Finally, when the merge counter saturates, all counter values, including the merge counter itself, are halved; when it is not saturated, accumulation proceeds normally.
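Putting the three steps together, a sketch of the merge for the offset mode table's 64-counter vectors (function and constant names are assumptions):

```cpp
#include <array>
#include <cstdint>

constexpr uint8_t kMax = 31;  // 5-bit saturating counter limit

void mergePattern(std::array<uint8_t, 64>& counters,
                  uint64_t bitVector, unsigned triggerOffset) {
    // 1. Cyclic shift toward smaller offsets so that the trigger offset
    //    lands at bit 0 of the anchor vector.
    uint64_t anchored = triggerOffset == 0
        ? bitVector
        : (bitVector >> triggerOffset) | (bitVector << (64 - triggerOffset));

    // 2. Accumulate: add 1 to counter i for every anchor bit that is 1.
    //    Counter 0 is the merge counter; its bit is always 1, so it is
    //    incremented on every merge.
    for (unsigned i = 0; i < 64; ++i)
        if (((anchored >> i) & 1) != 0 && counters[i] < kMax)
            ++counters[i];

    // 3. When the merge counter saturates, halve every counter, itself
    //    included; otherwise accumulation proceeds normally.
    if (counters[0] >= kMax)
        for (auto& c : counters) c >>= 1;
}
```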
Step S12: access requests recorded in neither the accumulation table nor the filter table are treated as trigger access requests; when a trigger access request is identified, the corresponding counter vectors are queried in the dual mode table, using the physical address offset and the program counter as indexes; if counter vectors are obtained, two prefetch target vectors are computed by comparing the access frequency of each physical address offset with the prefetch thresholds.
It should be noted that a trigger access request is a request found in neither the filter table nor the accumulation table; correspondingly, the trigger offset is the address offset corresponding to the trigger access request.
The access frequency of a physical address offset is the confidence value of that offset in the counter vector divided by the total number of merges over the recent period. The total number of merges equals the confidence corresponding to the trigger offset, because the bit at the trigger offset is always 1 whenever a bit vector is merged.
The prefetch threshold is a fixed value used to decide whether to prefetch and into which level; there may be multiple prefetch thresholds for different prefetch cache levels.
The prefetch target vector is the vector of target cache levels obtained after comparing the access frequency of each offset with the prefetch thresholds.
The prefetch buffer is a fully associative table, indexed by physical page address, in which the prefetch target vectors awaiting prefetch are stored. For example, the prefetch buffer may be a 16-entry fully associative buffer containing the arbitrated predictions of the dual mode table, with each entry comprising the following fields: page tag (36 bits), prefetch target vector (126 bits), and LRU state (4 bits).
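A sketch of one such entry follows; how the 126-bit target vector is packed is not specified in the text, so the layout assumed here (63 non-trigger offsets with 2 bits of target level each, 126 bits in total) is only a plausible reading:

```cpp
#include <cstdint>

// One entry of the 16-entry fully associative prefetch buffer.
struct PrefetchBufferEntry {
    bool     valid = false;
    uint64_t pageTag = 0;   // 36 bits in the text
    uint64_t targetLo = 0;  // low half of the 126-bit prefetch target vector
    uint64_t targetHi = 0;  // high half; 2-bit level per offset (assumed)
    uint8_t  lruAge = 0;    // 4-bit LRU state
};
```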
In this embodiment, the access pattern prediction and prefetch process of step S12 follows the flow enclosed by the right dashed box in fig. 2:
C1. query the offset mode table and the program count mode table with the trigger offset and the program counter PC from the trigger access request information, obtaining two counter vectors.
C2. check offset by offset whether the frequencies in the two counter vectors exceed the frequency thresholds.
C3. generate the two prefetch target vectors.
C4. arbitrate the two prefetch target vectors to obtain the final, arbitrated prefetch target vector.
C5. compute the prefetch target addresses one by one from the trigger address and the offsets in the final, arbitrated prefetch target vector.
C6. output the prefetch addresses.
C7. end.
In this embodiment, for ease of understanding, fig. 4 shows a schematic diagram of the pattern prediction method in an embodiment of the present invention. The two prefetch target vectors are computed from the comparison of the access frequency of each physical address offset with the prefetch thresholds as follows:
step S121: the ratio of the counter value corresponding to the other offsets except the trigger offset to the combined counter value is calculated one by one to obtain the access frequency of the offset.
Step S122: the access frequency is compared with the L1-level expectation threshold; if it exceeds the L1-level expectation threshold, the offset is predicted to be prefetched into the L1 data cache; otherwise, the next step is performed.
For example, the L1-level expectation threshold may be set manually to 50%; if the access frequency exceeds 50%, the offset is predicted to be prefetched into the L1 data cache.
Step S123: the access frequency is compared with the L2-level expectation threshold; if it exceeds the L2-level expectation threshold, the offset is predicted to be prefetched into the L2 data cache; otherwise, the offset is not predicted to be prefetched.
For example, the L2-level expectation threshold may be set to 15%; if the access frequency exceeds 15%, the offset is predicted to be prefetched into the L2 data cache.
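Steps S121 to S123 amount to a per-offset frequency test. A sketch using the example thresholds of 50% and 15% (the Target enum and all names are assumptions):

```cpp
#include <array>
#include <cstdint>

enum class Target : uint8_t { None = 0, L2, L1 };

std::array<Target, 64> predict(const std::array<uint8_t, 64>& counters) {
    std::array<Target, 64> targets{};   // value-initialized to Target::None
    const double merges = counters[0];  // merge counter = total merge count
    if (merges == 0) return targets;
    for (unsigned i = 1; i < 64; ++i) {              // skip the trigger offset
        const double freq = counters[i] / merges;    // S121: access frequency
        if (freq > 0.50)      targets[i] = Target::L1;  // S122: L1 threshold
        else if (freq > 0.15) targets[i] = Target::L2;  // S123: L2 threshold
    }                                   // below both thresholds: no prefetch
    return targets;
}
```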
Step S13: the two prefetch target vectors from the dual mode table are arbitrated to generate the final prefetch target vector.
In this embodiment, the arbitration process is as follows: if the same offset is predicted to be prefetched into the L1 cache in both prefetch target vectors, it is prefetched into the L1 cache; otherwise, if both prefetch target vectors contain a prediction but at least one of them targets the L2 cache, it is prefetched into the L2 cache; otherwise, if the offset mode table predicts no prefetch, the corresponding offset is not prefetched; otherwise, if the program count mode table predicts no prefetch, the prediction of the offset mode table is downgraded, e.g., from the L1 level to the L2 level. The arbitration then ends.
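Continuing with the Target enum from the sketch above, the four arbitration rules reduce to a small per-offset decision function (a sketch reflecting one reading of the downgrade rule):

```cpp
// off: prediction from the offset mode table for one offset;
// pc:  prediction from the program count mode table for the same offset.
Target arbitrate(Target off, Target pc) {
    if (off == Target::L1 && pc == Target::L1)
        return Target::L1;   // both predict L1: prefetch into the L1 cache
    if (off != Target::None && pc != Target::None)
        return Target::L2;   // both predict, but an L2 prediction exists
    if (off == Target::None)
        return Target::None; // the offset mode table predicts no prefetch
    return Target::L2;       // only the PC table predicts no prefetch:
                             // downgrade, e.g., L1 level to L2 level
}
```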
Step S14: the final, arbitrated prefetch target vector is stored into the prefetch buffer; the prefetcher queries whether at least one miss status handling register (MSHR) is available; if an MSHR is available, prefetch target addresses are assembled from the offsets closest to the trigger offset and the page address of the current trigger access request, and prefetches are issued to the corresponding levels until only one valid MSHR remains; if no MSHR is available, the prefetcher waits for the next access request with the same page address.
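A sketch of this issue loop, reusing the Target enum from the prediction sketch and visiting offsets in order of distance from the trigger offset (one reading of "closest to the trigger offset"); freeMshrs and issuePrefetch are assumed stubs for the hooks into the cache hierarchy:

```cpp
#include <array>
#include <cstdint>

int  freeMshrs() { return 4; }                     // stub: available MSHRs
void issuePrefetch(uint64_t addr, Target level) {} // stub: enqueue a prefetch

void issueFromVector(uint64_t pageAddr, unsigned triggerOffset,
                     const std::array<Target, 64>& targets) {
    for (int d = 1; d < 64; ++d) {        // distance from the trigger offset
        for (int off : {int(triggerOffset) - d, int(triggerOffset) + d}) {
            if (off < 0 || off > 63 || targets[off] == Target::None)
                continue;
            if (freeMshrs() <= 1) return; // stop with one valid MSHR left
            // Assemble the target address from the page address and the
            // offset (64-byte cache lines), then issue to the target level.
            issuePrefetch(pageAddr + uint64_t(off) * 64, targets[off]);
        }
    }
}
```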
To aid understanding, the embodiment of the invention verifies the merit of the technical solution through experiments.
The experimental environment uses an Intel 10-core processor, 64 GB of memory, and a 1 TB hard disk; the operating system is Ubuntu 20.04. Hardware architecture simulation is performed with the ChampSim simulator, configured as shown in Table 1. In addition, 125 instruction streams from SPEC 2006, SPEC 2017, PARSEC, and Ligra are used as experimental loads.
TABLE 1 simulator configuration table
(The configuration values of Table 1 appear as images in the original publication and are not reproduced here.)
Experimental results show that the invention achieves a 65.2% performance improvement when added to a CPU without a prefetcher.
The above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it; those skilled in the art may modify or substitute the technical solution of the present invention without departing from its spirit and scope, and the protection scope of the present invention shall be defined by the claims.
To summarize, the cache data prefetching method based on the merged bit vector access mode provided by the invention caches the current access information of the data to be prefetched, organizes the access information into an access pattern in bit vector form, and stores the corresponding program counter (PC) information and the trigger physical address; it updates the counter vectors of the two corresponding access mode tables with the current bit vector access pattern according to the offset of the trigger physical address; when the trigger physical address reappears, it queries the counter vectors stored in the dual mode table using the offset of the trigger physical address; it computes the respective prefetch target offset vectors of the dual mode table from the queried counter vectors and arbitrates to obtain the final prefetch target offset vector; and it computes target addresses from the prefetch target offset vector and the current access address and prefetches them in turn. The invention solves the prior-art problem of requiring huge hardware storage for different complex access patterns; by exploiting the commonality among access patterns, it improves overall performance and greatly reduces hardware overhead.
Fig. 5 is a schematic diagram of a cache data prefetching system based on the merged bit vector access mode according to an embodiment of the present invention. The system in the embodiment of the invention comprises a filter table, an accumulation table, an offset mode table, a program count mode table, and a prefetch buffer, wherein the offset mode table and the program count mode table constitute the dual mode table. In this system:
In the data cache, the accumulation table and the filter table collect the target physical addresses and program counters of CPU data load requests, record the access history within physical regions of a preset size, and organize each history into bit vector form; each bit vector records the access history in the interval from the first access in a physical region to the first eviction from the data cache; the generated bit vectors are respectively merged into the counter vectors corresponding to the dual mode table.
Access requests recorded in neither the accumulation table nor the filter table are treated as trigger access requests; when a trigger access request is identified, the corresponding counter vectors are queried in the dual mode table, using the physical address offset and the program counter as indexes; if counter vectors are obtained, two prefetch target vectors are computed by comparing the access frequency of each physical address offset with the prefetch thresholds.
The two prefetch target vectors from the dual mode table are arbitrated to generate the final prefetch target vector.
The final, arbitrated prefetch target vector is stored into the prefetch buffer; the system queries whether at least one miss status handling register (MSHR) is available; if an MSHR is available, prefetch target addresses are assembled from the offsets closest to the trigger offset and the page address of the current trigger access request, and prefetches are issued to the corresponding levels until only one valid MSHR remains; if no MSHR is available, the system waits for the next access request with the same page address.
It should be noted that the implementation of the cache data prefetching system based on the merged bit vector access mode in the embodiment of the present invention is similar to that of the cache data prefetching method described above, so the description is not repeated here.
In summary, the present application provides a cache data prefetching method and system based on a merged bit vector access mode, which merge bit vector patterns using counter vectors, thereby reducing metadata storage overhead, extracting the commonality among bit vectors, effectively improving coverage, lowering the cache miss rate, and improving overall performance. Compared with the prior art, which predicts prefetch targets using only index and tag matching, prediction accuracy is greatly improved. The invention abandons indexing by combinations of multiple features, so the index state space no longer explodes and hardware overhead is reduced; the access patterns corresponding to different feature indexes are stored in the dual mode table, which improves performance while keeping overall overhead under control. The application therefore effectively overcomes various defects of the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles of the present application and their effects, and are not intended to limit the application. Those of ordinary skill in the art may modify or vary the above embodiments without departing from the spirit and scope of the present application. Accordingly, all equivalent modifications and variations accomplished by persons skilled in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of this application.

Claims (7)

1. A cache data prefetching method based on a merged bit vector access mode, characterized by comprising the following steps:
in the data cache, collecting the target physical addresses and program counters of CPU data load requests with an accumulation table and a filter table, recording the access history within physical regions of a preset size, and organizing each history into bit vector form; each bit vector records the access history in the interval from the first access in a physical region to the first eviction from the data cache; merging the generated bit vectors into the counter vectors corresponding to the dual mode table;
treating access requests recorded in neither the accumulation table nor the filter table as trigger access requests; when a trigger access request is identified, querying the corresponding counter vectors in the dual mode table, using the physical address offset and the program counter as indexes; if counter vectors are obtained, computing two prefetch target vectors by comparing the access frequency of each physical address offset with the prefetch thresholds;
arbitrating the two prefetch target vectors from the dual mode table to generate a final prefetch target vector;
storing the final prefetch target vector obtained by arbitration into the prefetch buffer; querying whether at least one miss status handling register (MSHR) is available; if an MSHR is available, assembling prefetch target addresses from the offsets closest to the trigger offset and the page address of the current trigger access request, and issuing prefetches to the corresponding cache levels until only one valid MSHR remains; if no MSHR is available, waiting for the next access request with the same page address.
2. The cache data prefetching method based on the merged bit vector access mode according to claim 1, wherein the generated bit vectors are respectively merged into the counter vectors corresponding to the dual mode table, the merging comprising: merging the bit vector access pattern captured by the accumulation table and the filter table with the counter vector having the same index; specifically, accumulating each bit of the bit vector into the corresponding counter of the counter vector.
3. The cache data prefetching method based on the merged bit vector access mode according to claim 2, wherein the merging process comprises:
cyclically shifting the bit vector pattern toward smaller offsets according to the position of the trigger offset to obtain an anchor vector;
accumulating the anchor vector into the counter vector offset by offset: if the corresponding bit of the anchor vector is 1, adding 1 to the corresponding counter of the counter vector; the counter corresponding to the trigger offset being the merge counter, which is incremented by 1 on every merge;
when the merge counter saturates, halving all counter values including the merge counter; accumulating normally when it is not saturated.
4. The cache data prefetching method based on the merged bit vector access mode according to claim 1, wherein the process of comparing the access frequency of each physical address offset with the prefetch thresholds to obtain the two prefetch target vectors comprises:
computing, one by one, the ratio of the counter value of each offset other than the trigger offset to the merge counter value, to obtain the access frequency of that offset;
comparing the access frequency with an L1-level expectation threshold; if the L1-level expectation threshold is exceeded, predicting that the offset is prefetched into the L1 data cache; otherwise performing the next step;
comparing the access frequency with an L2-level expectation threshold; if the L2-level expectation threshold is exceeded, predicting that the offset is prefetched into the L2 data cache; otherwise not predicting the offset to be prefetched.
5. The method of claim 1, wherein arbitrating the two prefetch target vectors from the dual mode table to generate the final prefetch target vector comprises:
if the same offset is predicted to be prefetched into the L1 cache in both prefetch target vectors, prefetching it into the L1 cache; otherwise, if both prefetch target vectors contain a prediction but at least one of them targets the L2 cache, prefetching into the L2 cache; otherwise, if the offset mode table predicts no prefetch, not prefetching the corresponding offset; otherwise, if the program count mode table predicts no prefetch, downgrading the prediction of the offset mode table, e.g., from the L1 level to the L2 level; the arbitration then ends.
6. The cache data prefetching method based on the merged bit vector access mode according to claim 1, wherein the dual mode table is composed of an offset mode table and a program count mode table; the offset mode table is a mode table indexed by the trigger offset that stores the merged counter vectors; the program count mode table is a mode table indexed by the program counter that stores the merged counter vectors.
7. A cache data prefetching system based on a merged bit vector access mode, characterized by comprising: a filter table, an accumulation table, an offset mode table, a program count mode table, and a prefetch buffer; the offset mode table and the program count mode table constituting the dual mode table; wherein:
in the data cache, the accumulation table and the filter table collect the target physical addresses and program counters of CPU data load requests, record the access history within physical regions of a preset size, and organize each history into bit vector form; each bit vector records the access history in the interval from the first access in a physical region to the first eviction from the data cache; the generated bit vectors are merged into the counter vectors corresponding to the dual mode table;
access requests recorded in neither the accumulation table nor the filter table are treated as trigger access requests; when a trigger access request is identified, the corresponding counter vectors are queried in the dual mode table, using the physical address offset and the program counter as indexes; if counter vectors are obtained, two prefetch target vectors are computed by comparing the access frequency of each physical address offset with the prefetch thresholds;
the two prefetch target vectors from the dual mode table are arbitrated to generate a final prefetch target vector;
the final prefetch target vector obtained by arbitration is stored into the prefetch buffer; the system queries whether at least one miss status handling register (MSHR) is available; if an MSHR is available, prefetch target addresses are assembled from the offsets closest to the trigger offset and the page address of the current trigger access request, and prefetches are issued to the corresponding levels until only one valid MSHR remains; if no MSHR is available, the system waits for the next access request with the same page address.
CN202211716885.6A 2022-12-29 2022-12-29 Cache data prefetching method and system based on merging bit vector access mode Pending CN116383100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211716885.6A CN116383100A (en) 2022-12-29 2022-12-29 Cache data prefetching method and system based on merging bit vector access mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211716885.6A CN116383100A (en) 2022-12-29 2022-12-29 Cache data prefetching method and system based on merging bit vector access mode

Publications (1)

Publication Number Publication Date
CN116383100A true CN116383100A (en) 2023-07-04

Family

ID=86977538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211716885.6A Pending CN116383100A (en) 2022-12-29 2022-12-29 Cache data prefetching method and system based on merging bit vector access mode

Country Status (1)

Country Link
CN (1) CN116383100A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117573574A (en) * 2024-01-15 2024-02-20 北京开源芯片研究院 Prefetching method and device, electronic equipment and readable storage medium
CN117573574B (en) * 2024-01-15 2024-04-05 北京开源芯片研究院 Prefetching method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
Bakhshalipour et al. Bingo spatial data prefetcher
CN108496160B (en) Adaptive value range profiling for enhanced system performance
CN111344684B (en) Multi-layer cache placement mechanism
US10133678B2 (en) Method and apparatus for memory management
US7284096B2 (en) Systems and methods for data caching
EP3055775B1 (en) Cache replacement policy that considers memory access type
US7454573B2 (en) Cost-conscious pre-emptive cache line displacement and relocation mechanisms
EP3066572B1 (en) Cache memory budgeted by chunks based on memory access type
EP3230874B1 (en) Fully associative cache memory budgeted by memory access type
EP3129890A1 (en) Set associative cache memory with heterogeneous replacement policy
EP3066571A1 (en) Cache memory budgeted by ways on memory access type
Das et al. Reuse distance-based probabilistic cache replacement
CN113986774A (en) Cache replacement system and method based on instruction stream and memory access mode learning
Wu et al. A differentiated caching mechanism to enable primary storage deduplication in clouds
CN113656332A (en) CPU cache data prefetching method based on merged address difference sequence
CN116383100A (en) Cache data prefetching method and system based on merging bit vector access mode
JP2023524642A (en) Filtering micro-ops for micro-op cache in processor
Wang et al. Data cache prefetching with perceptron learning
US7519777B1 (en) Methods, systems and computer program products for concomitant pair prefetching
CN111124949A (en) Coherent address and prefetch
CN115964309A (en) Prefetching
Collins et al. Runtime identification of cache conflict misses: The adaptive miss buffer
US11442863B2 (en) Data processing apparatus and method for generating prefetches
Lo et al. HAPIC: A Scalable, Lightweight and Reactive Cache for Persistent-Memory-Based Index
JP2004118372A (en) Cache mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination