CN114579479A - Low-pollution cache prefetching system and method based on instruction flow mixed mode learning - Google Patents

Low-pollution cache prefetching system and method based on instruction flow mixed mode learning

Info

Publication number
CN114579479A
CN114579479A
Authority
CN
China
Prior art keywords
instruction
address
memory access
access
prediction
Prior art date
Legal status
Pending
Application number
CN202111356734.XA
Other languages
Chinese (zh)
Inventor
王玉庆
杨秋松
李明树
Current Assignee
Shanghai Advanced Research Institute of CAS
Original Assignee
Shanghai Advanced Research Institute of CAS
Priority date
Filing date
Publication date
Application filed by Shanghai Advanced Research Institute of CAS filed Critical Shanghai Advanced Research Institute of CAS
Priority: CN202111356734.XA
Publication: CN114579479A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0862 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, with prefetch
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F 9/3842 Speculative instruction execution
    • G06F 9/3848 Speculative instruction execution using hybrid branch prediction, e.g. selection between prediction techniques
    • G06F 9/3867 Concurrent instruction execution, e.g. pipeline, look ahead, using instruction pipelines
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a low-pollution cache prefetching system based on instruction stream mixed-mode learning. The system comprises a branch prediction module, which predicts the instruction stream and writes it into an instruction fetch address queue; a memory access instruction recording module, which records the information of committed memory access instructions, writes it into a memory access instruction buffer, and queries the buffer to obtain memory access instruction sequences; a memory access pattern learning module, which records each memory access instruction sequence in its memory access history buffer for learning, predicts the access physical address of each memory access instruction, and writes it into a memory access address queue; a prefetch request generation module, which retrieves the instruction fetch address queue and the memory access address queue and generates prefetch requests for the first-level caches; and a prefetch request write-back module, which temporarily stores prefetch requests whose data has been fetched and, according to the commit status of the instructions ahead of them, either waits to write them back or writes them back immediately. The invention also provides a corresponding method. The low-pollution cache prefetching system reduces the uncertainty faced when predicting the memory access sequence and improves the accuracy of prefetch address prediction.

Description

Low-pollution cache prefetching system and method based on instruction flow mixed mode learning
Technical Field
The invention relates to the technical field of computer architecture, and in particular to a low-pollution cache prefetching system and method based on instruction stream mixed-mode learning.
Background
The cache is an important mechanism in modern processors: frequently used data is copied from memory into the cache, and subsequent accesses to that data can be served directly from the cache, reducing the number of accesses to slow DRAM and improving processor performance. Cache capacity is limited, however, so cache misses are unavoidable in practice. Cache prefetching predicts the data a program is about to use and reads it into the cache ahead of time, reducing the number of cache misses.
Cache access behavior is complex: the memory access sequence seen by the cache system may be perturbed by the processor's out-of-order execution. At the same time, memory access instructions that follow different patterns are interleaved with one another, which further increases the difficulty of predicting access behavior inside the cache system. Existing cache prefetching methods cannot effectively cope with the resulting difficulty of predicting the future memory access sequence.
Disclosure of Invention
The invention aims to provide a low-pollution cache prefetching system and method based on instruction stream mixed-mode learning, so as to reduce the uncertainty faced when predicting the memory access sequence and improve the accuracy of prefetch address prediction.
To this end, the invention provides a low-pollution cache prefetching system based on instruction stream mixed-mode learning, comprising a main pipeline and a branch prediction sub-pipeline, wherein the branch prediction sub-pipeline comprises a branch prediction module, an instruction fetch address queue, a memory access instruction recording module, a memory access pattern learning module, a memory access address queue, a prefetch request generation module and a prefetch request write-back module. The branch prediction module is configured to predict the instruction stream of the target program using a prediction-ahead technique and to write the prediction results into the instruction fetch address queue. The memory access instruction recording module is configured to record, in order, the information of committed memory access instructions and to write that information into the memory access instruction buffer; in addition, whenever a new entry is written into the instruction fetch address queue, the module queries the memory access instruction buffer with the start address of the instruction block corresponding to that entry, so as to obtain (where the query hits) a memory access instruction sequence, which is output to the memory access pattern learning module and the memory access address queue. The memory access pattern learning module is configured to record the memory access instruction sequence in its memory access history buffer, learn the access pattern of each memory access instruction from the history stored there, predict the access physical address of each memory access instruction in the sequence according to the learned pattern, and write the predicted addresses into the memory access address queue. The prefetch request generation module is configured, whenever the first-level data cache receives a new request, to retrieve the instruction fetch address queue and the memory access address queue, generate prefetch requests for the first-level instruction cache and the first-level data cache for the instruction blocks about to enter the main pipeline, and send them to the cache system, thereby obtaining prefetch requests whose data has been fetched. The prefetch request write-back module is configured to temporarily store these data-carrying prefetch requests in a prefetch queue, so that the request at the head of the queue either waits or is written back to the first-level cache immediately, according to the commit status of the instructions ahead of it.
The branch prediction module is configured to write its prediction results into the instruction fetch address queue at the granularity of instruction blocks, and to perform the following steps:
A1: in each cycle, take the instruction block containing the current prediction address as the current instruction block, and take the current prediction address as the prediction start address of the current instruction block; determine, from this prediction start address, a fixed prediction window beginning at the current instruction block;
A2: with the prediction start address of the current instruction block obtained, search for and predict the branch instructions in the current instruction block to determine whether the current instruction block hits a taken branch instruction;
A3: if the current instruction block does not hit a taken branch instruction, meaning that the block contains no branch instruction or that none of the identified branch instructions is taken, write the information of the current instruction block into the instruction fetch address queue as the prediction result of the branch prediction module; then determine the prediction start address of the next instruction block, make that block the new current instruction block, and return to step A2, until the current instruction block is the last instruction block in the fixed prediction window, at which point the current prediction address is auto-incremented by the number of instruction blocks in the fixed prediction window and the next cycle begins;
otherwise, select, among the taken branch instructions that were hit, the first taken branch instruction (the one with the smallest linear address) to mark the end address of the current instruction block, write the information of the current instruction block into the instruction fetch address queue as the prediction result, and update the current prediction address to the jump address of that first taken branch instruction before entering the next cycle;
alternatively, the branch prediction module is configured to write its prediction results into the instruction fetch address queue at the granularity of instruction blocks and to perform:
Step A1': in each cycle, take the instruction block containing the current prediction address as the current instruction block, and take the current prediction address as its prediction start address; then determine, from this prediction start address, the prediction start addresses of all instruction blocks in a fixed prediction window beginning at the current instruction block;
Step A2': using the prediction start address of each instruction block, search for and predict the branch instructions in each instruction block to determine whether each block hits a taken branch instruction;
Step A3': if no instruction block hits a taken branch instruction, write the information of all instruction blocks, in order, into the instruction fetch address queue as the prediction result of the branch prediction module, and auto-increment the current prediction address by the number of instruction blocks in the fixed prediction window to enter the next cycle;
otherwise, if at least one instruction block hits at least one taken branch instruction, select the first taken branch instruction (the one with the smallest linear address) to mark the end address of the current instruction block, write the information of the current instruction block and of all instruction blocks before it into the instruction fetch address queue as the prediction result, and update the current prediction address to the jump address of that first taken branch instruction to enter the next cycle. A code sketch of the serial per-cycle prediction loop is given below.
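As an illustration, the following C++ sketch walks the serial per-cycle loop of steps A1 to A3. It is a minimal model, not the patent's exact logic: FindFirstTakenBranch and WriteFetchQueueEntry are hypothetical stand-ins for the branch predictor query and the queue write, and the two-block window width is only an example.

```cpp
#include <cstdint>
#include <optional>

constexpr uint64_t kBlockSize       = 64;  // instruction block width in bytes
constexpr int      kBlocksPerWindow = 2;   // fixed prediction window: 2 blocks (example)

struct TakenBranch {
    uint64_t last_byte_offset;  // offset of the branch's last byte within its block
    uint64_t target;            // predicted jump address
};

// Hypothetical predictor query: first taken branch (smallest linear address)
// in [start, block_end], if any.
std::optional<TakenBranch> FindFirstTakenBranch(uint64_t start, uint64_t block_end);

// Hypothetical queue write: one instruction fetch address queue entry per block.
void WriteFetchQueueEntry(uint64_t start, uint64_t end_offset, bool taken);

// One prediction cycle of the serial variant (steps A1-A3). Returns the
// current prediction address for the next cycle.
uint64_t PredictOneCycle(uint64_t current_pred_addr) {
    uint64_t start = current_pred_addr;          // A1: first block may be unaligned
    for (int i = 0; i < kBlocksPerWindow; ++i) {
        uint64_t block_base = start & ~(kBlockSize - 1);
        uint64_t block_end  = block_base + kBlockSize - 1;
        // A2: search for and predict branch instructions in the current block.
        if (auto br = FindFirstTakenBranch(start, block_end)) {
            // A3, taken branch hit: the first taken branch ends the block; redirect.
            WriteFetchQueueEntry(start, br->last_byte_offset, /*taken=*/true);
            return br->target;
        }
        // A3, no taken branch: the whole block falls through.
        WriteFetchQueueEntry(start, kBlockSize - 1, /*taken=*/false);
        start = block_base + kBlockSize;         // aligned start of the next block
    }
    return start;  // auto-increment past the fixed prediction window
}
```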
The main pipeline comprises an instruction fetch unit, a decode unit, an execution unit, a memory access unit and a write-back unit connected in sequence, and in the branch prediction module the bit width of an instruction block equals the bit width used by the instruction fetch unit when fetching instructions;
the information of an instruction block comprises its line linear address, line physical address, start offset, end offset and jump bit. When the information of an instruction block is written into the instruction fetch address queue, the predicted start address of the block is split into a high-order line linear address and a low-order start offset, the TLB is queried with the line linear address to obtain the line physical address, and the resulting line linear address, line physical address and start offset are written into the corresponding fields of the instruction fetch address queue entry. In addition, if the instruction block hits a taken branch instruction, the offset of the last byte of the first taken branch instruction is written into the instruction fetch address queue as the end offset; otherwise the end offset in the instruction fetch address queue is set to 63.
The structure of each entry of the instruction fetch address queue is:
<valid,line_addr,phys_addr,begin_offset,end_offset,taken>,
where valid is the valid bit; line_addr is the line linear address; phys_addr is the line physical address; begin_offset is the start offset; end_offset is the end offset; and taken is the jump bit;
the instruction fetch address queue is also provided with a commit pointer, a read pointer and a write pointer. The commit pointer points to the entry of the instruction block containing the next instruction to be committed; the read pointer points to the entry of the instruction block containing the next instruction to be read; the write pointer points to the position of the branch prediction module's next write. After a flush of the processor's main pipeline, the read and write pointers of the instruction fetch address queue roll back, according to the flush type, either to the position of the branch flush or to the position of the commit pointer;
the structure of each entry of the memory access instruction buffer is:
<LineAddr,PhyAddr,InstType,Memlen>,
where LineAddr is the instruction linear address of the memory access instruction; PhyAddr is the access physical address of the memory access instruction the last time it was executed; InstType is the type of the memory access instruction, InstType ∈ {DirectInst, InDirectInst}, where DirectInst denotes a direct memory access instruction and InDirectInst an indirect one; and Memlen is the access length of the memory access instruction;
the memory access history buffer is an array indexed by the PC (program counter) of the memory access instruction, and each of its entries records the past 12 access physical addresses of the same memory access instruction;
and the structure of each entry of the memory access address queue is:
<valid,inst_line_addr,mem_phys_addr,memlen,inst_queue_index>,
where valid is the valid bit; inst_line_addr is the instruction linear address; mem_phys_addr is the access physical address; memlen is the access length; and inst_queue_index is the index into the instruction fetch address queue;
the memory access address queue is likewise provided with a commit pointer, a read pointer and a write pointer; when a branch flush occurs in the main pipeline, the read and write pointers of the memory access address queue roll back to the position of the branch flush. For concreteness, the entry layouts above are mirrored in the sketch that follows.
The branch prediction module is further configured to pause the prediction process if the instruction fetch address queue is full; whether the queue is full is determined from the combination of its write pointer and commit pointer.
In the memory access instruction recording module, the information of a memory access instruction comprises the PC (program counter) of the instruction, its instruction linear address, its access address, its type, and its access length;
during a query, the memory access instruction recording module performs hit determination using the instruction linear addresses of the memory access instructions in all entries of the memory access instruction buffer, and obtains the position and type (within the buffer) of every memory access instruction that hits; in this way it attempts to obtain the position and type of each memory access instruction by querying all entries of the buffer. A hit occurs when the high-order tag of an instruction linear address in the buffer equals the line linear address of the instruction block corresponding to the new entry, and its low-order offset is greater than or equal to the start offset of the block and less than or equal to the end offset of the block; the position and type of that memory access instruction are then obtained.
The memory access pattern learning module comprises a stride predictor and a temporal correlation predictor, so that, on top of the memory access history buffer, it learns and predicts both stride patterns and temporal correlation patterns;
each entry of the stride predictor records one stride pattern, i.e., one strided memory access sequence, and comprises the tag of the strided access sequence, the last address, the last stride, a confidence counter, the current mode, the first address, a maximum counter and a direction (see the sketch below);
each entry of the temporal correlation predictor records one temporal correlation pattern, i.e., one temporally correlated memory access sequence, and comprises the temporal correlation mode, the length of the correlated sequence and the correlated sequence itself.
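A minimal sketch of how one stride predictor entry could learn and predict, modeling only a subset of the fields listed above (tag, last address, last stride, confidence counter). The confidence-threshold update policy is a common stride scheme and an assumption, not the patent's exact algorithm.

```cpp
#include <cstdint>
#include <optional>

// One stride predictor entry (subset of the fields listed above).
struct StrideEntry {
    uint64_t tag = 0;          // tag of the strided access sequence
    uint64_t last_addr = 0;    // last observed access address
    int64_t  last_stride = 0;  // last observed stride
    int      confidence = 0;   // credible counter

    static constexpr int kThreshold = 2;  // assumed confidence threshold

    // Train on a newly committed access physical address.
    void Update(uint64_t addr) {
        int64_t stride = static_cast<int64_t>(addr) -
                         static_cast<int64_t>(last_addr);
        if (stride == last_stride && confidence < 3) {
            ++confidence;              // same stride again: grow confidence
        } else if (stride != last_stride) {
            confidence  = 0;           // stride changed: restart learning
            last_stride = stride;
        }
        last_addr = addr;
    }

    // Predict the next access address once the stride pattern is stable.
    std::optional<uint64_t> Predict() const {
        if (confidence >= kThreshold)
            return last_addr + static_cast<uint64_t>(last_stride);
        return std::nullopt;
    }
};
```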
The prefetch request generation module determines the instruction blocks about to enter the main pipeline from the read pointer of the instruction fetch address queue, and is configured to perform the following steps (a sketch follows the list):
B1: upon receiving notice that the first-level data cache has received a new request, retrieve the instruction fetch address queue and obtain the position of its read pointer; take the read pointer plus a prefetch lead as the prefetch start, take the N consecutive prefetch instruction blocks beginning at the prefetch start as the instruction blocks about to enter the main pipeline, and generate the corresponding first-level instruction cache prefetch requests for them;
B2: receive the pointers of all prefetch instruction blocks fed back by the instruction fetch address queue, use these pointers to search the instruction-fetch-address-queue indices stored in the memory access address queue, obtain the memory access instructions contained in all prefetch instruction blocks together with their access physical addresses, and generate first-level data cache prefetch requests from those addresses;
B3: send the first-level instruction cache prefetch requests and the first-level data cache prefetch requests to the cache system to obtain prefetch requests whose data has been fetched.
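A sketch of steps B1 to B3, reusing the FetchQueueEntry and AccessQueueEntry structs from the earlier sketch. The prefetch lead and the block count N are assumed values, and the linear scan of the memory access address queue stands in for whatever indexed lookup the hardware would use.

```cpp
#include <cstdint>
#include <vector>

constexpr int kPrefetchLead = 4;  // assumed prefetch lead, in instruction blocks
constexpr int kNumBlocks    = 2;  // assumed N: consecutive prefetch instruction blocks

struct PrefetchRequest {
    uint64_t phys_addr;  // block physical address (B1) or access physical address (B2)
    bool     is_data;    // false: L1 instruction cache, true: L1 data cache
};

// B1-B3: on notice of a new L1-D request, take read pointer + lead as the
// prefetch start, emit L1-I prefetches for the N blocks about to enter the
// main pipeline, then emit L1-D prefetches for the access instructions that
// the memory access address queue records for those blocks.
std::vector<PrefetchRequest> GeneratePrefetches(
        const std::vector<FetchQueueEntry>& fetch_q,
        const std::vector<AccessQueueEntry>& access_q,
        uint16_t read_ptr) {
    std::vector<PrefetchRequest> out;
    for (int i = 0; i < kNumBlocks; ++i) {
        uint16_t idx = (read_ptr + kPrefetchLead + i) % fetch_q.size();
        if (!fetch_q[idx].valid) break;
        out.push_back({fetch_q[idx].phys_addr, /*is_data=*/false});    // B1
        for (const auto& a : access_q)                                 // B2
            if (a.valid && a.inst_queue_index == idx)
                out.push_back({a.mem_phys_addr, /*is_data=*/true});
    }
    return out;  // B3: the caller sends these to the cache system
}
```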
The prefetch request write-back module is configured as follows: for the prefetch request at the head of the prefetch queue, search the memory access address queue for the set of uncommitted instructions after the commit pointer and before the prefetch request, determine whether this set contains other memory access instructions that map to the same set of the first-level cache array as the prefetch request, and treat the result as the commit status of the instructions ahead of the prefetch request. If other memory access instructions in the same cache set exist and their number is at least the number of ways of the cache array, the prefetch request waits before write-back; otherwise it is written back to the first-level cache immediately. A sketch of this check follows.
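A sketch of the same-set conflict test under assumed L1 geometry parameters; the set-index computation shown, (address / line size) mod num_sets, is the usual one and is an assumption, not taken from the patent.

```cpp
#include <cstdint>
#include <vector>

constexpr int kWays = 8;  // assumed associativity of the L1 data cache array

// Low-pollution check for the prefetch request at the head of the prefetch
// queue: count the uncommitted accesses ahead of it (after the commit
// pointer, before this request) that map to the same L1 set; if they could
// fill the whole set, hold the write-back, otherwise write back immediately.
bool MustWaitBeforeWriteBack(uint64_t prefetch_addr,
                             const std::vector<uint64_t>& uncommitted_addrs,
                             int num_sets, int line_size) {
    auto set_index = [&](uint64_t a) { return (a / line_size) % num_sets; };
    int same_set = 0;
    for (uint64_t a : uncommitted_addrs)
        if (set_index(a) == set_index(prefetch_addr)) ++same_set;
    return same_set >= kWays;  // at least `ways` conflicting accesses: wait
}
```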
In another aspect, the invention provides a low-pollution cache prefetching method based on instruction stream mixed-mode learning, comprising the following steps:
S1: use the branch prediction module to predict the instruction stream of the target program with a prediction-ahead technique and write the prediction results into the instruction fetch address queue; when the instruction fetch address queue is full, pause the prediction process;
S2: use the memory access instruction recording module to record the information of committed memory access instructions and write it into the memory access instruction buffer; whenever a new entry is written into the instruction fetch address queue, query the memory access instruction buffer with the start address of the instruction block corresponding to that entry, so as to obtain a memory access instruction sequence and output it to the memory access pattern learning module and the memory access address queue;
S3: use the memory access pattern learning module to record the memory access instruction sequence in the memory access history buffer, learn the access pattern of each memory access instruction from the history stored there, predict the access physical address of each memory access instruction in the sequence according to the learned pattern, and write the predicted addresses into the memory access address queue;
S4: use the prefetch request generation module, whenever the first-level data cache receives a new request, to retrieve the instruction fetch address queue and the memory access address queue, generate prefetch requests for the first-level instruction cache and the first-level data cache for the instruction blocks about to enter the main pipeline, and send them to the cache system to obtain prefetch requests whose data has been fetched;
S5: use the prefetch request write-back module to temporarily store the data-carrying prefetch requests in the prefetch queue, so that the request at the head of the queue either waits or is written back to the first-level cache immediately, according to the commit status of the instructions ahead of it.
In the low-pollution cache prefetching system based on instruction stream mixed-mode learning, the prediction targets are individual memory access instructions, and history is recorded separately for each of them, so learning instruction-level access behavior reduces the complexity of prediction. Pattern learning is driven by the sequence of committed memory access instructions fed back by the write-back unit of the processor's main pipeline, and each prediction target is a single, independent memory access instruction with simple behavior, so interference from out-of-order execution is avoided and the accuracy of memory access sequence prediction improves. In summary, the low-pollution cache prefetching system based on instruction stream mixed-mode learning thereby reduces the uncertainty faced when predicting the memory access sequence, markedly lowers the frequency of cache misses, and improves the accuracy of prefetch address prediction.
Drawings
FIG. 1 is a block diagram of a processor micro-architecture to which the cache prefetching system and method based on instruction stream mixed-mode learning of the present invention are applicable.
FIG. 2 is a block diagram of another processor micro-architecture to which the cache prefetching system and method based on instruction stream mixed-mode learning of the present invention are applicable.
FIG. 3 is an overall framework diagram of a low-pollution cache prefetching system based on instruction stream mixed-mode learning according to one embodiment of the invention.
FIG. 4 is a flow diagram of the prediction process of the branch prediction module of the low-pollution cache prefetching system shown in FIG. 3.
FIG. 5 is a schematic diagram of the instruction fetch address queue of the low-pollution cache prefetching system of the present invention and the positional relationship of its commit, read and write pointers.
FIG. 6 is a schematic diagram of the query process of the memory access instruction buffer of the low-pollution cache prefetching system of the present invention.
FIG. 7(a) is a schematic diagram of a conventional temporal-correlation prefetching algorithm.
FIG. 7(b) is a schematic diagram of the memory access pattern learning process, based on the memory access history buffer, in the low-pollution cache prefetching system of the present invention.
FIG. 8 is a schematic diagram of the structure of a stride predictor entry of the low-pollution cache prefetching system of the present invention.
FIG. 9 is a functional diagram of the prefetch request generation module of the low-pollution cache prefetching system of the present invention.
FIG. 10 is a schematic diagram of the write-back waiting process for first-level data cache prefetch requests in the low-pollution cache prefetching system of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to specific examples.
The low-pollution cache prefetching system and method based on instruction stream mixed-mode learning of the present invention are applicable to the processor micro-architecture shown in FIG. 1. This micro-architecture comprises at least five stages: instruction fetch, decode, execute, memory access and write-back, corresponding respectively to the instruction fetch unit 100, the decode unit 200, the execution unit 300, the memory access unit 400 and the write-back unit 500, which together form the main pipeline. The first-level cache is divided into a first-level instruction cache and a first-level data cache, where the first-level instruction cache resides in the instruction fetch unit 100 of the CPU main pipeline and the first-level data cache in the memory access unit 400.
The present invention is also applicable to more complex processor micro-architectures that contain the functionality shown in FIG. 1, in which a stage may be refined, for example by splitting the execute stage into three sub-stages: rename, schedule and execute.
As shown in FIG. 2, following the existing prediction-ahead technique, the processor micro-architecture may further include a branch prediction unit 600, which interacts with the instruction fetch unit through a prediction queue. Specifically, the branch prediction unit 600 predicts the future instruction stream of the program and writes the prediction results into the prediction queue 700 at the granularity of aligned instruction blocks. Each entry in the prediction queue 700 includes the start address, the end address, and other branch instruction information of an instruction block. The relationship between the branch prediction unit 600 and the instruction fetch unit 100 is similar to producer and consumer: the instruction fetch unit 100 reads one entry from the prediction queue 700 per cycle, reads the corresponding instruction block data from the cache system according to that entry, and then sends the instruction block data to the subsequent units of the main pipeline (i.e., the decode unit 200, the execution unit 300, the memory access unit 400 and the write-back unit 500) to execute the instructions the block contains. When the low-pollution cache prefetching method based on instruction stream mixed-mode learning is applied to a processor micro-architecture that already includes such a branch prediction unit 600, the branch prediction unit 600 and the prediction queue 700 can serve directly as the system's branch prediction module and instruction fetch address queue, reducing the implementation effort.
FIG. 3 is an overall framework diagram of a low-pollution cache prefetching system based on instruction stream mixed-mode learning according to one embodiment of the invention. As shown in FIG. 3, the system adds seven modules on top of the existing processor micro-architecture (instruction fetch unit 100, decode unit 200, execution unit 300, memory access unit 400 and write-back unit 500): a branch prediction module 10, an instruction fetch address queue 20, a memory access instruction recording module 30, a memory access pattern learning module 40, a memory access address queue 50, a prefetch request generation module 60 and a prefetch request write-back module 70. Together these modules form a prediction sub-pipeline whose execution rate is higher than that of the main pipeline; it runs independently alongside the main pipeline of the processor (i.e., the sequential execution from the instruction fetch unit 100 through the write-back unit 500), and the execution order of its modules is: branch prediction module 10 → instruction fetch address queue 20 → memory access instruction recording module 30 → memory access pattern learning module 40 → memory access address queue 50. The prefetch request generation module 60 and the prefetch request write-back module 70 are relatively independent: the prefetch request generation module 60 selects an appropriate moment to retrieve the instruction fetch address queue and the memory access address queue, generates prefetch requests for the first-level instruction cache and the first-level data cache, and sends them to the cache system (e.g., the second-level and third-level caches). The prefetch request write-back module 70 buffers prefetch requests whose data has already been fetched. These requests carry data to be written into the first-level cache, but to prevent cache entries with a smaller reuse distance than the prefetch request from being evicted from the first-level cache during the write, the requests wait in the queue of the prefetch request write-back module 70 until they can be written back. The relationship between the prefetch request generation module 60 and the other five modules is again similar to producer and consumer: the prediction sub-pipeline fills the instruction fetch address queue 20 and the memory access address queue 50, and the prefetch request generation module 60 retrieves both.
(I) Branch prediction module 10
The branch prediction module 10 is the starting point of the prediction sub-pipeline. It is arranged to predict the instruction stream of the target program using a prediction-ahead technique and to write the prediction results into the instruction fetch address queue 20 at the granularity of instruction blocks (i.e., as aligned instruction blocks); if the instruction fetch address queue 20 is full, the prediction process is paused.
What is held in the instruction fetch address queue is therefore a sequence of instruction blocks (with additional branch instruction information). Since the instruction fetch unit also accesses the first-level instruction cache in fixed-width (64B) instruction blocks, the entries in the instruction fetch address queue 20 serve as a prediction of the future access sequence of the first-level instruction cache in the instruction fetch unit 100, that is, of the access sequence of the target program's instruction stream.
In each cycle, the branch prediction module 10 predicts the instruction stream of the target program by determining whether taken branch instructions exist in the instruction blocks of the fixed prediction window and acting on the result. Each cycle's prediction proceeds from the current prediction address, which is then updated for the next cycle. A very important function of the branch prediction module 10 is therefore to maintain the current prediction address, a 48-bit linear address.
Maintaining the current prediction address works as follows: if a taken branch instruction is predicted, the jump address of that branch instruction becomes the new current prediction address. In other words, the branch prediction module 10 is a self-looping branch prediction unit: its prediction result is written into the instruction fetch address queue 20, while the jump address of the taken branch in that result is fed back to the branch prediction module 10 itself as the current prediction address, i.e., the start address of the next prediction cycle. In addition, during processor initialization, or when a flush occurs in the main pipeline, the instruction fetch unit sends the initialization address or the flush address to the branch prediction module 10 to update the current prediction address.
FIG. 4 shows the prediction flow of the branch prediction module 10. The fixed prediction window of the branch prediction module is several instruction blocks wide, i.e., several consecutive instruction blocks can be predicted per cycle. The bit width of an instruction block equals the bit width used by the instruction fetch unit 100 when fetching, which facilitates interaction with the instruction fetch unit 100; here the blocks are 64B, matching the bandwidth of the instruction fetch unit 100 when it accesses the first-level instruction cache.
As shown in FIG. 4, the branch prediction module 10 is configured to perform the following steps:
Step A1: in each cycle (i.e., whenever the current prediction address is initialized or updated), take the instruction block of the first-level instruction cache containing the current prediction address as the current instruction block, and take the current prediction address as the prediction start address of the current instruction block; determine, from this prediction start address, a fixed prediction window beginning at the current instruction block.
Since the prediction of the target program's instruction stream is at the granularity of aligned instruction blocks, the fixed prediction window of each cycle is several instruction blocks wide, i.e., several instruction blocks can be predicted per cycle; for example, a fixed 32B prediction window can be split into two 16B instruction blocks predicted in parallel. In the present embodiment, assuming the current prediction address is 5, the fixed prediction window per cycle of the branch prediction module 10 is 128B, i.e., two aligned 64B instruction blocks are predicted per cycle. The prediction start address of the first 64B instruction block is the current prediction address (i.e., 5) and its range is [5, 63]; the prediction start address of the second 64B instruction block is 64 and its range is [64, 127]. By default the start address of each instruction block is an integer multiple of 64B, as for any aligned instruction block. The start address of the first instruction block of a prediction cycle, however, may not be a multiple of 64B, because that address may be the target of a branch jump and thus cannot be guaranteed to be aligned; the start addresses of the second (and third, fourth, etc.) instruction blocks of the cycle are therefore always integer multiples of 64B.
Step A2: with the prediction start address of the current instruction block obtained, search for and predict the branch instructions in the current instruction block to determine whether the current instruction block hits a taken branch instruction.
The branch prediction module 10 relies on an existing branch prediction algorithm for predicting branch instructions. The invention does not depend on any specific algorithm; existing branch prediction algorithms are all applicable. Taking a common tag-matching branch prediction algorithm as an example: first, the branch target buffer (BTB) is searched using the prediction start address of the current instruction block, and all branch instructions contained in the block are identified from the branch instruction positions stored in the BTB; this identification may hit several branch instructions. The jump direction and jump address of each identified branch instruction are then predicted to determine whether a taken branch instruction exists among them.
Step A3: if, according to the prediction, the current instruction block does not hit a taken branch instruction, meaning the block contains no branch instruction or none of the identified branch instructions is taken, write the information of the current instruction block into the instruction fetch address queue as the prediction result of the branch prediction module 10; then determine the prediction start address of the next instruction block, make that block the new current instruction block, and return to step A2, until the current instruction block is the last block in the fixed prediction window, at which point the current prediction address is auto-incremented by the number of instruction blocks in the window, thereby updating it for the next cycle.
Otherwise, i.e., the current instruction block hits a taken branch instruction, meaning one or more of the identified branch instructions is predicted taken: select the first taken branch instruction (the one with the smallest linear address) to mark the end address of the current instruction block, write the information of the current instruction block into the instruction fetch address queue as the prediction result, and update the current prediction address to the jump address of that first taken branch instruction to enter the next cycle.
That is, as shown in FIG. 4, in the present embodiment the instruction blocks predicted in each cycle are predicted one by one in serial order: the prediction start address of the next instruction block is each time computed from that of the current instruction block.
The information of an instruction block includes its line linear address, line physical address, start offset, end offset, jump bit, and so on.
The addresses used by the branch prediction module 10 during each cycle's prediction (e.g., the prediction start address of the current instruction block) are linear addresses. The addresses used in the cache system, however, are physical addresses, and the prefetch request generation module 60 retrieves the instruction fetch address queue and the memory access address queue by physical address. The address held in the instruction fetch address queue 20 therefore needs to include the physical address corresponding to the linear address.
Accordingly, the line linear address and start offset of the current instruction block are determined from its prediction start address, and when the information of the current instruction block is written into the instruction fetch address queue, the line linear address is looked up in the TLB (translation lookaside buffer) to obtain the line physical address; both the line linear address and the line physical address are stored in the instruction fetch address queue 20.
Specifically, when writing the information of an instruction block into the instruction fetch address queue 20, the 48-bit prediction start address of the block is split into a 42-bit high-order line linear address and a 6-bit low-order start offset; the 42-bit line linear address is used to query the TLB for the 34-bit line physical address; and the resulting line linear address, line physical address and start offset are written into the corresponding fields of the instruction fetch address queue entry. Furthermore, if the instruction block hits a taken branch instruction, the offset of the last byte of the first taken branch instruction is written into the instruction fetch address queue 20 as the end offset; otherwise the end offset is set to 63. Finally, if the current instruction block hits a taken branch instruction, the jump bit in the corresponding entry of the instruction fetch address queue 20 is set to 1. A sketch of this entry-filling step follows.
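A sketch of filling one queue entry from a 48-bit prediction start address, reusing FetchQueueEntry from the earlier sketch; TlbLookup is a hypothetical TLB query returning the 34-bit line physical address.

```cpp
#include <cstdint>

// Hypothetical TLB query: 42-bit line linear address in, 34-bit line
// physical address out.
uint64_t TlbLookup(uint64_t line_linear_addr);

// Fill one instruction fetch address queue entry from a 48-bit prediction
// start address (reuses FetchQueueEntry from the earlier sketch).
void FillFetchQueueEntry(FetchQueueEntry& e, uint64_t pred_start,
                         bool taken, uint8_t branch_last_byte_offset) {
    e.line_addr    = pred_start >> 6;    // high-order 42 bits
    e.begin_offset = pred_start & 0x3F;  // low-order 6 bits
    e.phys_addr    = TlbLookup(e.line_addr);
    e.taken        = taken;
    e.end_offset   = taken ? branch_last_byte_offset : 63;  // 63 if fall-through
    e.valid        = true;
}
```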
Because the prediction of each instruction block is relatively independent, in other embodiments the prediction start addresses of multiple instruction blocks may be predicted in parallel.
In that case, the branch prediction module 10 is arranged to perform the following steps:
Step A1': in each cycle (i.e., whenever the current prediction address is initialized or updated), take the instruction block containing the current prediction address as the current instruction block and the current prediction address as its prediction start address; then determine, from this prediction start address, the prediction start addresses of all instruction blocks in a fixed prediction window beginning at the current instruction block.
As described above, if the range of the first instruction block is [5, 63], then the range of the second is [64, 127] and the range of the third is [128, 191]; the start addresses of all blocks except the first are 64B-aligned.
Step A2': using the prediction start address of each instruction block, search for and predict the branch instructions in each block to determine whether each block hits a taken branch instruction.
Step A3': if no instruction block hits a taken branch instruction, write the information of all instruction blocks, in order, into the instruction fetch address queue as the prediction result of the branch prediction module 10, and auto-increment the current prediction address by the number of instruction blocks in the fixed prediction window.
Otherwise, if at least one instruction block hits at least one taken branch instruction, select the first taken branch instruction (the one with the smallest linear address) to mark the end address of the current instruction block, write the information of the current instruction block and of all blocks before it into the instruction fetch address queue as the prediction result, and update the current prediction address to the jump address of that first taken branch instruction.
(II) Instruction fetch address queue 20
Referring again to FIG. 3, the instruction fetch address queue 20 is configured to store the prediction results written by the branch prediction module 10 in units of instruction blocks; in the present design it has 4096 entries. In this embodiment each entry represents the prediction result for one 64B instruction block, the 64B width matching the 64B bandwidth of the instruction fetch unit 100 when accessing the first-level instruction cache.
The structure of each entry of the instruction fetch address queue can be expressed as:
<valid,line_addr,phys_addr,begin_offset,end_offset,taken>,
wherein valid represents a valid bit; line _ addr represents a line linear address; phys _ addr represents the line physical address; begin _ offset represents the starting offset; end _ offset represents an end offset; taken denotes the jump bit.
The line linear address, start offset and end offset determine the range of valid bytes within the 64B instruction block, which is used when querying the memory access instruction recording module 30 for the memory access instructions present in the block. Each entry of the instruction fetch address queue represents one instruction block, each with a start offset and an end offset, and the sequential concatenation of these blocks forms the instruction stream of the target program. The line physical address is used for address-matching calculations, for example when the reuse distance of a replacement candidate is queried. The jump bit indicates whether the instruction block corresponding to the entry contains a taken branch instruction.
As shown in FIG. 5, the instruction fetch address queue 20 is further provided with a commit pointer, a read pointer and a write pointer, each of which points to one of its entries and can be regarded as a number in the range [0, 4095].
The commit pointer of the instruction fetch address queue 20 points to the entry of the instruction block containing the next instruction to be committed. When all instructions of an instruction block have been committed in the main pipeline, that block's entry can be removed from the queue, and the commit pointer is incremented by 1. Specifically, the write-back unit 500 of the main pipeline feeds back to the instruction fetch address queue 20, in each cycle, the queue index of the instruction currently being committed; when this index is greater than the commit pointer, all instructions of the block pointed to by the commit pointer have been committed, and the commit pointer is incremented by 1. Entries before the commit pointer need not be actively cleared: the prediction results of the branch prediction module 10 may only be written to positions before the commit pointer, so those entries are overwritten automatically.
The read pointer of the instruction fetch address queue 20 points to the entry of the instruction block containing the next instruction to be read. When all instructions of the block currently being read by the instruction fetch unit 100 have been read, the read pointer is incremented by 1, ensuring that it always points to the entry of the block containing the next instruction to be read.
That is, the instruction blocks between the commit pointer and the read pointer are already in the main pipeline of the processor (i.e., in the flow of execution from the instruction fetch unit 100 to the write-back unit 500).
The write pointer of the instruction fetch address queue 20 points to the position of the branch prediction module 10's next write and determines where a new entry is written. The write pointer is incremented by 1 each time the branch prediction module 10 writes a new entry. The instruction blocks between the read pointer and the write pointer are thus the valid predicted blocks, and the write pointer combined with the commit pointer determines the empty/full state of the instruction fetch address queue 20 (i.e., whether it is full).
After a flush of the processor's main pipeline, the read and write pointers roll back, according to the flush type, to the position of the branch flush or to the position of the commit pointer. Specifically, when a branch flush occurs (i.e., the flush type is a branch flush), the execution unit 300 of the main pipeline feeds back to the instruction fetch address queue 20 the queue index of the flushed branch instruction as the branch flush pointer, and the read and write pointers roll back to that position. When a write-back flush occurs (i.e., the flush type is a write-back flush), the read and write pointers roll back to the position of the commit pointer. Note that the flush process does not actively clear the data in the queue; it only reassigns the read and write pointers. Branch prediction is not 100% accurate, so the prediction results in the instruction fetch address queue may be wrong, and the read and write pointers must be rolled back by the flush process so that wrong data is discarded and correct data rewritten. A sketch of the rollback follows.
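A sketch of the pointer rollback on a flush; the names and the 16-bit pointer type are assumptions. Note that only the pointers change; the queue contents are left in place.

```cpp
#include <cstdint>

enum class FlushType { Branch, WriteBack };

// Pointer rollback on a main-pipeline flush: the queue data is not cleared;
// only the read and write pointers are reassigned.
void OnFlush(FlushType type, uint16_t& read_ptr, uint16_t& write_ptr,
             uint16_t commit_ptr, uint16_t branch_flush_ptr) {
    uint16_t target = (type == FlushType::Branch) ? branch_flush_ptr  // branch flush
                                                  : commit_ptr;       // write-back flush
    read_ptr  = target;  // stale predictions past `target` are simply
    write_ptr = target;  // overwritten by subsequent predictions
}
```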
(III) Memory access instruction recording module 30
Referring again to FIG. 3, the memory access instruction recording module 30 is configured to receive the memory access instruction information fed back by the write-back unit 500, record in order the information of the committed memory access instructions, and write that information into a memory access instruction buffer 31. This makes it possible to determine whether an instruction block contains memory access instructions.
The memory access instruction recording module 30 has a memory access instruction buffer 31 whose storage structure is a table with 4096 entries.
The information of a memory access instruction (i.e., of a committed memory access instruction) includes the PC (program counter) of the instruction, its instruction linear address, its access address, its type, its access length, and so on.
The structure of each entry of the memory access instruction buffer 31 is represented as:
<LineAddr,PhyAddr,InstType,Memlen>,
where LineAddr is the instruction linear address of the memory access instruction; PhyAddr is the access physical address of the memory access instruction the last time it was executed; InstType is the type of the memory access instruction, InstType ∈ {DirectInst, InDirectInst}, where DirectInst denotes a direct memory access instruction and InDirectInst an indirect one; and Memlen is the access length of the memory access instruction.
The memory access instruction recording module 30 is further configured so that, when a new entry is written into the instruction fetch address queue 20, the memory access instruction buffer 31 is queried using the start address of the corresponding instruction block (i.e., line linear address + start offset), so as to obtain, where the query hits, the instruction linear address (i.e., the position) LineAddr and the type InstType of each memory access instruction; these are output, together with the entry's instruction-fetch-address-queue index, to the memory access pattern learning module 40 and the memory access address queue 50. If the instruction block contains several memory access instructions, they are ordered by their instruction linear addresses (i.e., positions) to form the memory access instruction sequence. The index of the new instruction fetch address queue 20 entry is carried along as the instruction-fetch-address-queue index and is written into the memory access address queue 50 in the subsequent flow to record the position of each memory access instruction. The type of each memory access instruction is used by the memory access pattern learning module 40 to select among multiple patterns.
Thus, through the memory access instruction recording module 30, the memory access instruction sequence of the program is obtained from the instruction block sequence predicted by the branch prediction module 10.
During the query, the memory access instruction recording module 30 performs hit determination against the instruction linear addresses of the memory access instructions in all entries of the memory access instruction buffer 31; for every instruction that hits, its position (instruction linear address) and type in the buffer are obtained.
The process of querying the memory access instruction buffer 31 is shown in fig. 6. As shown in fig. 6, the hit determination is an inclusion test: each entry in the instruction fetch address queue 20 corresponds to an instruction block, which is an address interval. The query therefore checks whether the instruction linear address of each memory access instruction lies within the interval of the newly written entry. If it does, the instruction hits; otherwise the query misses. If no memory access instruction hits, the instruction block corresponding to the new entry contains no memory access instruction, and nothing is done.
In the present embodiment, the entry newly added to the instruction fetch address queue 20 contains the 48-bit start address and the 48-bit end address of the instruction block, each of which splits into a 42-bit high-order tag (i.e., the line linear address) and a 6-bit low-order offset (the starting and ending offsets, respectively). During the query, all memory access instructions in the buffer 31 are checked against the tag and offsets of the instruction block: when the high-order tag of an instruction's linear address equals the line linear address of the instruction block corresponding to the new entry, and its low-order offset is greater than or equal to the starting offset and less than or equal to the ending offset of the instruction block, the instruction lies within the block, i.e., the query hits, yielding the instruction linear address (i.e., the position) and the type of the memory access instruction.
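The interval test described above amounts to a tag compare plus an offset range check. A minimal sketch (the 42/6-bit split follows the text; the function and parameter names are illustrative assumptions):

def hits_instruction_block(inst_line_addr: int,
                           block_line_addr: int,
                           begin_offset: int,
                           end_offset: int) -> bool:
    # split the 48-bit instruction linear address into a 42-bit high-order
    # tag and a 6-bit low-order offset within the 64B instruction block
    tag = inst_line_addr >> 6
    offset = inst_line_addr & 0x3F
    # hit: same line linear address, and the offset lies in [begin, end]
    return tag == block_line_addr and begin_offset <= offset <= end_offset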
In addition, the present invention adds a step-size predictor and a time-correlation predictor on top of the memory access instruction buffer 31; these predictors belong to the memory access pattern learning module 40 described in detail below, so that memory access instructions conforming to a specific pattern can be predicted.
(IV) memory access mode learning module 40
Referring to fig. 3 again, the memory access pattern learning module 40 is configured to record the memory access instruction sequence sent by the memory access instruction recording module 30 in the memory access history buffer, learn the memory access pattern of the memory access instruction according to the history information stored in the memory access history buffer, predict the memory access physical address of each memory access instruction in the memory access instruction sequence according to the learned memory access pattern, and write the memory access physical address of each memory access instruction as a prediction result into the memory access address queue 50.
The memory access pattern learning module 40 includes a memory access history buffer, which receives the memory access instruction information (PC, type and memory access physical address) fed back by the processor write-back stage. The memory access physical address of an instruction is written, according to its PC, into the table entry corresponding to that instruction in the memory access history buffer; each table entry is one piece of history information.
It should be noted that the memory access instruction buffer 31 of the recording module 30 and the memory access history buffer of the pattern learning module 40 are similar but differ as follows: the buffer 31 stores instruction-related information, such as the instruction linear address and the type of the memory access instruction; it also stores a memory access physical address, but only the result of the most recent execution. The memory access history buffer, by contrast, records past access history, namely the past 12 memory access physical addresses, so that pattern learning can be performed on that basis. Because of this similarity, the two arrays are designed to be equally large; although logically separate, in a concrete implementation they may be merged into the same hardware storage.
The memory access history buffer is an array indexed by the PC of the memory access instruction and has 4096 table entries. Each table entry is a 12-entry circular queue recording the past 12 memory access physical addresses of the same instruction; instructions with the same PC are identified as the same memory access instruction.
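A behavioral sketch of this structure follows (Python; the hardware is a fixed 4096-entry PC-indexed table, modeled here as a dictionary for brevity, and the names are assumptions):

from collections import defaultdict, deque

HISTORY_DEPTH = 12  # past 12 access physical addresses per instruction

# PC-indexed table of 12-entry circular queues; instructions with the
# same PC are treated as the same memory access instruction
access_history = defaultdict(lambda: deque(maxlen=HISTORY_DEPTH))

def record_access(pc: int, phys_addr: int) -> None:
    # fed back from the processor write-back stage; the oldest of the
    # 12 recorded addresses is discarded automatically
    access_history[pc].append(phys_addr)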
The memory access pattern learning process based on the memory access history buffer differs markedly from existing algorithms. As shown in fig. 7(a), the conventional time-correlation prefetching algorithm captures access patterns from a global miss buffer rather than from a per-instruction history buffer: a cache miss is the trigger condition for generating prefetch requests, the global miss buffer is searched with the missing address A, and an address sequence {A, B, C, D} headed by A is obtained. The three addresses {B, C, D} are the prefetch candidates for address A, and A, B, C, D can all be used to send prefetch requests. To cover more cache miss scenarios, the longest-match principle is generally adopted when searching for the sequence headed by address A, so that more prefetch requests can be issued; and to keep excessive prefetch requests from wasting cache-system bandwidth, the number of requests issued is typically bounded by the prefetch depth (degree). For the traditional algorithm, the length of the address sequence (i.e., the prefetch depth) does not affect prediction accuracy; it only controls how far ahead prefetching runs.
As shown in fig. 7(b), in the present invention the instruction block containing a memory access instruction M may appear in the instruction fetch address queue 20 multiple times, and an address prediction must be made independently for each occurrence of M. If the memory access physical addresses of M follow the pattern {A, B, C, D}, with A, B, C, D all being physical addresses accessed by M, the invention can accurately predict that the fifth occurrence of M accesses address A, provided that the pattern length of 4 has been learned. The length of the learned pattern is therefore a factor that must be considered in the learning stage of the invention.
Therefore, the memory access pattern learning module 40 includes a step-size predictor and a time-correlation predictor, through which it learns and predicts multiple memory access patterns, namely the step-size pattern and the time-correlation pattern, on the basis of the memory access history buffer.
The step-size predictor of the present invention is based on the conventional IBSP algorithm. In this embodiment, each entry of the step-size predictor records one step-size pattern, i.e., one stride access sequence; the entry structure is shown in fig. 8 and comprises a tag, the last address, the last step size, a trusted counter, a current-mode flag, a first address, a maximum counter, and the direction of the stride access sequence. That is, relative to the conventional IBSP algorithm, the entry is extended with the trusted counter, the current mode, the first address and the maximum counter. The original IBSP algorithm records only step-size information, because data prefetches are computed solely from the current memory access physical address. For a fixed-stride physical address sequence such as {A, A+K, A+2K, ..., A+NK}, the step-size predictor of the invention records not only the step size K but also the first address A and the maximum count N of the sequence.

To increase accuracy, the predictor adds the trusted counter and the current mode. Specifically, when the last step size equals the current step size, the trusted counter is incremented by one; otherwise it is cleared. When the trusted counter exceeds a threshold, the current mode is set to 1, indicating that the behavior of the current memory access instruction indeed follows the step-size pattern. If the current mode is 0, the entry is still in the learning stage, and a predictor in the learning stage outputs no prediction. The entry also stores a step size and a direction bit: the step size is always positive, and the direction bit determines whether the stride access sequence ascends or descends.

When an entry is allocated for a memory access instruction, its first address is set to the current memory access physical address. While counting along a stride sequence, the first address is not updated; only the last address is updated to the latest memory access physical address, so the head of the stride access sequence is preserved. When, during an update of an instruction that follows the stride sequence, the current memory access physical address is found to equal the first address of the sequence, the stride access sequence headed by that address is being executed again; at this moment the maximum counter is updated with the value of the trusted counter, so that both the head and the length of the stride access sequence are recorded.
The identification algorithm of the step-size pattern is given below as an example.
Algorithm 1. Step-size pattern identification algorithm
Input: the PC and the physical address PhysAddr of a committed memory access instruction;
Output: whether the step-size pattern holds
if (current mode == step-size mode)
    if (current direction and step size do not match the history)
        if (PhysAddr == first address && (maximum counter == 0 || trusted counter == maximum counter))
            maximum counter = trusted counter;
        else
            clear the step-size information;
        end if
    else
        trusted counter++;
        if (maximum counter != 0 && trusted counter > maximum counter)
            maximum counter = trusted counter;
        end if
    end if
else
    if (current direction and step size do not match the history)
        clear the step-size information;
    else
        trusted counter++;
        if (trusted counter > learning threshold)
            current mode = step-size mode;
        end if
    end if
end if
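A direct Python rendering of Algorithm 1 might look as follows. This is an illustrative sketch: the entry fields follow fig. 8, but the learning threshold value and the restart of the trusted counter after a wrap-around are assumptions not fixed by the text.

from dataclasses import dataclass

LEARNING_THRESHOLD = 3  # assumed value; the patent does not specify it

@dataclass
class StrideEntry:
    last_addr: int = 0
    last_stride: int = 0
    direction: int = 1        # +1 ascending, -1 descending
    trusted: int = 0          # trusted counter
    max_count: int = 0        # maximum counter
    in_stride_mode: bool = False
    first_addr: int = 0       # head of the stride access sequence

def clear_stride_info(e: StrideEntry) -> None:
    # reset the stride-related fields of the entry
    e.last_stride = 0
    e.trusted = 0
    e.max_count = 0
    e.in_stride_mode = False

def update_stride_entry(e: StrideEntry, phys_addr: int) -> None:
    stride = abs(phys_addr - e.last_addr)
    direction = 1 if phys_addr >= e.last_addr else -1
    matches = (stride == e.last_stride and direction == e.direction)
    if e.in_stride_mode:
        if not matches:
            # a mismatch may mean the sequence wrapped to its first address
            if phys_addr == e.first_addr and (e.max_count == 0 or
                                              e.trusted == e.max_count):
                e.max_count = e.trusted   # head and length are now recorded
                e.trusted = 0             # assumed: counting restarts
                e.last_addr = phys_addr   # keep the learned stride/direction
                return
            clear_stride_info(e)
        else:
            e.trusted += 1
            if e.max_count != 0 and e.trusted > e.max_count:
                e.max_count = e.trusted
    else:
        # learning stage: no prediction is output
        if not matches:
            clear_stride_info(e)
        else:
            e.trusted += 1
            if e.trusted > LEARNING_THRESHOLD:
                e.in_stride_mode = True
    e.last_addr = phys_addr
    e.last_stride = stride
    e.direction = direction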
Each entry of the time-correlation predictor records one time-correlation pattern, i.e., a memory access sequence that recurs in a fixed order (a time-correlated access sequence). Each entry comprises a time-correlation mode flag, the length of the time-correlated sequence, and the sequence itself. For example, after observing a memory access sequence such as {A, B, C, D, A, B, C, D}, there is a high probability that the sequence {B, C, D} again follows immediately after A. Such a time-correlation pattern also exists in instruction-level memory access behavior.
Algorithm 2. Time-correlation pattern identification
Input: the past 12 memory access physical addresses;
Output: time-correlation information
record the past 12 memory access physical addresses in memory access history buffer[11:0];
if (memory access history buffer[3n-1:0] matches {A1, A2, ..., An, A1, A2, ..., An, A1, A2, ..., An})
    time-correlation mode = 1;
    time-correlation sequence length = n;
    time-correlation sequence = memory access history buffer[n-1:0];
end if
The time-correlation predictor relies on the past 12 memory access physical addresses of each instruction recorded in the memory access history buffer and identifies the time-correlation pattern from this history. In the algorithm of the invention, a sequence is recognized as a time-correlation pattern only after it has repeated 3 times, and sequence lengths up to 4 are supported.
In addition to this basic time correlation pattern, the algorithm of the present invention supports a "step-time correlation pattern" in the form of { A, B, C, A + n, B + n, C + n, A +2n, B +2n, C +2n, … }.
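A direct reading of Algorithm 2 for the basic time-correlation pattern could be sketched as below (Python; the most-recent-first list order and the function name are assumptions; the step-time variant is omitted):

def find_time_correlation(history, max_len=4, repeats=3):
    # history: most-recent-first list of past access physical addresses,
    # e.g. list(reversed(access_history[pc])) from the sketch above
    for n in range(1, max_len + 1):
        if len(history) < repeats * n:
            break
        seq = history[:n]
        # the last 3n addresses must be three repetitions of seq
        if all(history[i] == seq[i % n] for i in range(repeats * n)):
            return seq  # the time-correlated sequence, most recent first
    return None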
(V) memory access address queue 50
Referring to fig. 3 again, the memory access address queue 50 is used to store the prediction results of the memory access instruction recording module 30 and the memory access pattern learning module 40, that is, the memory access instruction sequence output by the memory access instruction recording module 30 and the memory access physical address of each memory access instruction output by the memory access pattern learning module 40. The contents of the memory address queue 50 may be viewed as a future sequence of accesses to the level one data cache.
The access address queue 50 has a number of entries, each corresponding to the information of one memory access instruction and to one entry in the fetch address queue 20. The access address queue 50 has its own commit, read, and write pointers, corresponding to the entries pointed to by the commit, read, and write pointers of the fetch address queue 20. Specifically, the commit pointer of the access address queue 50 points to the entry of the next instruction to be committed in the queue, the read pointer points to the entry of the next instruction to be read, and the write pointer points to the position of the next write by the memory access instruction recording module 30 and the memory access pattern learning module 40.
In this embodiment, the access address queue 50 has 65536 entries. This value follows from the 4096 entries of the instruction fetch address queue 20: each entry of the fetch address queue represents a 64B-aligned instruction block, and assuming at most 16 memory access instructions per block, the 4096 instruction blocks contain at most 65536 memory access instructions. Each entry of the access address queue 50 corresponds to one memory access instruction, and one entry of the fetch address queue 20 may correspond to 0 to 16 entries of the access address queue 50. Since not every 64B instruction block actually holds 16 memory access instructions, the capacity of the access address queue 50 has design redundancy.
The structure of each entry in the access address queue 50 is represented as:
<valid,inst_line_addr,mem_phys_addr,memlen,inst_queue_index>,
wherein valid represents the valid bit; inst_line_addr represents the instruction linear address; mem_phys_addr represents the memory access physical address; memlen represents the memory access length (together with the physical address it identifies accesses that cross a cache line); inst_queue_index is the index into the fetch address queue, which facilitates synchronized flushing of the fetch address queue and the access address queue upon a pipeline flush.
The instruction linear address inst_line_addr, the access length memlen, and the fetch address queue index inst_queue_index come from the memory access instruction buffer 31 of the recording module 30; the memory access physical address mem_phys_addr comes from the memory access pattern learning module 40 or from the recording module 30. If the pattern learning module 40 has identified the step-size pattern and/or time-correlation pattern of the instruction, the physical address is supplied by the corresponding predictor; otherwise the instruction belongs to no specific pattern, and the physical address is the one recorded in the buffer 31 from the instruction's last execution.
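The address selection just described might be sketched as follows (Python; the priority between the two predictors when both match, and all helper shapes, are assumptions for illustration):

def predicted_phys_addr(stride_entry, time_seq, buffer_entry):
    # priority: learned stride pattern, then time-correlation pattern,
    # then the last-execution address from the access instruction buffer 31
    if stride_entry is not None and stride_entry.in_stride_mode:
        # next address along the stride access sequence
        return stride_entry.last_addr + stride_entry.direction * stride_entry.last_stride
    if time_seq is not None:
        # with period n, the next address equals the one n accesses ago,
        # i.e. the oldest element of the most-recent-first sequence
        return time_seq[-1]
    return buffer_entry.phy_addr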
The flush flow of the access address queue 50 upon a branch flush in the main pipeline is as follows. When a branch flush occurs, the execution unit of the main pipeline feeds the number of the branch instruction in the fetch address queue 20 back to the fetch address queue 20 as the branch instruction flush pointer, and the read and write pointers of the fetch address queue 20 roll back to the branch flush position according to this number. Meanwhile, the same number is sent to the access address queue 50, which traverses the queue starting from the commit pointer and finds the position of the first entry whose fetch address queue index inst_queue_index equals the number of the branch instruction; this is the branch flush position, and the read and write pointers of the access address queue 50 then roll back to it.
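A minimal sketch of this traversal (Python; queue structure and field names are illustrative assumptions):

def rollback_on_branch_flush(maq, branch_entry_no):
    # maq: access address queue with commit/read/write pointers and
    # entries carrying inst_queue_index
    i = maq.commit_ptr
    while i != maq.write_ptr:
        # first entry belonging to the flushed branch's instruction block
        if maq.entries[i].inst_queue_index == branch_entry_no:
            break
        i = (i + 1) % len(maq.entries)
    # roll both pointers back to the branch flush position
    maq.read_ptr = i
    maq.write_ptr = i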
(VI) prefetch request generation module 60
The prefetch request generation module 60 is arranged to: when the first-level data cache receives a new request, retrieve the fetch address queue 20 and the access address queue 50, generate prefetch requests of the first-level instruction cache and of the first-level data cache for the instruction blocks about to enter the main pipeline, and send both kinds of prefetch requests to the cache system, so as to obtain the prefetch requests of the fetched data.
The first-level cache in the processor is divided into a first-level instruction cache and a first-level data cache, and the instruction fetching address queue and the access address queue are future access sequences of the first-level instruction cache and the first-level data cache respectively. The fetch address queue 20 stores a complete instruction sequence including branch instructions, arithmetic instructions, and access instructions. The access address queue 50 needs to separate the access instruction from the complete instruction stream, so the instruction fetch address queue 20 is the basis of the access address queue 50.
Fig. 9 is a schematic diagram of the operation of the prefetch request generation module 60. When a new read/write access request arrives at the first-level data cache in the memory access unit 400, a message is sent to the prefetch request generation module 60 to announce the new access, and this message is the trigger event of the prefetch process. The trigger must be a sufficiently frequent event; since the processor accesses the first-level data cache whenever it is running, the read/write access request is chosen as the trigger condition to ensure the timeliness of prefetching.
As described above, the memory access physical addresses of the instructions about to enter the main pipeline of the processor are stored in the access address queue 50, so every entry of the queue is a prefetch-request candidate. However, the prediction bandwidth of the prediction sub-pipeline (128B) exceeds the fetch bandwidth of the CPU main pipeline (64B), and the prediction sub-pipeline keeps working even when the main pipeline stalls; the branch prediction sub-pipeline therefore runs far ahead of the main pipeline most of the time, and an instruction written into the access address queue 50 may not be executed by the program until much later. Consequently, the best moment to send a prefetch request to the cache system is not when the instruction is written into the access address queue 50 but when its instruction block is about to enter the main pipeline. Specifically, the prefetch request generation module 60 determines the instruction blocks about to enter the main pipeline from the read pointer of the fetch address queue 20.
Therefore, the prefetch request generation module 60 is arranged to perform the following steps:

Step B1: upon receiving the message that the first-level data cache has received a new request, retrieve the fetch address queue 20 to obtain the position of its read pointer; take the read pointer position plus a prefetch advance as the prefetch start, take the N consecutive prefetch instruction blocks beginning at the prefetch start as the instruction blocks about to enter the main pipeline, and generate the corresponding first-level instruction cache prefetch requests for them. This prepares for the issue process of first-level instruction cache prefetch requests.
The instruction blocks behind the read pointer of the fetch address queue 20 are about to enter the pipeline; a prefetch advance is added to the read pointer as the start of the prefetch instruction blocks (the prefetch start). The prefetch advance is a fixed value, tuned experimentally for best effect, and it does not change the position of the read pointer itself. In this embodiment the prefetch advance is 120. Each prefetch pass sends prefetch requests for N consecutive prefetch instruction blocks; N is the prefetch count, which bounds the number of prefetch requests issued per pass, and in this embodiment N is 4.
In step B1, the number of valid predicted instruction blocks in the fetch address queue 20 is counted as the distance between the write pointer and the read pointer, and this count decides whether prefetch-request generation continues: first-level instruction cache prefetch requests are generated only when the number of valid predicted instruction blocks exceeds the sum of the prefetch advance and the prefetch count N.
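The gating in step B1 might be sketched as follows (Python, reusing the FetchAddressQueue shape assumed earlier; the function name is illustrative):

PREFETCH_ADVANCE = 120  # value used in this embodiment
PREFETCH_COUNT = 4      # N: prefetch instruction blocks per trigger

def blocks_to_prefetch(faq):
    # valid predicted blocks = distance between write and read pointers
    valid = (faq.write_ptr - faq.read_ptr) % faq.size
    if valid <= PREFETCH_ADVANCE + PREFETCH_COUNT:
        return []  # not enough valid blocks: skip this trigger
    start = (faq.read_ptr + PREFETCH_ADVANCE) % faq.size
    return [(start + k) % faq.size for k in range(PREFETCH_COUNT)]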
The arrival of a new request at the first-level data cache is the trigger condition of the prefetch process and acts as its master switch; the prefetch advance and the prefetch count N play the role of adjustable thresholds in the control logic.

Step B2: receive the pointers of the prefetch instruction blocks fed back by the fetch address queue 20 (i.e., their numbers in the fetch address queue 20), use these pointers to retrieve the fetch address queue index (i.e., the inst_queue_index field) in the access address queue 50, obtain the memory access instructions contained in each prefetch instruction block together with their memory access physical addresses, and generate the prefetch requests of the first-level data cache from these physical addresses. This prepares for the issue process of first-level data cache prefetch requests.

Step B3: send the prefetch requests of the first-level instruction cache and of the first-level data cache to the cache system to obtain the prefetch requests of the fetched data.

A prefetch request reads 64B-aligned data from memory according to its address and writes it into the cache system, finally into the first-level cache; the "fetched data" therefore refers to the 64B of data read from memory.

(VII) prefetch request write-back module 70

The prefetch request write-back module 70 is configured to temporarily store the prefetch requests of the fetched data in its prefetch queue, so that the prefetch request at the head of the prefetch queue waits for write-back or is written back immediately to the first-level cache according to the commit status of the instructions before it.
These prefetch requests carry data to be written into the first-level cache, but writing back to the cache array immediately after the data returns may evict useful data from the cache. To prevent cache entries with a smaller reuse distance than the prefetch request from being evicted from the first-level cache during the write, the prefetch requests are stored in the prefetch queue in chronological order, where each waits until the moment it can be written back.
The prefetch queue is a 50-entry circular queue whose head holds the prefetch request whose data is next to be written into the cache. A prefetch request first waits in the prefetch queue for its write-back moment, namely until it is certain that a replaceable entry exists in the first-level cache array, and only then is it written back. If all instructions before the prefetch request have been committed, that is, no memory access instruction mapping to the same set of the first-level cache array remains uncommitted before the prefetch request in the access address queue 50, then a replaceable entry is guaranteed to exist in the array, and the prefetch request can be written back to the first-level cache with confidence.
The first-level cache array adopts the most common organization, the set-associative structure. A 64-entry first-level cache array is represented as a 16 × 4 matrix whose rows are "sets" and whose columns are "ways"; the structure of the 64-entry cache array is thus described as 16 sets of 4 ways. When writing into a set-associative cache array, the set number is first derived from the physical address, and the way to be written is then selected by the cache replacement policy, after which the data of the prefetch request is written.
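The set selection could be sketched as follows (Python; 64B lines and an LRU policy are assumptions for illustration):

NUM_SETS, NUM_WAYS = 16, 4   # 64-entry L1 array: 16 sets x 4 ways
BLOCK_BITS = 6               # 64B cache lines

def cache_set_index(phys_addr: int) -> int:
    # the set number comes from the address bits just above the 64B offset
    # e.g. cache_set_index(0x1A40) == 9
    return (phys_addr >> BLOCK_BITS) % NUM_SETS

def choose_victim_way(lru_order):
    # way selection is left to the replacement policy; LRU assumed here
    return lru_order[-1]  # least recently used way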
Fig. 10 illustrates the write-back waiting process of a prefetch request, taking a first-level data cache prefetch request as an example. As described above, the prefetch request generation module 60 determines the prefetch start in the fetch address queue 20 from the prefetch advance and sends the corresponding prefetch request M from the access address queue 50 to the cache system, obtaining the prefetch request with fetched data. The prefetch request write-back module 70 temporarily stores such requests in its prefetch queue. For the prefetch request M at the head of the queue, the uncommitted instruction set of the access address queue 50 is retrieved, i.e., the entries after the commit pointer (inclusive) and before M (exclusive); it is judged whether this set contains other memory access instructions that map to the same set of the first-level data cache array as M, and this judgment result is the commit status of the instructions before the prefetch request. If such same-set memory access instructions exist and their number is at least the number of ways of the cache array, M waits to be written back; otherwise M is written back to the first-level data cache immediately.
If the cache array has only 4 ways and the number of uncommitted same-set instructions (A, B, C and D in fig. 10) exactly equals the number of ways, writing M would replace one of A, B, C, D and hurt performance. In that case the position L of the first instruction A in the fetch address queue must be found, and M may be written into the cache array only after the commit pointer of the fetch address queue passes L, since at that point instruction A has certainly been executed in the main pipeline. The data of A is then still in the cache array, but because it has already been used, replacing it with the data of M causes no performance loss. Conversely, replacing A with the data of M before A finishes executing could cause A to suffer additional cache misses during execution.
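The wait-or-write decision might be sketched as below (Python, reusing cache_set_index from the previous sketch; the queue shape and names are assumptions):

def may_write_back_now(maq, prefetch_idx, num_ways=NUM_WAYS):
    # count uncommitted instructions before the prefetch request that map
    # to the same cache set (commit pointer inclusive, request exclusive)
    target_set = cache_set_index(maq.entries[prefetch_idx].mem_phys_addr)
    same_set = 0
    i = maq.commit_ptr
    while i != prefetch_idx:
        e = maq.entries[i]
        if e is not None and cache_set_index(e.mem_phys_addr) == target_set:
            same_set += 1
        i = (i + 1) % len(maq.entries)
    # if they could occupy every way, writing now risks evicting data
    # with a smaller reuse distance, so the request must wait
    return same_set < num_ways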
If the prefetch request targets the first-level instruction cache, only the fetch address queue 20 needs to be retrieved accordingly. During program execution the number of first-level instruction cache misses is far smaller than that of the first-level data cache, so the cache prefetch method of the present invention is chiefly concerned with prefetching data requests. The experiments of the invention therefore test only the first-level data cache; since the first-level instruction cache miss count of many SPEC programs is 0, prefetching for the first-level instruction cache is of no experimental significance.
The low-pollution cache prefetching system based on instruction-flow mixed-mode learning assumes that the predictions in the access address queue are accurate; if a prediction is wrong, the logic here may incur some performance loss.
Based on the low-pollution cache prefetching system based on instruction flow mixed mode learning, the low-pollution cache prefetching method based on instruction flow mixed mode learning comprises the following steps:
step S1: a branch prediction module 10 is used for predicting the instruction stream of the target program by adopting the advance prediction technology and writing the prediction result into the instruction fetching address queue; when the fetch address queue 20 is full, the prediction process is suspended;
the step S1 includes:
step A1: in each period, taking the instruction block where the current prediction address is as the current instruction block, and taking the current prediction address as the prediction starting address of the current instruction block; determining a fixed prediction window starting from the current instruction block according to the prediction starting address of the current instruction block;
step A2: after the prediction starting address of the current instruction block is obtained, retrieving and predicting branch instructions in the current instruction block to judge whether the current instruction block hits the branch instructions of the jump or not;
step A3: according to the judgment result, if the branch instruction of the jump is not hit in the current instruction block, which indicates that no branch instruction exists in the current instruction block or the identified branch instruction is not jumped, the information of the current instruction block is written into the instruction fetch address queue as the prediction result of the branch prediction module 10; then, determining a prediction starting address of a next instruction block, taking the next instruction block as a new current instruction block, and returning to the step A2 until the current instruction block is the last instruction block in a fixed prediction window, wherein the current prediction address is automatically increased according to the number of the instruction blocks in the fixed prediction window, so that the current prediction address is updated, and entering a next period;
otherwise, selecting the branch instruction of the first jump with the minimum linear address from the branch instructions of the hit jump as the end address of the current instruction block, writing the information of the current instruction block as a prediction result into an instruction fetch address queue, and updating the current prediction address to the jump address of the branch instruction of the first jump to enter the next cycle.
Alternatively, the step S1 includes:
step a 1': in each cycle (namely when the current prediction address is initialized or updated), taking the instruction block where the current prediction address is as the current instruction block, and taking the current prediction address as the prediction starting address of the current instruction block; then, determining the prediction start addresses of all instruction blocks in a fixed prediction window starting from the current instruction block according to the prediction start address of the current instruction block;
step a 2': according to the prediction starting address of each instruction block, retrieving and predicting branch instructions in each instruction block to judge whether each instruction block hits the branch instructions of the jump;
step a 3': according to the judgment result, if all the instruction blocks miss the branch instruction of the jump, the information of all the instruction blocks is used as the prediction result of the branch prediction module 10 to be written into the instruction fetching address queue in sequence; the current prediction address is increased automatically according to the number of the instruction blocks in the fixed prediction window so as to enter the next period;
otherwise, if at least one instruction block hits at least one jump branch instruction, selecting a first jump branch instruction with the minimum linear address from the hit jump branch instructions as the tail address of the current instruction block, writing the information of the current instruction block and all instruction blocks before the current instruction block as the prediction result into the instruction fetch address queue, and updating the current prediction address to the jump address of the first jump branch instruction to enter the next cycle.
Step S2: using the memory access instruction recording module 30 to record the information of the committed memory access instructions and write it into the memory access instruction buffer; when a new entry is written into the instruction fetch address queue 20, the memory access instruction buffer is queried with the starting address of the instruction block corresponding to that entry, so that the memory access instruction sequence is obtained through query attempts and output to the memory access pattern learning module and the access address queue;

Step S3: using the memory access pattern learning module 40 to record the memory access instruction sequence in the memory access history buffer, learn the memory access patterns of the instructions from the history information stored there, predict the memory access physical address of each instruction in the sequence according to the learned patterns, and write these physical addresses into the access address queue;
step S4: utilizing the prefetch request generating module 60, when the first-level data cache receives a new request, retrieving the fetch address queue and the access address queue, respectively generating a prefetch request of the first-level instruction cache and the first-level data cache for an instruction block which is about to enter a main pipeline, and sending the prefetch request to the cache system to obtain a prefetch request of the fetched data;
in the step S4, the prefetch request generating module 60 determines an instruction block to enter the main pipeline according to the read pointer of the instruction fetch address queue 20;
step S4 includes:
step B1: after receiving a message of a new request received by a first-level data cache, retrieving an instruction-fetching address queue, obtaining the position of a read pointer of the instruction-fetching address queue, taking the position of the read pointer after adding a pre-fetching lead as a pre-fetching start, taking N continuous pre-fetching instruction blocks taking the pre-fetching start as a first pre-fetching instruction block as an instruction block to enter a main pipeline, and generating a pre-fetching request of a corresponding first-level instruction cache for the instruction block to enter the main pipeline;
step B2: receiving pointers of all the prefetch instruction blocks fed back by the fetch address queue, retrieving fetch address queue indexes in the access address queue by using the pointers of the prefetch instruction blocks, acquiring access instructions and access physical addresses thereof contained in all the prefetch instruction blocks, and generating prefetch requests of a first-level data cache according to the access physical addresses of the access instructions;
step B3: and sending the prefetch request of the first-level instruction cache and the prefetch request of the first-level data cache to a cache system to obtain the prefetch request of the taken data.
Step S5: the prefetch request write-back module 70 is utilized to temporarily store the prefetch request of the fetched data in the prefetch queue, so that the prefetch request of the fetched data at the head of the prefetch queue waits for write-back or immediately writes back to the first-level cache according to the instruction submission condition before the prefetch request.
In step S5, for the prefetch request at the head of the prefetch queue, the uncommitted instruction set of the access address queue 50 is retrieved, i.e., the entries after the commit pointer and before the prefetch request, and it is judged whether this set contains other memory access instructions mapping to the same set of the first-level cache array as the prefetch request; the judgment result is the commit status of the instructions before the prefetch request. If such same-set memory access instructions exist and their number is at least the number of ways of the cache array, the prefetch request waits for write-back; otherwise it is written back to the first-level cache immediately.
Experimental verification of the invention:
the GEM5 simulator is used as the basic experimental environment, using the alpha instruction set. A DerivO3CPU fine-grained CPU model is used, a branch prediction algorithm is TAGE _ L, a first-level cache 32KB, a second-level cache 256KB, an instruction fetching bit width 64B and a memory 4 GB. The SPEC2006 test program was chosen to compare the performance of the inventive method with other methods.
TABLE 1 comparison of read operation cache miss counts for different SPEC programs
Program    No prefetch  STeMS   BOP     ISB     STRIDE  IFBTS
lbm        493947       494368  381635  494291  187410  153940
bzip2      180765       184835  184124  177324  177003  172128
GemsFDTD   586680       586701  553972  586635  510435  123719
gobmk      67463        69636   68240   66167   51302   58829
astar      91           89      90      91      91      74
h264ref    261561       263830  215465  260042  91558   61381
hmmer      43410        44040   43019   43199   41924   4352
soplex     813733       811522  624612  813817  287799  230243
bwaves     578612       578703  372337  578611  174377  88849
sjeng      10160        12003   11090   9908    10475   11458
TABLE 2 comparison of write operation cache miss counts for different SPEC programs

Program    No prefetch  STeMS   BOP     ISB     STRIDE  IFBTS
lbm        427392       423352  358226  393252  153874  140457
bzip2      18306        18247   17967   18248   15809   15655
GemsFDTD   3            3       3       3       3       3
gobmk      58343        58857   58022   56175   47325   53317
astar      65564        65564   59548   65564   39296   18915
h264ref    4611         4685    5291    4614    4426    3017
hmmer      168577       168488  165541  168519  153174  6492
soplex     106596       106552  89808   106605  73147   51185
bwaves     23           23      23      23      23      23
sjeng      27395        27873   29417   27252   24206   28584
For the performance comparison, the cache miss counts of read and write operations of the first-level data cache are counted separately. As shown in Tables 1 and 2, for read operations the method of the invention reduces the cache miss count on average by 28.26%, 51.47%, 47.28% and 48.75% relative to the STRIDE, STeMS, BOP and ISB algorithms, respectively; for write operations the reductions relative to STRIDE, STeMS, BOP and ISB are 18.84%, 34.28%, 33.49% and 33.26%, respectively.
At present, the performance of high-performance processors cannot be fully realized because of the cache-system bottleneck, and the system and method of the present invention have a significant effect on improving processor performance.
The method of the present invention is applicable to all processor micro-architectures, and is not limited to a specific branch prediction algorithm and an instruction set, and of course, the specific implementation process may vary slightly depending on the instruction set and the specific processor micro-architecture, but also falls within the protection scope of the present invention.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (10)

1. A low-pollution cache prefetching system based on instruction flow mixed mode learning is characterized by comprising a main assembly line and a branch prediction sub assembly line, wherein the branch prediction sub assembly line comprises a branch prediction module, an instruction fetch address queue, an access instruction recording module, an access mode learning module, an access address queue, a prefetching request generation module and a prefetching request write-back module;
the branch prediction module is set to adopt a prediction-in-advance technology to predict the instruction stream of the target program and write the prediction result into the instruction fetch address queue;
the memory access instruction recording module is set to record the information of the submitted memory access instructions in sequence and write the information of the memory access instructions into the memory access instruction buffer; meanwhile, when a new entry item is written into the instruction fetching address queue, the memory access instruction buffer is queried by using the starting address of the instruction block corresponding to the entry item, so that a memory access instruction sequence is obtained through query attempts and is output to the access mode learning module and the access address queue;
the memory access mode learning module is set to record the memory access instruction sequence in the memory access history buffer, learn the memory access mode of the memory access instruction according to the history information stored in the memory access history buffer, predict the memory access physical address of each memory access instruction in the memory access instruction sequence according to the learned memory access mode and write the memory access physical address into the memory access address queue;
the prefetch request generation module is arranged to retrieve the fetch address queue and the access address queue when a new request is received by the first-level data cache, respectively generate prefetch requests of the first-level instruction cache and the first-level data cache for instruction blocks which are about to enter a main pipeline and send the prefetch requests to the cache system so as to obtain the prefetch requests of the fetched data;
the prefetch request write-back module is set to temporarily store the prefetch request of the taken data in the prefetch queue, so that the prefetch request of the taken data at the head of the prefetch queue waits for write-back or immediately writes back to the first-level cache according to the instruction submission condition before the prefetch request.
2. The instruction flow mixed mode learning-based low-pollution cache prefetching system according to claim 1, wherein the branch prediction module is configured to write the prediction result into the instruction fetch address queue at the granularity of an instruction block; and the branch prediction module is arranged to perform:
step A1: in each period, taking the instruction block where the current prediction address is as the current instruction block, and taking the current prediction address as the prediction starting address of the current instruction block; determining a fixed prediction window starting from the current instruction block according to the prediction starting address of the current instruction block;
step A2: after the prediction starting address of the current instruction block is obtained, retrieving and predicting branch instructions in the current instruction block to judge whether the current instruction block hits the branch instructions of the jump or not;
step A3: according to the judgment result, if the branch instruction of the jump is not hit in the current instruction block, the current instruction block does not have the branch instruction or the identified branch instruction does not jump, the information of the current instruction block is used as the prediction result of the branch prediction module to be written into the instruction fetching address queue; then, determining a prediction starting address of a next instruction block, taking the next instruction block as a new current instruction block, and returning to the step A2 until the current instruction block is the last instruction block in a fixed prediction window, wherein the current prediction address is automatically increased according to the number of the instruction blocks in the fixed prediction window to enter a next cycle;
otherwise, selecting a first jump branch instruction with the minimum linear address from the hit jump branch instructions as the end address of the current instruction block, writing the information of the current instruction block as a prediction result into an instruction fetch address queue, and updating the current prediction address to the jump address of the first jump branch instruction to enter the next cycle;
or the branch prediction module is set to write the prediction result into the instruction fetch address queue by taking the instruction block as granularity; and the branch prediction module is arranged to perform:
step a 1': in each period, taking the instruction block where the current prediction address is as the current instruction block, and taking the current prediction address as the prediction starting address of the current instruction block; then, determining the prediction start addresses of all instruction blocks in a fixed prediction window starting from the current instruction block according to the prediction start address of the current instruction block;
step a 2': according to the predicted initial address of each instruction block, retrieving and predicting branch instructions in each instruction block to judge whether each instruction block hits the branch instructions of the jump;
step a 3': according to the judgment result, if all the instruction blocks miss the jumped branch instruction, the information of all the instruction blocks is used as the prediction result of the branch prediction module to be sequentially written into the instruction fetching address queue; the current prediction address is increased automatically according to the number of the instruction blocks in the fixed prediction window so as to enter the next period;
otherwise, if at least one instruction block hits at least one jump branch instruction, selecting a first jump branch instruction with the minimum linear address from the hit jump branch instructions as the tail address of the current instruction block, writing the information of the current instruction block and all instruction blocks before the current instruction block as the prediction result into the instruction fetch address queue, and updating the current prediction address to the jump address of the first jump branch instruction to enter the next cycle.
3. The instruction flow mixed mode learning-based low-pollution cache prefetching system according to claim 2, wherein the main pipeline comprises an instruction fetching unit, a decoding unit, an execution unit, an access unit and a write-back unit which are connected in sequence, and in the branch prediction module, the bit width of the instruction block is equal to the bit width of the instruction fetching unit during instruction fetching;
the information of the instruction block comprises a line linear address, a line physical address, a starting offset, an ending offset and a jump bit of the instruction block; when the information of the instruction block is written into the instruction fetch address queue, the predicted starting address of the instruction block is split into a high-order line linear address and a low-order starting offset, the TLB is queried with the line linear address to obtain the line physical address, and the obtained line linear address, line physical address and starting offset are written into the corresponding fields of the entry item of the instruction fetch address queue; in addition, if the instruction block hits a jumped branch instruction, the last byte offset of the first jumped branch instruction is written into the instruction fetch address queue as the ending offset; otherwise, the ending offset in the fetch address queue is set to 63.
4. The instruction flow mixed mode learning-based low-pollution cache prefetching system according to claim 1, wherein the structure of each entry in the fetch address queue is:
<valid,line_addr,phys_addr,begin_offset,end_offset,taken>,
wherein valid represents a valid bit; line _ addr represents a line linear address; phys _ addr represents the line physical address; begin _ offset represents the starting offset; end _ offset represents the ending offset; taken represents the jump bit;
the instruction fetching address queue is also provided with a submission pointer, a read pointer and a write pointer; a commit pointer of the instruction fetching address queue points to an entry item of an instruction block where a next to-be-committed instruction is located in the instruction fetching address queue; a read pointer of the instruction fetching address queue points to an entry item of an instruction block where a next instruction to be read is located in the instruction fetching address queue; a write pointer of the fetch address queue points to the position of the next write of the branch prediction module; after the main pipeline of the processor refreshes, according to the refresh type, rolling back a read pointer and a write pointer of the instruction fetching address queue to the position of branch refresh or the position of a submit pointer;
the structure of each table entry of the access instruction buffer is as follows:
<LineAddr,PhyAddr,InstType,Memlen>,
wherein, the LineAddr represents an instruction linear address of an access instruction; PhyAddr represents the access physical address of the access instruction in the last execution; the InstType indicates the type of the access and storage instruction, the InstType belongs to { DirectInst, InDirectInst }, wherein DirectInst represents a direct access and storage instruction, and InDirectInst represents an indirect access and storage instruction; memlen indicates the memory access length of the memory access instruction;
the memory access history buffer is an array which takes a PC (personal computer) of a memory access instruction as an index, and each table entry in the memory access history buffer records the memory access physical address of the same memory access instruction for 12 times in the past;
and the structure of each entry item in the access address queue is as follows:
<valid,inst_line_addr,mem_phys_addr,memlen,inst_queue_index>,
wherein valid represents a valid bit; inst _ line _ addr represents the instruction linear address; mem _ phys _ addr represents the physical address of the memory access; memlen represents the memory access length; inst _ queue _ index represents an index of the fetch address queue;
the access address queue is provided with a submission pointer, a read pointer and a write pointer; when the main pipeline is in branch refreshing, the read pointer and the write pointer of the access address queue roll back to the position of the branch refreshing.
5. The instruction flow mixed mode learning based low-pollution cache prefetching system of claim 4, wherein the branch prediction module is configured to: if the fetch address queue is full, pausing the prediction process; whether the fetch address queue is full is judged according to the combination of the write pointer and the commit pointer of the fetch address queue.
6. The instruction flow mixed mode learning-based low-pollution cache prefetching system of claim 1, wherein in the memory access instruction recording module, the information of the memory access instruction comprises a PC of the memory access instruction, an instruction linear address, a memory access address, a type and a memory access length of the memory access instruction;
the memory access instruction recording module is configured, during a query, to perform hit checking against the instruction linear addresses of the memory access instructions in all entries of the memory access instruction buffer, so that the query attempts to locate every memory access instruction held in the memory access instruction buffer; during hit checking, a memory access instruction hits when the high-order tag of its instruction linear address equals the line linear address of the instruction block corresponding to the new entry, and its low-order offset is greater than or equal to the starting offset of the instruction block and less than or equal to the ending offset of the instruction block; on a hit, the position and type of the memory access instruction in the memory access instruction buffer are obtained.
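(A C++ sketch of that hit check under an assumed 64-byte instruction block, so the high-order tag is the address with the low 6 bits dropped and the low-order offset is those 6 bits; the real tag/offset split is whatever the design's block size implies.)

    #include <cstdint>

    // True when the buffered memory access instruction falls inside the
    // instruction block [begin_offset, end_offset] of the new queue entry.
    bool hit(uint64_t inst_line_addr,   // linear address of buffered instruction
             uint64_t block_line_addr,  // line linear address of the new entry
             uint8_t begin_offset, uint8_t end_offset) {
        uint64_t tag    = inst_line_addr >> 6;    // high-order tag
        uint64_t offset = inst_line_addr & 0x3F;  // low-order offset in block
        return tag == (block_line_addr >> 6) &&
               offset >= begin_offset && offset <= end_offset;
    }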
7. The instruction flow mixed mode learning-based low-pollution cache prefetching system according to claim 1, wherein the memory access pattern learning module comprises a stride predictor and a temporal correlation predictor, so that, on the basis of the memory access history buffer, the memory access pattern learning module learns and predicts stride patterns and temporal correlation patterns through the two predictors;
each entry of the stride predictor records one stride pattern, a stride pattern being a strided memory access sequence; each entry of the stride predictor comprises the tag of the strided memory access sequence, the last address, the last stride, a confidence counter, the current pattern, the first address, a maximum counter and a direction;
each entry of the temporal correlation predictor records one temporal correlation pattern, a temporal correlation pattern being a temporally correlated memory access sequence; each entry of the temporal correlation predictor comprises the temporal correlation pattern, the length of the temporally correlated sequence, and the temporally correlated sequence itself.
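(An illustrative C++ sketch of stride learning over the 12-entry per-PC history described above; only the constant-stride check is shown, and the tag, confidence counter, direction, and temporal-correlation matching that the entries also carry are omitted.)

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <optional>

    // Per-PC history: the past 12 memory access physical addresses of one
    // memory access instruction, oldest first (the memory access history buffer).
    using History = std::array<uint64_t, 12>;

    // If the history advances by one constant stride, predict the next address.
    std::optional<uint64_t> predict_stride(const History& h) {
        const int64_t stride = static_cast<int64_t>(h[1] - h[0]);
        for (size_t i = 2; i < h.size(); ++i)
            if (static_cast<int64_t>(h[i] - h[i - 1]) != stride)
                return std::nullopt;  // no stable stride pattern learned
        return h.back() + stride;     // predicted next physical address
    }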
8. The instruction flow mixed mode learning-based low-pollution cache prefetching system according to claim 1, wherein the prefetch request generation module determines the instruction blocks about to enter the main pipeline according to the read pointer of the fetch address queue;
the prefetch request generation module is configured to perform the steps of:
step B1: upon notification that the first-level data cache has received a new request, searching the fetch address queue to obtain the position of its read pointer, taking that position plus a prefetch lead as the prefetch start, treating the N consecutive prefetch instruction blocks beginning at the prefetch start as the instruction blocks about to enter the main pipeline, and generating corresponding first-level instruction cache prefetch requests for those instruction blocks;
step B2: receiving the pointers of all the prefetch instruction blocks fed back by the fetch address queue, using these pointers to look up the fetch-address-queue indexes recorded in the memory access address queue, obtaining the memory access instructions contained in every prefetch instruction block together with their memory access physical addresses, and generating first-level data cache prefetch requests from those physical addresses;
step B3: sending the first-level instruction cache prefetch requests and the first-level data cache prefetch requests to the cache system so as to obtain prefetch requests carrying the fetched data.
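(A C++ sketch of steps B1-B3, reusing the FetchAddrQueue and MemAddrQueue sketches above; the linear scan of the memory access address queue, the wrap-around arithmetic, and the request type flag are all simplifications.)

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct PrefetchReq {
        uint64_t phys_addr;
        bool     for_icache;  // true: L1 I-cache request, false: L1 D-cache
    };

    // B1: start at read pointer + prefetch lead and take N consecutive blocks
    // as "about to enter the main pipeline"; B2: collect the predicted access
    // addresses recorded against those blocks; B3: the caller sends the
    // returned requests to the cache system.
    std::vector<PrefetchReq> generate_prefetches(const FetchAddrQueue& faq,
                                                 const MemAddrQueue& maq,
                                                 size_t lead, size_t n) {
        std::vector<PrefetchReq> reqs;
        const size_t size = faq.entries.size();
        if (size == 0) return reqs;
        for (size_t i = 0; i < n; ++i) {
            size_t blk = (faq.read_ptr + lead + i) % size;
            if (!faq.entries[blk].valid) break;
            reqs.push_back({faq.entries[blk].phys_addr, true});   // I-cache
            for (const MemAddrEntry& m : maq.entries)             // D-cache
                if (m.valid && m.inst_queue_index == blk)
                    reqs.push_back({m.mem_phys_addr, false});
        }
        return reqs;
    }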
9. The instruction flow mixed mode learning-based low-pollution cache prefetching system according to claim 1, wherein the prefetch request write-back module is configured to: for the prefetch request at the head of the prefetch queue, search the memory access address queue for the set of uncommitted instructions located after the commit pointer and before the prefetch request, determine whether that uncommitted set contains other memory access instructions falling in the same set (group) of the first-level cache array as the prefetch request, and take the result as the instruction commit condition ahead of the prefetch request; if other memory access instructions in the same set as the prefetch request exist and their number is at least the number of ways of the cache array, the write-back of the prefetch request waits; otherwise the prefetch request is written back to the first-level cache immediately.
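(A C++ sketch of that write-back gate, reusing MemAddrQueue from above; the set-index computation assumes 64-byte lines and the given set count, which the claim does not specify.)

    #include <cstddef>
    #include <cstdint>

    // Count uncommitted accesses between the commit pointer and the prefetch
    // that map to the prefetch's L1 set; if they could fill every way of the
    // set, delay the write-back so the prefetch cannot evict their lines.
    bool writeback_must_wait(const MemAddrQueue& maq, size_t prefetch_pos,
                             uint64_t prefetch_addr,
                             size_t num_sets, size_t num_ways) {
        const size_t pf_set = (prefetch_addr / 64) % num_sets;
        size_t same_set = 0;
        for (size_t i = maq.commit_ptr; i != prefetch_pos;
             i = (i + 1) % maq.entries.size()) {
            const MemAddrEntry& m = maq.entries[i];
            if (m.valid && (m.mem_phys_addr / 64) % num_sets == pf_set)
                ++same_set;
        }
        return same_set >= num_ways;  // otherwise write back immediately
    }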
10. A low-pollution cache prefetching method based on instruction flow mixed mode learning is characterized by comprising the following steps:
step S1: predicting the instruction stream of the target program with the branch prediction module, using a predict-ahead technique, and writing the prediction results into the fetch address queue; the prediction process is suspended when the fetch address queue is full;
step S2: recording the information of committed memory access instructions with the memory access instruction recording module and writing that information into the memory access instruction buffer; when a new entry is written into the fetch address queue, querying the memory access instruction buffer with the starting address of the instruction block corresponding to that entry, so that a memory access instruction sequence is obtained from the query and output to the memory access pattern learning module and the memory access address queue;
step S3: with the memory access pattern learning module, recording the memory access instruction sequence in the memory access history buffer, learning the memory access pattern of each memory access instruction from the history information stored in the memory access history buffer, predicting the memory access physical address of each memory access instruction in the sequence according to the learned pattern, and writing the predicted physical addresses into the memory access address queue;
step S4: with the prefetch request generation module, searching the fetch address queue and the memory access address queue when the first-level data cache receives a new request, generating first-level instruction cache and first-level data cache prefetch requests, respectively, for the instruction blocks about to enter the main pipeline, and sending them to the cache system so as to obtain prefetch requests carrying the fetched data;
step S5: with the prefetch request write-back module, temporarily storing the prefetch requests carrying fetched data in the prefetch queue, so that the request at the head of the prefetch queue either waits or is written back to the first-level cache immediately, according to the instruction commit condition ahead of it.
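(Finally, a purely illustrative C++ driver tying steps S4-S5 to the sketches above; the event name and the flat control flow are assumptions, since the real modules operate concurrently in hardware.)

    #include <cstddef>
    #include <vector>

    // Glue for one L1 D-cache "new request" event (steps S4-S5); S1-S3 are
    // assumed to have filled faq and maq as sketched earlier.
    void on_l1d_new_request(const FetchAddrQueue& faq, const MemAddrQueue& maq,
                            std::vector<PrefetchReq>& prefetch_queue,
                            size_t lead, size_t n) {
        // S4: generate I-cache and D-cache prefetch requests for the blocks
        // about to enter the main pipeline and send them to the cache system.
        std::vector<PrefetchReq> reqs = generate_prefetches(faq, maq, lead, n);
        // S5: the returned (data-carrying) requests park in the prefetch
        // queue; the head waits or writes back per writeback_must_wait(...).
        prefetch_queue.insert(prefetch_queue.end(), reqs.begin(), reqs.end());
    }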
CN202111356734.XA 2021-11-16 2021-11-16 Low-pollution cache prefetching system and method based on instruction flow mixed mode learning Pending CN114579479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111356734.XA CN114579479A (en) 2021-11-16 2021-11-16 Low-pollution cache prefetching system and method based on instruction flow mixed mode learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111356734.XA CN114579479A (en) 2021-11-16 2021-11-16 Low-pollution cache prefetching system and method based on instruction flow mixed mode learning

Publications (1)

Publication Number Publication Date
CN114579479A true CN114579479A (en) 2022-06-03

Family

ID=81767901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111356734.XA Pending CN114579479A (en) 2021-11-16 2021-11-16 Low-pollution cache prefetching system and method based on instruction flow mixed mode learning

Country Status (1)

Country Link
CN (1) CN114579479A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023236355A1 (en) * 2022-06-10 2023-12-14 成都登临科技有限公司 Method for acquiring instruction in parallel by multiple thread groups, processor, and electronic device
CN116048627A (en) * 2023-03-31 2023-05-02 北京开源芯片研究院 Instruction buffering method, apparatus, processor, electronic device and readable storage medium
CN116719561A (en) * 2023-08-09 2023-09-08 芯砺智能科技(上海)有限公司 Conditional branch instruction processing system and method
CN116719561B (en) * 2023-08-09 2023-10-31 芯砺智能科技(上海)有限公司 Conditional branch instruction processing system and method
CN117472802A (en) * 2023-12-28 2024-01-30 北京微核芯科技有限公司 Cache access method, processor, electronic device and storage medium
CN117472802B (en) * 2023-12-28 2024-03-29 北京微核芯科技有限公司 Cache access method, processor, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN114579479A (en) Low-pollution cache prefetching system and method based on instruction flow mixed mode learning
CN113986774A (en) Cache replacement system and method based on instruction stream and memory access mode learning
Kim et al. Path confidence based lookahead prefetching
Shevgoor et al. Efficiently prefetching complex address patterns
US9684601B2 (en) Data processing apparatus having cache and translation lookaside buffer
CN105183663B (en) Pre-fetch unit and data prefetching method
US5694568A (en) Prefetch system applicable to complex memory access schemes
CN100517274C (en) Cache memory and control method thereof
US7558939B2 (en) Three-tiered translation lookaside buffer hierarchy in a multithreading microprocessor
US6560693B1 (en) Branch history guided instruction/data prefetching
US7099999B2 (en) Apparatus and method for pre-fetching data to cached memory using persistent historical page table data
US20030208665A1 (en) Reducing data speculation penalty with early cache hit/miss prediction
US8677049B2 (en) Region prefetcher and methods thereof
JPH1074166A (en) Multilevel dynamic set predicting method and its device
US8782374B2 (en) Method and apparatus for inclusion of TLB entries in a micro-op cache of a processor
US7047362B2 (en) Cache system and method for controlling the cache system comprising direct-mapped cache and fully-associative buffer
CN109461113B (en) Data structure-oriented graphics processor data prefetching method and device
US11301250B2 (en) Data prefetching auxiliary circuit, data prefetching method, and microprocessor
US6711651B1 (en) Method and apparatus for history-based movement of shared-data in coherent cache memories of a multiprocessor system using push prefetching
US20120210056A1 (en) Cache memory and control method thereof
CN101681258A (en) Associate cached branch information with the last granularity of branch instruction in variable length instruction set
US11249762B2 (en) Apparatus and method for handling incorrect branch direction predictions
US8433850B2 (en) Method and apparatus for pipeline inclusion and instruction restarts in a micro-op cache of a processor
US11847053B2 (en) Apparatuses, methods, and systems for a duplication resistant on-die irregular data prefetcher
CN115934170A (en) Prefetching method and device, prefetching training method and device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination