CN114528024A - Instruction fetch pipeline for a storage-computation fusion processor - Google Patents
Instruction fetch pipeline for a storage-computation fusion processor
- Publication number
- CN114528024A (application CN202210158362.8A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- fifo
- pipeline
- branch
- level
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3842—Speculative instruction execution
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an instruction fetch pipeline for a storage-computation fusion processor. To maximize the processor's operating performance, a larger instruction memory iMEM together with a cache iCache preferably replaces the iCache and related multi-level caches of a conventional instruction fetch module, reducing the cache miss rate and improving fetch performance. Larger storage, however, means larger latency; the invention therefore designs six-stage pipelined instruction fetch logic to replace the original four-stage logic, shortening the critical path. Since the system clock frequency is determined by the critical path length, the clock frequency rises and system execution efficiency improves substantially. Although the number of fetch pipeline stages increases, fetch performance does not degrade, thanks to the branch predictor and the instruction buffer FIFO.
Description
Technical Field
The invention relates to the technical field of pipelined instruction fetching, and in particular to an instruction fetch pipeline for a storage-computation fusion processor.
Background
As processor and memory fabrication processes continue to improve, the memory-wall gap under the von Neumann architecture keeps widening, and memory-access power consumption becomes an increasingly prominent problem, so the focus of industry and academia has begun to shift from computation to storage. Meanwhile, the rapid growth of memory-intensive, highly parallel, low-precision applications such as artificial intelligence and brain-inspired computing is driving the rapid development of computational storage / storage-computation integration / in-memory computing, which in turn inevitably demands greater data throughput.
The Instruction Fetch (IF) stage is the process of fetching an instruction from main memory into the instruction register. To maximize the advantages of storage-computation integration / in-memory computing, fetch logic matching the required throughput must be designed. A conventional instruction fetch module uses an iCache to cache instruction machine code, but the iCache capacity is usually small, so the cache miss probability is high and the miss penalty is usually large. To maximize the performance of the storage-computation fusion processor, it is preferable to adopt a larger instruction memory iMEM (SRAM) and a cache iCache (SRAM) to replace the iCache and its related multi-level caches in a conventional fetch module, reducing the miss rate and improving fetch performance; larger storage, however, brings larger latency and reduces system execution efficiency.
Disclosure of Invention
Aiming at the latency problem of the existing instruction fetch pipeline in a storage-computation fusion processor that adopts a larger instruction memory and cache, the invention provides an instruction fetch pipeline for the storage-computation fusion processor that effectively improves program execution efficiency.
An instruction fetch pipeline for a storage-computation fusion processor comprises six-stage pipelined instruction fetch logic, stages F1-F6, wherein:
the F1 pipeline stage generates the correct fetch address according to the execution characteristics of the application program;
the F2 pipeline stage sends the read address and related control signals to the iCache/iMEM, the related control signals including but not limited to a read enable signal;
the F3 pipeline stage registers the read data returned by the iCache/iMEM when the pipeline stalls, and sends the registered read data to the F4 pipeline stage when the stall is released;
the F4 pipeline stage judges whether the iCache/iMEM hits; on an iCache miss it generates pipeline stall signals for F1, F2 and F3 and starts the AXI Master to read data from external memory and update the iCache; according to the hit result it selects the instruction packet returned by reading the iCache, the iMEM or the AXI Master, performs branch prediction on the returned instruction packet, and sends the prediction result to the F5 pipeline stage, where the branch prediction result takes effect;
the F5 pipeline stage buffers the instruction packet in the instruction buffer FIFO and sends the predicted branch target address to the F1 pipeline stage;
the F6 pipeline stage converts the instruction packet into an execution packet according to the contents of the instruction buffer FIFO and the instruction-line end marker, and sends the execution packet to the execution pipeline.
Furthermore, the F1 pipeline stage includes a fetch address generation circuit consisting mainly of a multiplexer and a register; the fetch-address sources of the multiplexer include but are not limited to a reset signal, the interrupt controller output, branch jumps, branch prediction and sequential execution, where the interrupt controller output covers interrupt entry, interrupt return and interrupt cancellation.
Furthermore, the F4 pipeline stage contains a branch predictor that adopts a local two-bit prediction scheme; after power-on reset, every entry of the branch prediction table is set to 2'b10; when predicting a branch, the table is indexed with PC[9:0] of the instruction line containing the conditional branch instruction; if prediction[PC[9:0]][1] == 1'b1 the branch is predicted taken, otherwise not taken; if the prediction is taken, pipeline flush signals for F1, F2, F3 and F4 are generated in the F5 pipeline stage;
the contents of the branch prediction table are updated according to the final resolution of the conditional branch; the branch predictor predicts all 8 slots of an instruction packet; when a branch instruction line straddles packets, that is, part of the instructions of the branch line are still in the F3-stage packet, branch prediction is deferred until the remaining packet reaches F4 from F3, and the F4 stage is not flushed when that prediction result reaches F5.
Further, the F5 pipeline stage includes an instruction buffer FIFO with a 6-bit read pointer fifo_rd and a 3-bit write pointer fifo_wr; fifo_rd[4:3] serves as the FIFO line read pointer, fifo_rd[2:0] as the in-line offset pointer, and fifo_wr[1:0] as the FIFO line write pointer;
the empty and full states of the FIFO are pre-judged from fifo_rd and fifo_wr: when fifo_wr equals fifo_rd[5:3], the FIFO is pre-judged empty; when fifo_wr equals {~fifo_rd[5], fifo_rd[4:3]} and the F4-stage instruction packet is valid, the FIFO is pre-judged full; when the FIFO is pre-judged full, pipeline stall signals for F1, F2, F3 and F4 are generated;
when an instruction packet is all zeros, the FIFO emits the all-zero packet directly and the read pointer is incremented by 8.
The six-stage pipelined instruction fetch logic designed by the invention replaces the original four-stage logic and shortens the critical path; since the system clock frequency is determined by the critical path length, the clock frequency rises and system execution efficiency improves substantially. Although the number of fetch pipeline stages increases, fetch performance does not degrade, thanks to the branch predictor and the instruction buffer FIFO.
Drawings
FIG. 1 shows the instruction fetch pipeline of the storage-computation fusion processor architecture;
FIG. 2 is a fetch address generation circuit;
FIG. 3 is a state diagram for a two-bit branch prediction scheme;
FIG. 4 is an instruction buffer FIFO structure.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. The embodiments are presented for purposes of illustration and description and are not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention through various embodiments with various modifications suited to the particular use contemplated.
Example 1
An instruction fetch pipeline for a storage-computation fusion processor, as shown in FIG. 1, includes six stages of pipelined instruction fetch logic, F1-F6, described below.
1. F1 stage pipeline unit
The F1 pipeline stage includes the fetch address generation circuit, which generates the correct fetch address according to the execution characteristics of the application program. There are three instruction sources: the iCache, the iMEM and the BIU; the BIU is the communication interface between the core and external devices, through which data in external memory is read.
The fetch address generation circuit consists mainly of a multiplexer and a register, as shown in FIG. 2. The fetch-address sources of the multiplexer include but are not limited to a reset signal, the interrupt controller output, branch jumps, branch prediction and sequential execution, where the interrupt controller output covers interrupt entry, interrupt return and interrupt cancellation.
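As a rough illustration of the priority selection this multiplexer performs, the following Python sketch models the next-PC choice. All names, the reset vector and the 32-byte sequential step are assumptions for illustration, not details taken from the patent:

```python
def next_fetch_pc(pc, *, reset=False, irq_target=None,
                  branch_target=None, predicted_target=None):
    """Priority mux for the F1 fetch address: reset > interrupt controller
    > resolved branch jump > branch prediction > sequential execution."""
    if reset:
        return 0x0000_0000            # assumed reset vector
    if irq_target is not None:        # interrupt / interrupt return / cancel
        return irq_target
    if branch_target is not None:     # resolved branch jump
        return branch_target
    if predicted_target is not None:  # predicted-taken target from F5
        return predicted_target
    return pc + 32                    # sequential: next instruction line
```

Reset outranks the interrupt controller, which outranks a resolved branch, which outranks a prediction; only when no redirect is pending does fetch continue sequentially to the next instruction line.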
2. F2 stage pipeline unit
The F2 pipeline stage sends the read address of the iCache/iMEM along with associated control signals, including but not limited to a read enable signal.
3. F3 stage pipeline unit
When the pipeline is stalled, the read data returned by the iCache/iMEM is registered; when the stall is released, the registered read data is sent to the F4 pipeline stage. Stall and resume are determined from the stall signals of the pipeline stages: when the stall signal stall is 1 the pipeline stalls, and when it is 0 the pipeline advances.
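The register-on-stall behaviour of F3 can be sketched in software as a one-entry skid buffer. This minimal Python model (interface names hypothetical) captures the returned read data while stall is high and releases it to F4 once the stall clears:

```python
class F3SkidBuffer:
    """One-entry skid buffer modelling the F3 stage: hold the iCache/iMEM
    read data that arrives while the pipeline is stalled, and hand it to
    F4 when the stall is released."""

    def __init__(self):
        self.saved = None

    def step(self, read_data, stall):
        if stall:
            if self.saved is None:
                self.saved = read_data   # register the in-flight data once
            return None                  # nothing advances to F4 this cycle
        if self.saved is not None:
            out, self.saved = self.saved, None
            return out                   # drain the registered data first
        return read_data                 # normal flow-through
```

The buffer is needed because the iCache/iMEM read launched in F2 returns its data regardless of the stall; without registering it in F3, that data would be lost while F4 is blocked.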
4. F4 stage pipeline unit
Judging whether the iCache/iMEM hits: if the fetch address is in the iMEM address range, the iMEM hits, and the instruction packet returned by reading the iMEM is selected; if the fetch address is not in the iMEM address range, the iCache is accessed first to judge whether the corresponding instruction is present. If so, the iCache hits and the instruction packet returned by reading the iCache is selected; if not, the iCache misses.
When the iCache misses, pipeline stall signals for F1, F2 and F3 are generated (i.e. stall is set), the AXI Master is started to read data from the external memory and update the iCache, and the instruction packet returned by the AXI Master read is selected.
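The hit judgment above amounts to routing each fetch address to one of three sources. A hedged Python sketch, with an assumed iMEM address window and 32-byte line size (neither constant is given in the patent):

```python
# Assumed address map for illustration: a 64 KiB iMEM window at address 0
# and 32-byte instruction lines.
IMEM_BASE, IMEM_SIZE = 0x0000_0000, 64 * 1024

def select_fetch_source(addr, icache_lines):
    """F4 hit judgment: iMEM if the fetch address is in its range, else the
    iCache on a tag hit, else the AXI Master refill path."""
    if IMEM_BASE <= addr < IMEM_BASE + IMEM_SIZE:
        return "iMEM"                    # iMEM hit
    if addr >> 5 in icache_lines:        # line-granular tag check (assumed)
        return "iCache"                  # iCache hit
    return "AXI"                         # miss: stall F1-F3 and refill
```

The miss case is the only one that stalls the earlier stages, which is why keeping the iMEM large enough to cover hot code directly pays off in this design.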
The F4 pipeline stage contains a branch predictor (using a local two-bit branch prediction scheme), performs branch prediction on the returned instruction packet, and sends the prediction result to the F5 pipeline stage.
After power-on reset, every entry of the branch prediction table is set to 2'b10. When predicting a branch, the predictor table is indexed with PC[9:0] of the instruction line containing the conditional branch instruction; PC[9:0] is the address used to index the branch prediction table, and its bit width varies with the table size. If prediction[PC[9:0]][1] == 1'b1, the branch is predicted taken; otherwise it is predicted not taken. If the prediction is taken, a pipeline flush signal flush for F1, F2, F3 and F4 is generated in the F5 pipeline stage to flush the earlier stages. The flush signal stays high for one clk cycle, during which intermediate results in the pipeline are reset to their initial values (including clearing stall).
The state diagram of the two-bit branch prediction scheme is shown in FIG. 3, with the stored prediction values corresponding to the state values in the diagram. The contents of the branch prediction table are updated according to the final resolution of the conditional branch. The branch predictor predicts all 8 slots of an instruction packet; when a branch instruction line straddles packets, that is, part of the instructions of the branch line are still in the F3-stage packet, branch prediction is deferred until the remaining packet reaches F4 from F3, and the F4 stage is not flushed when that prediction result reaches F5.
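The local two-bit scheme is a table of saturating counters. The following small Python model is consistent with the description (reset value 2'b10, indexed by PC[9:0], counter bit 1 giving the taken/not-taken decision); the 1024-entry size is an assumption implied by the 10-bit index:

```python
class TwoBitPredictor:
    """Local two-bit saturating-counter branch predictor: every counter
    resets to 0b10 (weakly taken) and the table is indexed by PC[9:0]."""

    def __init__(self):
        self.table = [0b10] * 1024      # 2'b10 after power-on reset

    def predict(self, pc):
        # bit 1 of the counter gives the taken / not-taken decision
        return (self.table[pc & 0x3FF] >> 1) & 1 == 1

    def update(self, pc, taken):
        # saturate at 0b11 (strongly taken) / 0b00 (strongly not taken)
        i = pc & 0x3FF
        if taken:
            self.table[i] = min(self.table[i] + 1, 0b11)
        else:
            self.table[i] = max(self.table[i] - 1, 0b00)
```

Out of reset every entry sits in the weakly-taken state, so the first prediction for any line is "taken", matching the 2'b10 initialization in the text; a single mispredict then flips the decision, while two confirmations saturate it.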
5. F5 stage pipeline unit
The F5 pipeline stage includes an instruction buffer FIFO, through which instruction packets are buffered while the predicted branch target address is provided to the F1 pipeline stage.
The instruction buffer FIFO has a 6-bit read pointer fifo_rd and a 3-bit write pointer fifo_wr; fifo_rd[4:3] serves as the FIFO line read pointer, fifo_rd[2:0] as the in-line offset pointer, and fifo_wr[1:0] as the FIFO line write pointer, as shown in FIG. 4.
The empty and full states of the FIFO are pre-judged from fifo_rd and fifo_wr: when fifo_wr equals fifo_rd[5:3], the FIFO is pre-judged empty; when fifo_wr equals {~fifo_rd[5], fifo_rd[4:3]} and the F4-stage instruction packet is valid, the FIFO is pre-judged full.
When the FIFO is pre-judged full, pipeline stall signals for F1, F2, F3 and F4 are generated; when the full condition is released, those stall signals are cleared. When an instruction packet is all zeros, the FIFO emits the all-zero packet directly and the read pointer is incremented by 8 (one FIFO line corresponds to 8 instruction slots).
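A pointer-level Python model of this FIFO follows, assuming 4 lines of 8 slots so that the stated pointer widths leave one wrap bit each; the empty/full equations below are one consistent reading of the description, not a verbatim transcription of the patent's RTL:

```python
class InstrFIFO:
    """Pointer model of the instruction buffer FIFO: assumed 4 lines of
    8 slots. fifo_rd is 6 bits (wrap bit, 2-bit line, 3-bit offset);
    fifo_wr is 3 bits (wrap bit, 2-bit line)."""

    def __init__(self):
        self.rd = 0      # 6-bit read pointer
        self.wr = 0      # 3-bit write pointer

    def empty(self):
        # empty when the write pointer equals the line part of the
        # read pointer, including the wrap bit: fifo_wr == fifo_rd[5:3]
        return self.wr == (self.rd >> 3)

    def full(self, f4_valid=True):
        # full when fifo_wr == {~fifo_rd[5], fifo_rd[4:3]} and the
        # F4-stage instruction packet is valid
        wrap, line = (self.rd >> 5) & 1, (self.rd >> 3) & 0b11
        return f4_valid and self.wr == (((wrap ^ 1) << 2) | line)

    def push_line(self):
        self.wr = (self.wr + 1) & 0b111          # write one 8-slot line

    def pop_slot(self, n=1):
        self.rd = (self.rd + n) & 0b111111       # all-zero packet: n = 8
```

The extra wrap bit on each pointer is the classic trick that lets the same line indices distinguish "empty" (pointers fully equal) from "full" (equal except for inverted wrap bits).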
6. F6 stage pipeline unit
The F6 pipeline stage converts the instruction packet into an execution packet according to the contents of the instruction buffer FIFO and the instruction-line end marker, and sends the execution packet to the execution pipeline.
It is to be understood that the described embodiments are merely some embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the relevant arts based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Claims (4)
1. An instruction fetch pipeline for a storage-computation fusion processor, comprising six-stage pipelined instruction fetch logic of stages F1-F6, wherein:
the F1 pipeline stage generates the correct fetch address according to the execution characteristics of the application program;
the F2 pipeline stage sends the read address and related control signals to the iCache/iMEM, the related control signals including but not limited to a read enable signal;
the F3 pipeline stage registers the read data returned by the iCache/iMEM when the pipeline stalls, and sends the registered read data to the F4 pipeline stage when the stall is released;
the F4 pipeline stage judges whether the iCache/iMEM hits, generates pipeline stall signals for F1, F2 and F3 on an iCache miss, and starts the AXI Master to read data from external memory and update the iCache; according to the hit result it selects and receives the instruction packet returned by reading the iCache, the iMEM or the AXI Master, performs branch prediction on the returned instruction packet, and sends the prediction result to the F5 pipeline stage;
the F5 pipeline stage buffers the instruction packet in the instruction buffer FIFO and sends the predicted branch target address to the F1 pipeline stage;
the F6 pipeline stage converts the instruction packet into an execution packet according to the contents of the instruction buffer FIFO and the instruction-line end marker, and sends the execution packet to the execution pipeline.
2. The instruction fetch pipeline of claim 1, wherein the F1 pipeline stage comprises a fetch address generation circuit consisting mainly of a multiplexer and a register, the fetch-address sources of the multiplexer including but not limited to a reset signal, the interrupt controller output, branch jumps, branch prediction and sequential execution, wherein the interrupt controller output covers interrupt entry, interrupt return and interrupt cancellation.
3. The instruction fetch pipeline for a storage-computation fusion processor of claim 1, wherein the F4 pipeline stage contains a branch predictor that adopts a local two-bit prediction scheme;
after power-on reset, every entry of the branch prediction table is set to 2'b10; when predicting a branch, the table is indexed with PC[9:0] of the instruction line containing the conditional branch instruction; if prediction[PC[9:0]][1] == 1'b1 the branch is predicted taken, otherwise not taken; if the prediction is taken, pipeline flush signals for F1, F2, F3 and F4 are generated in the F5 pipeline stage;
the contents of the branch prediction table are updated according to the final resolution of the conditional branch; the branch predictor predicts all 8 slots of an instruction packet; when a branch instruction line straddles packets, that is, part of the instructions of the branch line are still in the F3-stage packet, branch prediction is deferred until the remaining packet reaches F4 from F3, and the F4 stage is not flushed when that prediction result reaches F5.
4. The instruction fetch pipeline for a storage-computation fusion processor of claim 1, wherein the F5 pipeline stage comprises an instruction buffer FIFO having a 6-bit read pointer fifo_rd and a 3-bit write pointer fifo_wr, fifo_rd[4:3] serving as the FIFO line read pointer, fifo_rd[2:0] as the in-line offset pointer, and fifo_wr[1:0] as the FIFO line write pointer;
the empty and full states of the FIFO are pre-judged from fifo_rd and fifo_wr: when fifo_wr equals fifo_rd[5:3], the FIFO is pre-judged empty; when fifo_wr equals {~fifo_rd[5], fifo_rd[4:3]} and the F4-stage instruction packet is valid, the FIFO is pre-judged full; when the FIFO is pre-judged full, pipeline stall signals for F1, F2, F3 and F4 are generated;
when an instruction packet is all zeros, the FIFO emits the all-zero packet directly and the read pointer is incremented by 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210158362.8A CN114528024A (en) | 2022-02-21 | 2022-02-21 | Instruction fetching assembly line for storage and calculation fusion processor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114528024A true CN114528024A (en) | 2022-05-24 |
Family
ID=81625105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210158362.8A Pending CN114528024A (en) | 2022-02-21 | 2022-02-21 | Instruction fetching assembly line for storage and calculation fusion processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114528024A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN116719561A (en) * | 2023-08-09 | 2023-09-08 | 芯砺智能科技(上海)有限公司 | Conditional branch instruction processing system and method
CN116719561B (en) * | 2023-08-09 | 2023-10-31 | 芯砺智能科技(上海)有限公司 | Conditional branch instruction processing system and method
CN118069224A (en) * | 2024-04-19 | 2024-05-24 | 芯来智融半导体科技(上海)有限公司 | Address generation method, address generation device, computer equipment and storage medium
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |