CN114528024A - Instruction fetch pipeline for a memory-compute fusion processor - Google Patents


Info

Publication number
CN114528024A
Authority
CN
China
Prior art keywords
instruction
fifo
pipeline
branch
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210158362.8A
Other languages
Chinese (zh)
Inventor
王媛
胡孔阳
李泉泉
刘玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Core Century Technology Co ltd
Original Assignee
Anhui Core Century Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Core Century Technology Co ltd filed Critical Anhui Core Century Technology Co ltd
Priority to CN202210158362.8A priority Critical patent/CN114528024A/en
Publication of CN114528024A publication Critical patent/CN114528024A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3867 Concurrent instruction execution using instruction pipelines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F 9/3842 Speculative instruction execution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses an instruction fetch pipeline for a memory-compute fusion processor. To maximize the processor's operational performance, a larger instruction memory iMEM and a cache iCache replace the iCache and associated multi-level caches of a conventional fetch module, reducing the cache miss rate and improving fetch performance. However, larger storage means larger access latency. The invention therefore designs six-stage pipelined fetch logic to replace the original four-stage pipelined fetch logic, shortening the critical path; since the system clock frequency is determined by the critical path length, the clock frequency rises and system execution efficiency improves substantially. Although the number of fetch pipeline stages increases, fetch performance does not degrade, thanks to the branch predictor and the instruction buffer FIFO.

Description

Instruction fetch pipeline for a memory-compute fusion processor
Technical Field
The invention relates to the technical field of pipelined instruction fetch, and in particular to an instruction fetch pipeline for a memory-compute fusion processor.
Background
As processor and memory process technologies improve at different rates, the "memory wall" gap under the von Neumann architecture keeps widening, memory-access power consumption becomes an increasingly prominent problem, and the focus of industry and academia has begun to shift from computation to storage. Meanwhile, the rapid growth of memory-intensive, highly parallel, low-precision applications such as artificial intelligence and brain-inspired computing drives the rapid development of computational storage / compute-in-memory / in-memory computing, which in turn inevitably demands higher data throughput.
The Instruction Fetch (IF) stage fetches an instruction from main memory into the instruction register. To fully exploit the advantages of compute-in-memory / in-memory computing, fetch logic matching this throughput must be designed. Conventional fetch modules cache instruction machine code in an iCache, but the iCache capacity is usually small, so cache misses are frequent, and the miss penalty is usually high. To maximize the operational performance of the memory-compute fusion processor, it is preferable to adopt a larger instruction memory iMEM (SRAM) and a cache iCache (SRAM) in place of the iCache and its associated multi-level caches in the conventional fetch module, reducing the cache miss rate and improving fetch performance; however, larger storage brings larger latency and reduces system execution efficiency.
Disclosure of Invention
To address the latency problem of the existing fetch pipeline when applied to a memory-compute fusion processor that adopts a larger instruction memory and cache, the invention provides a fetch pipeline for the memory-compute fusion processor that effectively improves program execution efficiency.
An instruction fetch pipeline for a memory-compute fusion processor comprises six-stage pipelined instruction fetch logic, stages F1-F6:
the F1 stage pipeline unit generates the correct fetch address according to the execution characteristics of the application program;
the F2 stage pipeline unit sends the read address and related control signals to the iCache/iMEM, where the related control signals include, but are not limited to, a read enable signal;
the F3 stage pipeline unit registers the read data returned by the iCache/iMEM when the pipeline stalls, and sends the registered read data to the F4 stage pipeline unit when the stall is released;
the F4 stage pipeline unit judges whether the iCache/iMEM hits; on an iCache miss it generates pipeline stall signals for F1, F2 and F3 and starts the AXI Master to read external memory and update the iCache; according to the hit judgment it selects and receives the instruction packet returned by reading the iCache, the iMEM or the AXI Master, performs branch prediction on the returned packet, and sends the prediction result to the F5 stage pipeline unit, where the branch prediction result takes effect;
the F5 stage pipeline unit buffers the instruction packet in the instruction buffer FIFO and sends the branch prediction target address to the F1 stage pipeline unit;
the F6 stage pipeline unit converts the instruction packet into an execution packet according to the contents of the instruction buffer FIFO and the instruction-line end mark, and sends the execution packet to the execution pipeline.
Furthermore, the F1 stage pipeline unit includes a fetch address generation circuit composed mainly of a multiplexer and a register; the fetch sources of the multiplexer include, but are not limited to, a reset signal, the interrupt controller output, branch jump, branch prediction, and sequential execution, where the interrupt controller output covers interrupt entry, interrupt return, and interrupt cancellation.
Furthermore, the F4 stage pipeline unit contains a branch predictor that adopts a local two-bit prediction scheme. After power-on reset, all entries of the branch prediction table are set to 2'b10. When predicting a branch, the predictor is indexed with PC[9:0] of the instruction line containing the conditional branch instruction; if prediction[PC[9:0]][1] == 1'b1, the branch is predicted taken, otherwise not taken. If the prediction is taken, pipeline flush signals for F1, F2, F3 and F4 are generated in the F5 stage pipeline unit.
The table contents are updated with the final outcome of the conditional branch. The branch predictor predicts over the 8 slots of an instruction packet. When a branch instruction line crosses a packet boundary, i.e. part of the branch instruction line still lies in the F3 stage instruction packet, branch prediction cannot be performed until the remaining packet reaches F4 from F3; likewise, when that prediction result reaches F5, the F4 stage pipeline is not flushed.
Further, the F5 stage pipeline unit includes an instruction buffer FIFO with a 6-bit read pointer fifo_rd and a 3-bit write pointer fifo_wr; fifo_rd[4:3] serves as the FIFO line read pointer, fifo_rd[2:0] as the intra-line offset pointer, and fifo_wr[1:0] as the FIFO line write pointer;
the empty and full states of the FIFO are pre-judged from fifo_rd and fifo_wr: when fifo_wr equals fifo_rd[5:3], the FIFO is pre-judged empty; when fifo_wr equals {~fifo_rd[5], fifo_rd[4:3]} and the F4 stage instruction packet is valid, the FIFO is pre-judged full, and pipeline stall signals for F1, F2, F3 and F4 are generated;
when the instruction packet is all zeros, the FIFO emits the all-zero packet directly and the FIFO read pointer is incremented by 8.
The six-stage pipelined fetch logic of the invention replaces the original four-stage pipelined fetch logic and shortens the critical path; since the system clock frequency is determined by the critical path length, the clock frequency rises and system execution efficiency improves substantially. Although the number of fetch pipeline stages increases, fetch performance does not degrade, thanks to the branch predictor and the instruction buffer FIFO.
Drawings
FIG. 1 is the instruction fetch pipeline of the memory-compute fusion processor architecture;
FIG. 2 is a fetch address generation circuit;
FIG. 3 is a state diagram for a two-bit branch prediction scheme;
FIG. 4 is an instruction buffer FIFO structure.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. The embodiments of the present invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Example 1
An instruction fetch pipeline for a memory-compute fusion processor, as shown in FIG. 1, includes six stages of pipelined instruction fetch logic F1-F6, which are described below.
1. F1 stage pipeline unit
The F1 stage pipeline unit includes the fetch address generation circuit, which generates the correct fetch address according to the execution characteristics of the application program. There are three instruction sources: the iCache, the iMEM, and the BIU; the BIU is the communication interface between the core and external devices, and data in external memory is read through the BIU.
The fetch address generation circuit is composed mainly of a multiplexer and a register, as shown in FIG. 2. The fetch sources of the multiplexer include, but are not limited to, a reset signal, the interrupt controller output, branch jump, branch prediction, and sequential execution; the interrupt controller output covers interrupt entry, interrupt return, and interrupt cancellation.
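As a rough software model, the priority selection among these fetch sources might look as follows. This is a minimal Python sketch: the function name, packet size, reset vector, and the exact priority ordering beyond "reset first, sequential last" are illustrative assumptions, not specified by the patent.

```python
def next_fetch_pc(reset, interrupt_target, branch_target, predicted_target, pc,
                  packet_bytes=32, reset_vector=0x0000_0000):
    """Model of the F1 fetch-address multiplexer.

    Priority (highest first) is an assumption consistent with the text:
    reset > interrupt controller > branch jump > branch prediction > sequential.
    """
    if reset:
        return reset_vector            # reset signal selects the reset vector
    if interrupt_target is not None:
        return interrupt_target        # interrupt entry / return / cancel
    if branch_target is not None:
        return branch_target           # resolved branch jump
    if predicted_target is not None:
        return predicted_target        # branch prediction target from F5
    return pc + packet_bytes           # sequential execution: next instruction packet
```

For example, with no redirect pending, `next_fetch_pc(False, None, None, None, 0x100)` simply advances to the next packet address.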
2. F2 stage pipeline unit
The read address of the iCache/iMEM is sent along with associated control signals including, but not limited to, a read enable signal.
3. F3 stage pipeline unit
When the pipeline is stalled, the read data returned by the iCache/iMEM is registered; when the stall is released, the registered read data is sent to the F4 stage pipeline unit. Stall and resume are determined by the stall signals of the individual pipeline stages: the pipeline stalls when the stall signal is 1 and flows when it is 0.
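The register-and-forward behavior of this stage can be sketched as a small Python model. The class and signal names are illustrative; the patent specifies behavior, not an interface.

```python
class F3StageModel:
    """Minimal model of the F3 stage: hold iCache/iMEM read data while the
    pipeline is stalled, forward it to F4 once the stall clears."""
    def __init__(self):
        self.held = None                    # registered read data during a stall

    def cycle(self, stall, read_data):
        if stall:
            # stall == 1: register any returned data, emit nothing to F4
            if read_data is not None:
                self.held = read_data
            return None
        # stall == 0: pipeline flows; release held data first, else pass through
        if self.held is not None:
            out, self.held = self.held, None
            return out
        return read_data
```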
4. F4 stage pipeline unit
Judge whether the iCache/iMEM hits: if the fetch address lies in the iMEM address range, the iMEM hits and the instruction packet returned by reading the iMEM is selected; if the fetch address is outside the iMEM range, the iCache is accessed first to check whether the corresponding instruction is present; if so, the iCache hits and the packet returned by reading the iCache is selected; otherwise the iCache misses.
On an iCache miss, pipeline stall signals for F1, F2 and F3 are generated (i.e. stall is set), the AXI Master is started to read external memory and update the iCache, and the instruction packet returned by the AXI Master read is selected.
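The hit judgment and source selection can be modeled as follows. This is a Python sketch: the iMEM address range and all names are assumptions for illustration only, since the patent does not fix an address map.

```python
IMEM_BASE, IMEM_SIZE = 0x0000_0000, 64 * 1024   # assumed iMEM address range

def select_instruction_source(fetch_addr, icache_has):
    """F4 hit judgment as described: check the iMEM address range first,
    then look up the iCache; on a miss, fall back to an AXI Master fill."""
    if IMEM_BASE <= fetch_addr < IMEM_BASE + IMEM_SIZE:
        return "iMEM"      # fetch address inside the iMEM range: iMEM hit
    if icache_has(fetch_addr):
        return "iCache"    # iCache hit: take the iCache's packet
    # iCache miss: stall F1-F3, start the AXI Master refill, take its packet
    return "AXI"
```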
The F4 stage pipeline unit contains a branch predictor (using a local two-bit branch prediction scheme); it performs branch prediction on the returned instruction packet and sends the prediction result to the F5 stage pipeline unit.
After power-on reset, all entries of the branch prediction table are set to 2'b10. When predicting a branch, the predictor table is indexed with PC[9:0] of the instruction line containing the conditional branch instruction; PC[9:0] is the address used to index the branch prediction table, and its bit width varies with the size of the table. If prediction[PC[9:0]][1] == 1'b1, the branch is predicted taken; otherwise not taken. If the prediction is taken, a pipeline flush signal for F1, F2, F3 and F4 is generated in the F5 stage pipeline unit to flush the upstream stages. The flush signal is high for one clk cycle, during which intermediate results in the pipeline are reset to their initial values (including clearing stall).
The state diagram of the two-bit branch prediction scheme is shown in FIG. 3; the stored predictor values correspond to the state values in that diagram. The table contents are updated with the final outcome of the conditional branch. The branch predictor predicts over the 8 slots of an instruction packet. When a branch instruction line crosses a packet boundary, i.e. part of the branch instruction line still lies in the F3 stage instruction packet, branch prediction cannot be performed until the remaining packet reaches F4 from F3; likewise, when that prediction result reaches F5, the F4 stage pipeline is not flushed.
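The local two-bit scheme described here is the classic saturating-counter predictor. A minimal Python model follows; the table size (1024 entries, matching a 10-bit index) comes from the PC[9:0] indexing in the text, while the class and method names are illustrative.

```python
class TwoBitPredictor:
    """Local two-bit saturating-counter branch predictor: a table indexed
    by PC[9:0], entries reset to 2'b10 ('weakly taken'); bit [1] of an
    entry gives the prediction; the counter is updated with the outcome."""
    def __init__(self, entries=1024):
        self.table = [0b10] * entries       # power-on reset value 2'b10

    def predict(self, pc):
        # index with the instruction-line PC[9:0]; bit [1] == 1 -> taken
        return ((self.table[pc & 0x3FF] >> 1) & 1) == 1

    def update(self, pc, taken):
        # saturate between 2'b00 (strong not-taken) and 2'b11 (strong taken)
        i = pc & 0x3FF
        c = self.table[i]
        self.table[i] = min(c + 1, 0b11) if taken else max(c - 1, 0b00)
```

Starting from 2'b10, a single not-taken outcome flips the prediction to not-taken, and one taken outcome flips it back, matching the hysteresis of the FIG. 3 state diagram.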
5. F5 stage pipeline unit
The F5 stage pipeline unit includes an instruction buffer FIFO; instruction packets are buffered through the FIFO while the branch prediction target address is provided to the F1 stage pipeline unit.
As shown in FIG. 4, the instruction buffer FIFO has a 6-bit read pointer fifo_rd and a 3-bit write pointer fifo_wr; fifo_rd[4:3] serves as the FIFO line read pointer, fifo_rd[2:0] as the intra-line offset pointer, and fifo_wr[1:0] as the FIFO line write pointer.
The empty and full states of the FIFO are pre-judged from fifo_rd and fifo_wr: when fifo_wr equals fifo_rd[5:3], the FIFO is pre-judged empty; when fifo_wr equals {~fifo_rd[5], fifo_rd[4:3]} and the F4 stage instruction packet is valid, the FIFO is pre-judged full.
When the FIFO is pre-judged full, pipeline stall signals for F1, F2, F3 and F4 are generated; when the full condition is released, those stall signals are cleared. When the instruction packet is all zeros, the FIFO emits the all-zero packet directly and the FIFO read pointer is incremented by 8 (one FIFO line corresponds to 8 instruction slots).
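The pointer-based empty/full pre-judgment can be checked with a small Python model. The bit manipulations follow one reading of the pre-judgment conditions described above (inverted-MSB wrap detection, a common FIFO idiom) and should be treated as an interpretation of the text, not a verified RTL extract.

```python
def fifo_empty(fifo_rd, fifo_wr):
    """Pre-judge empty: the 3-bit write pointer equals the line part
    (bits [5:3]) of the 6-bit read pointer."""
    return fifo_wr == (fifo_rd >> 3) & 0b111

def fifo_full(fifo_rd, fifo_wr, f4_packet_valid):
    """Pre-judge full: the write pointer equals the read pointer's line
    bits with the MSB inverted, i.e. {~fifo_rd[5], fifo_rd[4:3]}, while a
    valid F4 instruction packet still wants to enter the FIFO."""
    wrapped = ((~(fifo_rd >> 5) & 1) << 2) | ((fifo_rd >> 3) & 0b11)
    return f4_packet_valid and fifo_wr == wrapped
```

The extra MSB on the read pointer acts as a wrap counter: equal line bits with differing MSBs means the writer has lapped the reader, so accepting another packet would overwrite unread lines.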
6. F6 stage pipeline unit
The F6 stage pipeline unit converts the instruction packet into an execution packet according to the contents of the instruction buffer FIFO and the instruction-line end mark, and sends the execution packet to the execution pipeline.
It is to be understood that the described embodiments are merely some, and not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in this and related arts based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

Claims (4)

1. An instruction fetch pipeline for a memory-compute fusion processor, comprising six-stage pipelined instruction fetch logic, stages F1-F6:
the F1 stage pipeline unit generates the correct fetch address according to the execution characteristics of the application program;
the F2 stage pipeline unit sends the read address and related control signals to the iCache/iMEM, where the related control signals include, but are not limited to, a read enable signal;
the F3 stage pipeline unit registers the read data returned by the iCache/iMEM when the pipeline stalls, and sends the registered read data to the F4 stage pipeline unit when the stall is released;
the F4 stage pipeline unit judges whether the iCache/iMEM hits, generates pipeline stall signals for F1, F2 and F3 on an iCache miss, and starts the AXI Master to read external memory and update the iCache; according to the hit judgment it selects and receives the instruction packet returned by reading the iCache, the iMEM or the AXI Master, performs branch prediction on the returned packet, and sends the prediction result to the F5 stage pipeline unit;
the F5 stage pipeline unit buffers the instruction packet in the instruction buffer FIFO and sends the branch prediction target address to the F1 stage pipeline unit;
the F6 stage pipeline unit converts the instruction packet into an execution packet according to the contents of the instruction buffer FIFO and the instruction-line end mark, and sends the execution packet to the execution pipeline.
2. The instruction fetch pipeline for a memory-compute fusion processor of claim 1, wherein the F1 stage pipeline unit comprises a fetch address generation circuit composed mainly of a multiplexer and a register; the fetch sources of the multiplexer include, but are not limited to, a reset signal, the interrupt controller output, branch jump, branch prediction, and sequential execution, where the interrupt controller output covers interrupt entry, interrupt return, and interrupt cancellation.
3. The instruction fetch pipeline for a memory-compute fusion processor of claim 1, wherein the F4 stage pipeline unit contains a branch predictor that adopts a local two-bit prediction scheme;
after power-on reset, all entries of the branch prediction table are set to 2'b10; when predicting a branch, the predictor is indexed with PC[9:0] of the instruction line containing the conditional branch instruction; if prediction[PC[9:0]][1] == 1'b1, the branch is predicted taken, otherwise not taken; if the prediction is taken, pipeline flush signals for F1, F2, F3 and F4 are generated in the F5 stage pipeline unit;
the table contents are updated with the final outcome of the conditional branch; the branch predictor predicts over the 8 slots of an instruction packet; when a branch instruction line crosses a packet boundary, i.e. part of the branch instruction line still lies in the F3 stage instruction packet, branch prediction cannot be performed until the remaining packet reaches F4 from F3, and when that prediction result reaches F5, the F4 stage pipeline is not flushed.
4. The instruction fetch pipeline for a memory-compute fusion processor of claim 1, wherein the F5 stage pipeline unit comprises an instruction buffer FIFO with a 6-bit read pointer fifo_rd and a 3-bit write pointer fifo_wr; fifo_rd[4:3] serves as the FIFO line read pointer, fifo_rd[2:0] as the intra-line offset pointer, and fifo_wr[1:0] as the FIFO line write pointer;
the empty and full states of the FIFO are pre-judged from fifo_rd and fifo_wr: when fifo_wr equals fifo_rd[5:3], the FIFO is pre-judged empty; when fifo_wr equals {~fifo_rd[5], fifo_rd[4:3]} and the F4 stage instruction packet is valid, the FIFO is pre-judged full, and pipeline stall signals for F1, F2, F3 and F4 are generated;
when the instruction packet is all zeros, the FIFO emits the all-zero packet directly and the FIFO read pointer is incremented by 8.
CN202210158362.8A 2022-02-21 2022-02-21 Instruction fetch pipeline for a memory-compute fusion processor Pending CN114528024A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210158362.8A CN114528024A (en) 2022-02-21 2022-02-21 Instruction fetch pipeline for a memory-compute fusion processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210158362.8A CN114528024A (en) 2022-02-21 2022-02-21 Instruction fetch pipeline for a memory-compute fusion processor

Publications (1)

Publication Number Publication Date
CN114528024A true CN114528024A (en) 2022-05-24

Family

ID=81625105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210158362.8A Pending CN114528024A (en) 2022-02-21 2022-02-21 Instruction fetch pipeline for a memory-compute fusion processor

Country Status (1)

Country Link
CN (1) CN114528024A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116719561A (en) * 2023-08-09 2023-09-08 芯砺智能科技(上海)有限公司 Conditional branch instruction processing system and method
CN116719561B (en) * 2023-08-09 2023-10-31 芯砺智能科技(上海)有限公司 Conditional branch instruction processing system and method
CN118069224A (en) * 2024-04-19 2024-05-24 芯来智融半导体科技(上海)有限公司 Address generation method, address generation device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
US5553255A (en) Data processor with programmable levels of speculative instruction fetching and method of operation
US8108615B2 (en) Prefetching controller using a counter
US6012134A (en) High-performance processor with streaming buffer that facilitates prefetching of instructions
CN114528024A (en) Instruction fetch pipeline for a memory-compute fusion processor
US20130179640A1 (en) Instruction cache power reduction
CN112230992B (en) Instruction processing device, processor and processing method thereof comprising branch prediction loop
US20080072024A1 (en) Predicting instruction branches with bimodal, little global, big global, and loop (BgGL) branch predictors
US20190065205A1 (en) Variable length instruction processor system and method
US11301250B2 (en) Data prefetching auxiliary circuit, data prefetching method, and microprocessor
JPH10228377A (en) Information processor for predicting branch
KR20070001081A (en) Method and apparatus for allocating entries in a branch target buffer
US20200233806A1 (en) Apparatus, method, and system for enhanced data prefetching based on non-uniform memory access (numa) characteristics
US6823430B2 (en) Directoryless L0 cache for stall reduction
US5854943A (en) Speed efficient cache output selector circuitry based on tag compare and data organization
US10628163B2 (en) Processor with variable pre-fetch threshold
US6321299B1 (en) Computer circuits, systems, and methods using partial cache cleaning
EP4020229A1 (en) System, apparatus and method for prefetching physical pages in a processor
US5421026A (en) Data processor for processing instruction after conditional branch instruction at high speed
US11567776B2 (en) Branch density detection for prefetcher
US7389405B2 (en) Digital signal processor architecture with optimized memory access for code discontinuity
US6957319B1 (en) Integrated circuit with multiple microcode ROMs
CN111209043B (en) Method for realizing instruction prefetching in front-end pipeline by using look-ahead pointer method
KR100456215B1 (en) cache system using the block buffering and the method
CN116627335A (en) Low-power eFlash reading acceleration system
JPS61269735A (en) Instruction queue control system of electronic computer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination