CN114528024A - Instruction fetch pipeline for a storage-computation fusion processor - Google Patents
Instruction fetch pipeline for a storage-computation fusion processor
- Publication number
- CN114528024A (application CN202210158362.8A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- fifo
- pipeline
- branch
- level
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3842—Speculative instruction execution
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an instruction fetch pipeline for a storage-computation fusion processor. To maximize the processor's operating performance, a larger instruction memory iMEM together with a cache iCache preferably replaces the iCache and related multi-level caches of a conventional instruction fetch module, reducing the cache miss rate and improving fetch performance. Larger storage, however, means larger latency; the invention therefore designs six-stage pipelined instruction fetch logic to replace the original four-stage logic, shortening the critical path. Since the system clock frequency is determined by the critical path length, the clock frequency rises and system execution efficiency improves substantially. Although the number of fetch pipeline stages increases, fetch performance does not degrade, thanks to the branch predictor and the instruction buffer FIFO.
Description
Technical Field
The invention relates to the technical field of pipelined instruction fetching, and in particular to an instruction fetch pipeline for a storage-computation fusion processor.
Background
As processor and memory fabrication processes continue to improve, the memory-wall gap under the von Neumann architecture keeps widening, and memory-access power consumption becomes an increasingly prominent problem, so the focus of industry and academia has begun to shift from computation to storage. Meanwhile, the rapid growth of memory-intensive, highly parallel, low-precision applications such as artificial intelligence and brain-inspired computing is driving the rapid development of computational storage / storage-computation integration / in-memory computing, which in turn inevitably demands greater data throughput.
The Instruction Fetch (IF) stage is the process of fetching an instruction from main memory into the instruction register. To maximize the advantages of storage-computation integration / in-memory computing, fetch logic matching the required throughput must be designed. A conventional instruction fetch module uses an iCache to cache instruction machine code, but the iCache capacity is usually small, so the cache miss probability is high and the miss penalty is usually large. To maximize the performance of the storage-computation fusion processor, it is preferable to adopt a larger instruction memory iMEM (SRAM) and a cache iCache (SRAM) to replace the iCache and its related multi-level caches in a conventional fetch module, reducing the miss rate and improving fetch performance; larger storage, however, brings larger latency and reduces system execution efficiency.
Disclosure of Invention
Aiming at the latency problem of the existing instruction fetch pipeline in a storage-computation fusion processor that adopts a larger instruction memory and cache, the invention provides an instruction fetch pipeline for the storage-computation fusion processor that effectively improves program execution efficiency.
An instruction fetch pipeline for a storage-computation fusion processor comprises six-stage pipelined instruction fetch logic, stages F1-F6, wherein:
the F1 pipeline stage generates the correct fetch address according to the execution characteristics of the application program;
the F2 pipeline stage sends the read address and related control signals to the iCache/iMEM, the related control signals including but not limited to a read enable signal;
the F3 pipeline stage registers the read data returned by the iCache/iMEM when the pipeline stalls, and sends the registered read data to the F4 pipeline stage when the stall is released;
the F4 pipeline stage judges whether the iCache/iMEM hits; on an iCache miss it generates pipeline stall signals for F1, F2 and F3 and starts the AXI Master to read data from external memory and update the iCache; according to the hit result it selects the instruction packet returned by reading the iCache, the iMEM or the AXI Master, performs branch prediction on the returned instruction packet, and sends the prediction result to the F5 pipeline stage, where the branch prediction result takes effect;
the F5 pipeline stage buffers the instruction packet in the instruction buffer FIFO and sends the predicted branch target address to the F1 pipeline stage;
the F6 pipeline stage converts the instruction packet into an execution packet according to the contents of the instruction buffer FIFO and the instruction-line end marker, and sends the execution packet to the execution pipeline.
Furthermore, the F1 pipeline stage includes a fetch address generation circuit consisting mainly of a multiplexer and a register; the fetch-address sources of the multiplexer include but are not limited to a reset signal, the interrupt controller output, branch jumps, branch prediction and sequential execution, where the interrupt controller output covers interrupt entry, interrupt return and interrupt cancellation.
Furthermore, the F4 pipeline stage contains a branch predictor that adopts a local two-bit prediction scheme; after power-on reset, every entry of the branch prediction table is set to 2'b10; when predicting a branch, the table is indexed with PC[9:0] of the instruction line containing the conditional branch instruction; if prediction[PC[9:0]][1] == 1'b1 the branch is predicted taken, otherwise not taken; if the prediction is taken, pipeline flush signals for F1, F2, F3 and F4 are generated in the F5 pipeline stage;
the contents of the branch prediction table are updated according to the final resolution of the conditional branch; the branch predictor predicts all 8 slots of an instruction packet; when a branch instruction line straddles packets, that is, part of the instructions of the branch line are still in the F3-stage packet, branch prediction is deferred until the remaining packet reaches F4 from F3, and the F4 stage is not flushed when that prediction result reaches F5.
Further, the F5 pipeline stage includes an instruction buffer FIFO with a 6-bit read pointer fifo_rd and a 3-bit write pointer fifo_wr; fifo_rd[4:3] serves as the FIFO line read pointer, fifo_rd[2:0] as the in-line offset pointer, and fifo_wr[1:0] as the FIFO line write pointer;
the empty and full states of the FIFO are pre-judged from fifo_rd and fifo_wr: when fifo_wr equals fifo_rd[5:3], the FIFO is pre-judged empty; when fifo_wr equals {~fifo_rd[5], fifo_rd[4:3]} and the F4-stage instruction packet is valid, the FIFO is pre-judged full; when the FIFO is pre-judged full, pipeline stall signals for F1, F2, F3 and F4 are generated;
when an instruction packet is all zeros, the FIFO emits the all-zero packet directly and the read pointer is incremented by 8.
The six-stage pipelined instruction fetch logic designed by the invention replaces the original four-stage logic and shortens the critical path; since the system clock frequency is determined by the critical path length, the clock frequency rises and system execution efficiency improves substantially. Although the number of fetch pipeline stages increases, fetch performance does not degrade, thanks to the branch predictor and the instruction buffer FIFO.
Drawings
FIG. 1 shows the instruction fetch pipeline of the storage-computation fusion processor architecture;
FIG. 2 is a fetch address generation circuit;
FIG. 3 is a state diagram for a two-bit branch prediction scheme;
FIG. 4 is an instruction buffer FIFO structure.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. The embodiments are presented for purposes of illustration and description and are not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described to best explain the principles of the invention and its practical application, and to enable others of ordinary skill in the art to understand the invention through various embodiments with various modifications suited to the particular use contemplated.
Example 1
An instruction fetch pipeline for a storage-computation fusion processor, as shown in FIG. 1, includes six stages of pipelined instruction fetch logic, F1-F6, described below.
1. F1 stage pipeline unit
The F1 pipeline stage includes the fetch address generation circuit, which generates the correct fetch address according to the execution characteristics of the application program. There are three instruction sources: the iCache, the iMEM and the BIU; the BIU is the communication interface between the core and external devices, through which data in external memory is read.
The fetch address generation circuit consists mainly of a multiplexer and a register, as shown in FIG. 2. The fetch-address sources of the multiplexer include but are not limited to a reset signal, the interrupt controller output, branch jumps, branch prediction and sequential execution, where the interrupt controller output covers interrupt entry, interrupt return and interrupt cancellation.
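As a rough illustration of the priority selection this multiplexer performs, the following Python sketch models the next-PC choice. All names, the reset vector and the 32-byte sequential step are assumptions for illustration, not details taken from the patent:

```python
def next_fetch_pc(pc, *, reset=False, irq_target=None,
                  branch_target=None, predicted_target=None):
    """Priority mux for the F1 fetch address: reset > interrupt controller
    > resolved branch jump > branch prediction > sequential execution."""
    if reset:
        return 0x0000_0000            # assumed reset vector
    if irq_target is not None:        # interrupt / interrupt return / cancel
        return irq_target
    if branch_target is not None:     # resolved branch jump
        return branch_target
    if predicted_target is not None:  # predicted-taken target from F5
        return predicted_target
    return pc + 32                    # sequential: next instruction line
```

Reset outranks the interrupt controller, which outranks a resolved branch, which outranks a prediction; only when no redirect is pending does fetch continue sequentially to the next instruction line.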
2. F2 stage pipeline unit
The F2 pipeline stage sends the read address of the iCache/iMEM along with associated control signals, including but not limited to a read enable signal.
3. F3 stage pipeline unit
When the pipeline is stalled, the read data returned by the iCache/iMEM is registered; when the stall is released, the registered read data is sent to the F4 pipeline stage. Stall and resume are determined from the stall signals of the pipeline stages: when the stall signal stall is 1 the pipeline stalls, and when it is 0 the pipeline advances.
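The register-on-stall behaviour of F3 can be sketched in software as a one-entry skid buffer. This minimal Python model (interface names hypothetical) captures the returned read data while stall is high and releases it to F4 once the stall clears:

```python
class F3SkidBuffer:
    """One-entry skid buffer modelling the F3 stage: hold the iCache/iMEM
    read data that arrives while the pipeline is stalled, and hand it to
    F4 when the stall is released."""

    def __init__(self):
        self.saved = None

    def step(self, read_data, stall):
        if stall:
            if self.saved is None:
                self.saved = read_data   # register the in-flight data once
            return None                  # nothing advances to F4 this cycle
        if self.saved is not None:
            out, self.saved = self.saved, None
            return out                   # drain the registered data first
        return read_data                 # normal flow-through
```

The buffer is needed because the iCache/iMEM read launched in F2 returns its data regardless of the stall; without registering it in F3, that data would be lost while F4 is blocked.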
4. F4 stage pipeline unit
Judging whether the iCache/iMEM hits: if the fetch address is in the iMEM address range, the iMEM hits, and the instruction packet returned by reading the iMEM is selected; if the fetch address is not in the iMEM address range, the iCache is accessed first to judge whether the corresponding instruction is present. If so, the iCache hits and the instruction packet returned by reading the iCache is selected; if not, the iCache misses.
When the iCache misses, pipeline stall signals for F1, F2 and F3 are generated (i.e. stall is set), the AXI Master is started to read data from the external memory and update the iCache, and the instruction packet returned by the AXI Master read is selected.
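The hit judgment above amounts to routing each fetch address to one of three sources. A hedged Python sketch, with an assumed iMEM address window and 32-byte line size (neither constant is given in the patent):

```python
# Assumed address map for illustration: a 64 KiB iMEM window at address 0
# and 32-byte instruction lines.
IMEM_BASE, IMEM_SIZE = 0x0000_0000, 64 * 1024

def select_fetch_source(addr, icache_lines):
    """F4 hit judgment: iMEM if the fetch address is in its range, else the
    iCache on a tag hit, else the AXI Master refill path."""
    if IMEM_BASE <= addr < IMEM_BASE + IMEM_SIZE:
        return "iMEM"                    # iMEM hit
    if addr >> 5 in icache_lines:        # line-granular tag check (assumed)
        return "iCache"                  # iCache hit
    return "AXI"                         # miss: stall F1-F3 and refill
```

The miss case is the only one that stalls the earlier stages, which is why keeping the iMEM large enough to cover hot code directly pays off in this design.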
The F4 pipeline stage contains a branch predictor (using a local two-bit branch prediction scheme), performs branch prediction on the returned instruction packet, and sends the prediction result to the F5 pipeline stage.
After power-on reset, every entry of the branch prediction table is set to 2'b10. When predicting a branch, the predictor table is indexed with PC[9:0] of the instruction line containing the conditional branch instruction; PC[9:0] is the address used to index the branch prediction table, and its bit width varies with the table size. If prediction[PC[9:0]][1] == 1'b1, the branch is predicted taken; otherwise it is predicted not taken. If the prediction is taken, a pipeline flush signal flush for F1, F2, F3 and F4 is generated in the F5 pipeline stage to flush the earlier stages. The flush signal stays high for one clk cycle, during which intermediate results in the pipeline are reset to their initial values (including clearing stall).
The state diagram of the two-bit branch prediction scheme is shown in FIG. 3, with the stored prediction values corresponding to the state values in the diagram. The contents of the branch prediction table are updated according to the final resolution of the conditional branch. The branch predictor predicts all 8 slots of an instruction packet; when a branch instruction line straddles packets, that is, part of the instructions of the branch line are still in the F3-stage packet, branch prediction is deferred until the remaining packet reaches F4 from F3, and the F4 stage is not flushed when that prediction result reaches F5.
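The local two-bit scheme is a table of saturating counters. The following small Python model is consistent with the description (reset value 2'b10, indexed by PC[9:0], counter bit 1 giving the taken/not-taken decision); the 1024-entry size is an assumption implied by the 10-bit index:

```python
class TwoBitPredictor:
    """Local two-bit saturating-counter branch predictor: every counter
    resets to 0b10 (weakly taken) and the table is indexed by PC[9:0]."""

    def __init__(self):
        self.table = [0b10] * 1024      # 2'b10 after power-on reset

    def predict(self, pc):
        # bit 1 of the counter gives the taken / not-taken decision
        return (self.table[pc & 0x3FF] >> 1) & 1 == 1

    def update(self, pc, taken):
        # saturate at 0b11 (strongly taken) / 0b00 (strongly not taken)
        i = pc & 0x3FF
        if taken:
            self.table[i] = min(self.table[i] + 1, 0b11)
        else:
            self.table[i] = max(self.table[i] - 1, 0b00)
```

Out of reset every entry sits in the weakly-taken state, so the first prediction for any line is "taken", matching the 2'b10 initialization in the text; a single mispredict then flips the decision, while two confirmations saturate it.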
5. F5 stage pipeline unit
The F5 pipeline stage includes an instruction buffer FIFO, through which instruction packets are buffered while the predicted branch target address is provided to the F1 pipeline stage.
The instruction buffer FIFO has a 6-bit read pointer fifo_rd and a 3-bit write pointer fifo_wr; fifo_rd[4:3] serves as the FIFO line read pointer, fifo_rd[2:0] as the in-line offset pointer, and fifo_wr[1:0] as the FIFO line write pointer, as shown in FIG. 4.
The empty and full states of the FIFO are pre-judged from fifo_rd and fifo_wr: when fifo_wr equals fifo_rd[5:3], the FIFO is pre-judged empty; when fifo_wr equals {~fifo_rd[5], fifo_rd[4:3]} and the F4-stage instruction packet is valid, the FIFO is pre-judged full.
When the FIFO is pre-judged full, pipeline stall signals for F1, F2, F3 and F4 are generated; when the full condition is released, those stall signals are cleared. When an instruction packet is all zeros, the FIFO emits the all-zero packet directly and the read pointer is incremented by 8 (one FIFO line corresponds to 8 instruction slots).
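A pointer-level Python model of this FIFO follows, assuming 4 lines of 8 slots so that the stated pointer widths leave one wrap bit each; the empty/full equations below are one consistent reading of the description, not a verbatim transcription of the patent's RTL:

```python
class InstrFIFO:
    """Pointer model of the instruction buffer FIFO: assumed 4 lines of
    8 slots. fifo_rd is 6 bits (wrap bit, 2-bit line, 3-bit offset);
    fifo_wr is 3 bits (wrap bit, 2-bit line)."""

    def __init__(self):
        self.rd = 0      # 6-bit read pointer
        self.wr = 0      # 3-bit write pointer

    def empty(self):
        # empty when the write pointer equals the line part of the
        # read pointer, including the wrap bit: fifo_wr == fifo_rd[5:3]
        return self.wr == (self.rd >> 3)

    def full(self, f4_valid=True):
        # full when fifo_wr == {~fifo_rd[5], fifo_rd[4:3]} and the
        # F4-stage instruction packet is valid
        wrap, line = (self.rd >> 5) & 1, (self.rd >> 3) & 0b11
        return f4_valid and self.wr == (((wrap ^ 1) << 2) | line)

    def push_line(self):
        self.wr = (self.wr + 1) & 0b111          # write one 8-slot line

    def pop_slot(self, n=1):
        self.rd = (self.rd + n) & 0b111111       # all-zero packet: n = 8
```

The extra wrap bit on each pointer is the classic trick that lets the same line indices distinguish "empty" (pointers fully equal) from "full" (equal except for inverted wrap bits).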
6. F6 stage pipeline unit
The F6 pipeline stage converts the instruction packet into an execution packet according to the contents of the instruction buffer FIFO and the instruction-line end marker, and sends the execution packet to the execution pipeline.
It is to be understood that the described embodiments are merely some embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the relevant arts based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Claims (4)
1. An instruction fetch pipeline for a storage-computation fusion processor, comprising six-stage pipelined instruction fetch logic of stages F1-F6, wherein:
the F1 pipeline stage generates the correct fetch address according to the execution characteristics of the application program;
the F2 pipeline stage sends the read address and related control signals to the iCache/iMEM, the related control signals including but not limited to a read enable signal;
the F3 pipeline stage registers the read data returned by the iCache/iMEM when the pipeline stalls, and sends the registered read data to the F4 pipeline stage when the stall is released;
the F4 pipeline stage judges whether the iCache/iMEM hits, generates pipeline stall signals for F1, F2 and F3 on an iCache miss, and starts the AXI Master to read data from external memory and update the iCache; according to the hit result it selects and receives the instruction packet returned by reading the iCache, the iMEM or the AXI Master, performs branch prediction on the returned instruction packet, and sends the prediction result to the F5 pipeline stage;
the F5 pipeline stage buffers the instruction packet in the instruction buffer FIFO and sends the predicted branch target address to the F1 pipeline stage;
the F6 pipeline stage converts the instruction packet into an execution packet according to the contents of the instruction buffer FIFO and the instruction-line end marker, and sends the execution packet to the execution pipeline.
2. The instruction fetch pipeline of claim 1, wherein the F1 pipeline stage comprises a fetch address generation circuit consisting mainly of a multiplexer and a register, the fetch-address sources of the multiplexer including but not limited to a reset signal, the interrupt controller output, branch jumps, branch prediction and sequential execution, wherein the interrupt controller output covers interrupt entry, interrupt return and interrupt cancellation.
3. The instruction fetch pipeline for a storage-computation fusion processor of claim 1, wherein the F4 pipeline stage contains a branch predictor that adopts a local two-bit prediction scheme;
after power-on reset, every entry of the branch prediction table is set to 2'b10; when predicting a branch, the table is indexed with PC[9:0] of the instruction line containing the conditional branch instruction; if prediction[PC[9:0]][1] == 1'b1 the branch is predicted taken, otherwise not taken; if the prediction is taken, pipeline flush signals for F1, F2, F3 and F4 are generated in the F5 pipeline stage;
the contents of the branch prediction table are updated according to the final resolution of the conditional branch; the branch predictor predicts all 8 slots of an instruction packet; when a branch instruction line straddles packets, that is, part of the instructions of the branch line are still in the F3-stage packet, branch prediction is deferred until the remaining packet reaches F4 from F3, and the F4 stage is not flushed when that prediction result reaches F5.
4. The instruction fetch pipeline for a storage-computation fusion processor of claim 1, wherein the F5 pipeline stage comprises an instruction buffer FIFO having a 6-bit read pointer fifo_rd and a 3-bit write pointer fifo_wr, fifo_rd[4:3] serving as the FIFO line read pointer, fifo_rd[2:0] as the in-line offset pointer, and fifo_wr[1:0] as the FIFO line write pointer;
the empty and full states of the FIFO are pre-judged from fifo_rd and fifo_wr: when fifo_wr equals fifo_rd[5:3], the FIFO is pre-judged empty; when fifo_wr equals {~fifo_rd[5], fifo_rd[4:3]} and the F4-stage instruction packet is valid, the FIFO is pre-judged full; when the FIFO is pre-judged full, pipeline stall signals for F1, F2, F3 and F4 are generated;
when an instruction packet is all zeros, the FIFO emits the all-zero packet directly and the read pointer is incremented by 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210158362.8A CN114528024A (en) | 2022-02-21 | 2022-02-21 | Instruction fetching assembly line for storage and calculation fusion processor |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114528024A true CN114528024A (en) | 2022-05-24 |
Family
ID=81625105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210158362.8A Pending CN114528024A (en) | 2022-02-21 | 2022-02-21 | Instruction fetching assembly line for storage and calculation fusion processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114528024A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN116719561A (en) * | 2023-08-09 | 2023-09-08 | 芯砺智能科技(上海)有限公司 | Conditional branch instruction processing system and method
CN116719561B (en) * | 2023-08-09 | 2023-10-31 | 芯砺智能科技(上海)有限公司 | Conditional branch instruction processing system and method
CN118069224A (en) * | 2024-04-19 | 2024-05-24 | 芯来智融半导体科技(上海)有限公司 | Address generation method, address generation device, computer equipment and storage medium
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |