CN117850881B - Instruction execution method and device based on pipelining - Google Patents


Info

Publication number
CN117850881B
Authority
CN
China
Prior art keywords
instruction
indication information
loop
execution
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410076144.9A
Other languages
Chinese (zh)
Other versions
CN117850881A (en
Inventor
张�荣
李雨佳
苏运强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xinlianxin Intelligent Technology Co ltd
Original Assignee
Shanghai Xinlianxin Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xinlianxin Intelligent Technology Co ltd
Priority to CN202410076144.9A
Publication of CN117850881A
Application granted
Publication of CN117850881B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline or look ahead

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The application provides a pipeline-based instruction execution method and device. The method comprises: determining, based on first indication information stored in a dedicated register, whether to use a processor auto-control loop; if the processor auto-control loop is to be used, determining whether second indication information stored in the dedicated register is not greater than the length of the delay slot, where the second indication information indicates the number of remaining unexecuted instructions in the loop body; if the second indication information is equal to the length of the delay slot, decreasing third indication information and setting the second indication information to the total number of instructions in the loop body, where the third indication information indicates the number of remaining iterations of the loop body; and, upon determining that the third indication information in the dedicated register does not meet a loop end condition, executing the instructions in the delay slot and jumping to the starting instruction of the loop body. The scheme accelerates loop execution and reduces overhead.

Description

Instruction execution method and device based on pipelining
Technical Field
The present application relates to the field of computer technologies, and in particular, to a pipeline-based instruction execution method and apparatus.
Background
For tasks that need to be executed multiple times, using loops can improve the execution efficiency of the code, helping developers reduce unnecessary repetitive work. The loop allows a set of statements or code blocks, also called a loop body, to be repeatedly executed, thereby reducing redundancy and repeatability of the code and improving readability and maintainability of the code.
In practice, however, if the number of iterations is large and the loop body is short, the share of execution time spent on decrementing the iteration counter and on conditional branching increases.
Decrementing the iteration counter takes at least one clock cycle, and that figure assumes register pressure is low, i.e. all variables used in the loop body can stay in registers throughout. A short loop body does not itself take many clock cycles, so even one clock cycle of overhead is not negligible. If register pressure is high (for example, when the loop body uses a large number of other variables), the variable holding the iteration count may not remain in a register; it must be moved out during execution of the loop body (its value is stored back to memory and the register is reallocated to another variable). The decrement then requires at least reading the value back from memory before it can be decremented and used by the subsequent conditional jump. Thus, in each iteration, the counter variable goes through three steps: write-back to memory (freeing the register), reload from memory, and decrement. This involves two memory accesses, and since a memory access takes several times to several hundred times as many clock cycles as a register access, the overhead in this case can be considerable and may not be negligible even when the loop body is long.
The conditional branch ends the loop when the iteration count decreases to a certain value (typically 0). The conditional branch itself occupies at least one instruction and therefore at least one clock cycle; likewise, if the loop body is short and takes few clock cycles, even one clock cycle of overhead is not negligible. Moreover, modern CPUs employ branch prediction to optimize the performance of conditional branches. Branch prediction requires a complex and bulky branch predictor circuit, which accounts for a considerable part of CPU power consumption and incurs high design and semiconductor manufacturing costs. Further, when a branch prediction fails, the pipeline usually has to be flushed to ensure correct program execution, which costs additional clock cycles: all instructions in the microarchitecture-level delay slot have already been speculatively executed, and all of them have to be rolled back and undone.
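As a non-limiting illustration of the two overheads described above, the following Python sketch computes the share of each iteration that is spent on loop bookkeeping (counter decrement plus conditional branch). The function name and all cycle counts are hypothetical values chosen for illustration only; they are not measurements and do not appear in the application.

```python
def loop_overhead_share(body_cycles: int, overhead_cycles: int) -> float:
    """Fraction of one iteration's cycles spent on loop bookkeeping."""
    if body_cycles <= 0 or overhead_cycles < 0:
        raise ValueError("body_cycles must be positive, overhead_cycles non-negative")
    return overhead_cycles / (body_cycles + overhead_cycles)


# Hypothetical: 1 cycle for the decrement and 1 cycle for the conditional branch.
short_body = loop_overhead_share(body_cycles=4, overhead_cycles=2)
long_body = loop_overhead_share(body_cycles=100, overhead_cycles=2)

# Hypothetical high-register-pressure case: the counter is written back and
# reloaded each iteration, at an assumed 50 cycles per memory access.
spilled = loop_overhead_share(body_cycles=100, overhead_cycles=2 + 2 * 50)

print(f"short body: {short_body:.1%}, long body: {long_body:.1%}, spilled: {spilled:.1%}")
```

Under these assumed figures the bookkeeping share is large for the short body, small for the long body, and large again once the counter has to travel through memory, which matches the qualitative argument of the background.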
How to accelerate loop execution, reduce overhead and save power is a problem that needs to be solved.
Disclosure of Invention
The application provides a pipeline-based instruction execution method and device, which can accelerate loop execution and reduce overhead.
In a first aspect, an embodiment of the present application provides a pipeline-based instruction execution method. The method may be executed by a pipeline-based instruction execution apparatus, which may be a terminal device or a module for a terminal device, or a server or a module for a server. The application does not limit the execution body of the method. The method comprises the following steps: determining, based on first indication information stored in a dedicated register, whether to use a processor auto-control loop; if the processor auto-control loop is to be used, determining whether second indication information stored in the dedicated register is not greater than the length of the delay slot, where the second indication information indicates the number of remaining unexecuted instructions in the loop body; if the second indication information is equal to the length of the delay slot, decreasing third indication information and setting the second indication information to the total number of instructions in the loop body, where the third indication information indicates the number of remaining iterations of the loop body; and, upon determining that the third indication information in the dedicated register does not meet a loop end condition, executing the instructions in the delay slot and jumping to the starting instruction of the loop body.
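As a non-limiting illustration, the check described in the first aspect can be sketched in Python as a pure function over the three pieces of indication information. This is a behavioural sketch under stated assumptions, not the hardware implementation: the function name, the parameter names and the returned action strings are hypothetical, the decrease of the third indication information is assumed to be by one, and the loop end condition is assumed to be "remaining iterations reach 0".

```python
from typing import Tuple


def at_loop_boundary(
    in_use: bool,          # first indication information (loop in use or not)
    remaining_insns: int,  # second indication information
    remaining_iters: int,  # third indication information
    body_len: int,         # total number of instructions in the loop body
    delay_slot_len: int,   # microarchitecture-level delay slot length
) -> Tuple[str, int, int]:
    """Return (action, new remaining_insns, new remaining_iters)."""
    if not in_use:
        return "no-auto-loop", remaining_insns, remaining_iters
    if remaining_insns != delay_slot_len:
        # Greater or smaller than the delay slot length: covered by other
        # implementations in the text and not modelled in this sketch.
        return "not-at-boundary", remaining_insns, remaining_iters
    remaining_iters -= 1           # assumed: decrease by one
    remaining_insns = body_len     # reset for the next iteration
    if remaining_iters > 0:        # assumed end condition: counter reaches 0
        return "run-delay-slot-then-jump-to-loop-start", remaining_insns, remaining_iters
    return "end-condition-met", remaining_insns, remaining_iters


# Hypothetical loop: 6 instructions per iteration, 3 iterations left, delay slot of 2.
result = at_loop_boundary(True, 2, 3, body_len=6, delay_slot_len=2)
print(result)
```

The sketch keeps the state outside the function so that each call shows exactly which values the step reads and which it writes.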
The scheme has the following advantages. First, a dedicated register is added to the CPU and loop execution is accelerated by dedicated hardware optimization, which reduces overhead. Second, the dedicated register can hold the total number of iterations and the current iteration count, which relieves the storage pressure that temporary data generated during the loop places on memory and on registers, and largely avoids the program slowing down because of insufficient registers. Third, introducing loop-specific optimization into the hardware fixes the program's behaviour, reduces unnecessary risk and makes the written program more robust. Fourth, because the jump is handled when the second indication information equals the length of the delay slot, none of the instructions in the microarchitecture-level delay slot need to be revoked, which improves pipeline utilization, avoids some pipeline stalls and improves loop performance. Finally, with this method the branch predictor can be switched off inside the loop body, saving power.
In a possible implementation, if the current instruction is a loop instruction, the first indication information and the second indication information in the dedicated register are set according to the total number of instructions in the loop body indicated by the loop instruction, and the third indication information in the dedicated register is set according to the number of iterations indicated by the loop instruction.
In the above scheme, first, because the first indication information, the second indication information and the third indication information are held in the dedicated register, the storage pressure that temporary data generated during the loop places on memory and on registers is reduced, which largely avoids the program slowing down because of insufficient registers; second, through the cooperation of the first, second and third indication information, the branch predictor can be switched off inside the loop body, saving power.
In a possible implementation, if the loop instruction does not indicate a number of iterations, the third indication information is initialized to a first value according to the loop instruction, where the first value is a positive integer.
In the above scheme, if the loop instruction does not indicate a number of iterations, the third indication information is initialized to the first value according to the loop instruction, so that the third indication information can be accurately determined. Storing the third indication information in the dedicated register further reduces the storage pressure that temporary data generated during the loop places on memory and on registers, and largely avoids the program slowing down because of insufficient registers.
In a possible implementation, if the loop instruction indicates a minimum number of executions, the third indication information is initialized to the minimum number of executions; or, if the loop instruction indicates a maximum number of executions, the third indication information is initialized to the maximum number of executions; or, if the loop instruction indicates both a minimum and a maximum number of executions, the third indication information is initialized to any value between the minimum number of executions and the maximum number of executions.
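As a non-limiting illustration, the initialization rules for the third indication information described above can be sketched as follows. The function and parameter names are hypothetical, the default of 1 stands in for the unspecified positive "first value", and choosing the lower bound when both bounds are given is an arbitrary assumption (the text allows any value between the two).

```python
from typing import Optional


def initial_iteration_count(
    at_least: Optional[int] = None,  # minimum number of executions, if indicated
    at_most: Optional[int] = None,   # maximum number of executions, if indicated
    first_value: int = 1,            # assumed positive default ("first value")
) -> int:
    """Pick the initial value of the third indication information."""
    if at_least is not None and at_most is not None:
        if at_least > at_most:
            raise ValueError("at_least must not exceed at_most")
        return at_least  # assumption: any value in [at_least, at_most] is allowed
    if at_least is not None:
        return at_least
    if at_most is not None:
        return at_most
    if first_value <= 0:
        raise ValueError("first_value must be a positive integer")
    return first_value


print(initial_iteration_count(at_least=3))
print(initial_iteration_count(at_most=8))
print(initial_iteration_count(at_least=3, at_most=8))
print(initial_iteration_count())
```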
By means of the above scheme, the third indication information can be accurately determined.
In a possible implementation, the value of the third indication information is dynamically increased or decreased according to an external condition.
By means of the above scheme, the third indication information can be accurately determined.
In a possible implementation, when it is determined according to an external condition that the processor auto-control loop can be ended immediately, the first indication information in the dedicated register is set to a second value; the second value indicates that the processor auto-control loop is not used.
In the above scheme, setting the first indication information in the dedicated register to the second value switches off the processor auto-control loop immediately, so that no special operation is performed when the end of the loop body is reached, and execution continues onward instead of entering the next iteration. The processor auto-control loop can thus be ended at any time according to external conditions.
In a possible implementation, if the second indication information is greater than the length of the delay slot, the next instruction after the current instruction is executed, where the current instruction is an instruction in the loop body and the next instruction is also an instruction in the loop body.
In the above scheme, as long as execution has not yet reached the microarchitecture-level delay slot, the next instruction after the current instruction is executed. Because the jump is handled at the point where the second indication information equals the length of the delay slot, no instruction in the microarchitecture-level delay slot needs to be revoked, which improves pipeline utilization, avoids some pipeline stalls and improves loop performance.
In a possible implementation, if the second indication information is smaller than the length of the delay slot, pipeline bubbles are inserted into the delay slot, where the number of pipeline bubbles is the length of the delay slot minus the number of remaining unexecuted instructions in the loop body indicated by the second indication information.
In the above scheme, execution is about to enter the microarchitecture-level delay slot, but the total number of instructions in the loop body is smaller than the length of the delay slot, so pipeline bubbles are inserted into the delay slot to ensure correct program execution. Although the total number of instructions in the loop body is smaller than the length of the delay slot, the instructions in the loop body can still be executed by means of the microarchitecture-level delay slot and none of the instructions in it need to be revoked, which improves pipeline utilization, avoids some pipeline stalls and improves loop performance.
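As a non-limiting illustration, the bubble count described above (length of the delay slot minus the number of remaining unexecuted instructions) can be written as a short Python function. The function name and the example values are hypothetical.

```python
def bubbles_needed(delay_slot_len: int, remaining_insns: int) -> int:
    """Number of pipeline bubbles to insert into the delay slot.

    Bubbles are only needed when fewer instructions remain in the loop body
    than the delay slot can hold; otherwise the result is 0.
    """
    if delay_slot_len < 0 or remaining_insns < 0:
        raise ValueError("arguments must be non-negative")
    return max(0, delay_slot_len - remaining_insns)


# Hypothetical: a delay slot of 3 and a loop body with 1 instruction left.
print(bubbles_needed(delay_slot_len=3, remaining_insns=1))
```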
In a possible implementation, upon determining that the third indication information meets the loop end condition, the first indication information in the dedicated register is set to the second value; the instructions in the delay slot are executed and a jump is made to the starting instruction of the loop body.
In the above scheme, the first indication information is reset once the loop end condition is met, so the processor auto-control loop can be terminated quickly; this simplifies the circuit design, optimizes the circuit structure to improve performance, and reduces complexity.
In a possible implementation, the second indication information in the dedicated register is self-decremented.
The unconditional self-decrement in this scheme is similar to the self-increment of the PC (the program counter, which stores the address of the currently executed instruction): the PC automatically increments each time an instruction is executed so as to point to the next instruction to be executed. The unconditional self-decrement can therefore be deeply embedded in the pipeline, saving clock cycles and simplifying the circuit design.
In a possible implementation, the first indication information is initialized to the second value when the user program has just started executing.
By means of the above scheme, the circuit design can be simplified, the circuit structure is optimized to improve performance, and complexity is reduced.
In a second aspect, an embodiment of the present application provides a pipeline-based instruction execution apparatus, including a determining unit, a calculating unit and an executing unit. The determining unit is configured to determine, based on first indication information stored in a dedicated register, whether to use a processor auto-control loop, and, if the processor auto-control loop is to be used, to determine whether second indication information stored in the dedicated register is not greater than the length of the delay slot, where the second indication information indicates the number of remaining unexecuted instructions in the loop body. The calculating unit is configured to decrease third indication information and set the second indication information to the total number of instructions in the loop body if the second indication information is equal to the length of the delay slot, where the third indication information indicates the number of remaining iterations of the loop body. The executing unit is configured to, upon determining that the third indication information in the dedicated register does not meet the loop end condition, execute the instructions in the delay slot and jump to the starting instruction of the loop body.
In a possible implementation, the calculating unit is configured to, if the current instruction is a loop instruction, set the first indication information and the second indication information in the dedicated register according to the total number of instructions in the loop body indicated by the loop instruction, and set the third indication information in the dedicated register according to the number of iterations indicated by the loop instruction.
In a possible implementation, the calculating unit is configured to initialize the third indication information to a first value according to the loop instruction if the loop instruction does not indicate a number of iterations, where the first value is a positive integer.
In a possible implementation, the calculating unit is configured to initialize the third indication information to a minimum number of executions if the loop instruction indicates the minimum number of executions; or to initialize the third indication information to a maximum number of executions if the loop instruction indicates the maximum number of executions; or, if the loop instruction indicates both a minimum and a maximum number of executions, to initialize the third indication information to any value between the minimum number of executions and the maximum number of executions.
In a possible implementation, the calculating unit is configured to dynamically increase or decrease the value of the third indication information according to an external condition.
In a possible implementation, the calculating unit is configured to set the first indication information in the dedicated register to a second value when it is determined according to an external condition that the processor auto-control loop can be ended immediately; the second value indicates that the processor auto-control loop is not used.
In a possible implementation, the executing unit is configured to execute the next instruction after the current instruction if the second indication information is greater than the length of the delay slot, where the current instruction is an instruction in the loop body and the next instruction is also an instruction in the loop body.
In a possible implementation, the executing unit is configured to insert pipeline bubbles into the delay slot if the second indication information is smaller than the length of the delay slot, where the number of pipeline bubbles is the length of the delay slot minus the number of remaining unexecuted instructions in the loop body indicated by the second indication information.
In a possible implementation, the calculating unit is configured to, upon determining that the third indication information meets the loop end condition, set the first indication information in the dedicated register to the second value; the instructions in the delay slot are executed and a jump is made to the starting instruction of the loop body.
In a possible implementation, the calculating unit is configured to self-decrement the second indication information in the dedicated register.
In a possible implementation, the calculating unit is configured to initialize the first indication information to the second value when the user program has just started executing.
In a third aspect, embodiments of the present application also provide a computing device, comprising:
a memory for storing program instructions; and
a processor for calling the program instructions stored in the memory and performing, according to the obtained program instructions, any of the methods of the first aspect.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium having stored therein computer-readable instructions which, when read and executed by a computer, implement any of the methods of the first aspect described above.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program executable by a computer device to cause the computer device to perform any of the methods of the first aspect described above when the program is run on the computer device.
Drawings
FIG. 1 is a schematic diagram illustrating an effect of a branch instruction on a pipeline according to an embodiment of the present application;
FIG. 2 is a flow chart of an instruction execution method based on pipelining according to an embodiment of the present application;
FIG. 3 is a flow chart of an instruction execution method based on pipelining according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an instruction execution device according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an instruction execution device based on pipelining according to an embodiment of the present application.
Detailed Description
Some of the terms of art used in the present application are explained below.
Delay slot: a concept in an instruction set that refers to the instruction slot immediately following a branch (also called jump) instruction; the instruction in this slot is executed before the branch takes effect. This design can increase pipeline utilization, because the pipeline can continue executing the instruction in the delay slot while the branch instruction is being processed. The delay slot is used to overcome the pipeline stall caused by the branch instruction, so that subsequent instructions can continue to execute before the branch outcome is known, improving overall execution efficiency. The instructions placed in the delay slot are typically independent of the branch instruction, to ensure program correctness.
The delay slot in the present application differs from the term as conventionally defined in the art and is called a "microarchitecture-level delay slot". The microarchitecture-level "delay slot length" refers to how many instructions have actually entered the pipeline before the branch instruction is executed by the central processing unit (CPU).
Most CPUs now use pipelines. For instruction sets that do not define a delay slot, when the CPU actually executes a branch instruction, the pipeline has to stall and all instructions in the microarchitecture-level delay slot have to be revoked. For instruction sets that do define a delay slot, the delay slot length defined by the instruction set (typically the single instruction immediately following the branch instruction) does not necessarily match the properties of the pipeline, because pipelines of modern CPUs have grown longer than they were when the instruction set was designed. The microarchitecture-level delay slot length may already be greater than the instruction-set-level delay slot length, so more instructions need to be revoked, which may cause pipeline stalls. The cycles occupied by the pipeline stalls caused by these revoked instructions, and by other negative effects, are wasted.
Branch prediction: a form of speculative execution (also known as predictive execution). The word "speculative" suggests a technique with high gain on success and high loss on failure. Specifically, speculative execution uses existing information to make a prediction and executes in advance some instructions that may or may not turn out to be needed. If the prediction is correct, performance can be greatly improved; once a misprediction occurs, the instructions that have been executed must be rolled back and revoked, which often requires flushing the pipeline and wastes several or even tens of clock cycles.
Branch prediction is in fact used to solve the problem that, as pipelines get deeper, pipeline stalls caused by branch instructions consume too many clock cycles. If the microarchitecture has a shallow pipeline, say a 5-stage pipeline, and the instruction set defines a delay slot, then no branch prediction is needed at all (since the instruction-set-level delay slot and the microarchitecture can typically be matched exactly, without pipeline stalls or bubbles). In practice, however, only some lower-performance CPUs use a shallow pipeline, and branch prediction becomes more and more indispensable as high-performance CPU pipelines become deeper.
As for its implementation, branch prediction uses existing information, including but not limited to the execution history of the current jump instruction and the location of the jump destination address, to predict the destination address of a jump instruction and whether the jump should be taken. If the prediction succeeds, most or all of the instructions in the microarchitecture-level delay slot need not be revoked. Once the prediction fails, however, all instructions in the microarchitecture-level delay slot have already been speculatively executed, so all the effects they produced must be rolled back and undone; this often requires flushing the pipeline to ensure correct program execution and costs more clock cycles.
It can be said that the cost of a prediction failure is much higher than the cost of a pipeline stall. This is because branch prediction inherently deepens the pipeline, so a failed prediction wastes more cycles on rollback than a stall would in a pipeline without branch prediction. Further, with a shallow pipeline, the instructions in the microarchitecture-level delay slot often have not yet reached the stage where actual effects are produced (e.g., in FIG. 1, actual effects are typically not produced until the execute stage or even later); they only need to be prevented from entering subsequent stages, and no actual rollback is required. After a failed prediction, by contrast, some of the instructions in the microarchitecture-level delay slot may already have produced real effects, so rolling them back takes additional cycles and costs more.
For this reason, hit rate (also called accuracy or success rate) is used as an important index for measuring the quality of branch prediction. The branch prediction hit rate of modern CPUs can typically reach 90% or more. Such a high hit rate can offset the cost of misses and bring a practical benefit.
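As a non-limiting illustration of how a high hit rate can offset a high miss cost, the following Python sketch computes the expected cycles lost per branch. The model is a simplification that assumes a correct prediction costs nothing; the function name, the 15-cycle miss penalty and the 3-cycle stall are hypothetical figures, and only the 90% hit rate is taken from the text.

```python
def expected_cycles_lost(hit_rate: float, miss_penalty_cycles: float) -> float:
    """Expected cycles lost per branch, assuming a hit costs zero cycles."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) * miss_penalty_cycles


# Hypothetical comparison: always stalling for 3 cycles per branch versus
# predicting with a 90% hit rate and an assumed 15-cycle miss penalty.
stall_cost = 3.0
predicted_cost = expected_cycles_lost(hit_rate=0.9, miss_penalty_cycles=15.0)

print(f"stall: {stall_cost} cycles, prediction: {predicted_cost:.2f} cycles on average")
```

Under these assumed figures a single miss costs five times as much as a stall, yet the average cost per branch is lower with prediction.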
However, the rising branch prediction hit rate of modern CPUs comes at a cost. The cost is a complex and bulky branch predictor circuit, which accounts for a considerable part of CPU power consumption and incurs high design and semiconductor manufacturing costs. Also, branch prediction itself adds at least one pipeline stage, which further deepens the pipeline.
Clock cycle: the basic unit of time in which the CPU operates. Its duration is determined by the clock frequency: the higher the clock frequency, the more clock cycles per second and the faster the processor executes instructions. The unit of clock frequency is hertz (Hz), and its relationship with the clock period is: clock period = 1 / clock frequency.
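As a non-limiting illustration of the relationship clock period = 1 / clock frequency, the following Python sketch converts a clock frequency to a clock period in nanoseconds. The function name and the 2 GHz example are hypothetical.

```python
def clock_period_ns(clock_frequency_hz: float) -> float:
    """Clock period in nanoseconds for a clock frequency given in hertz."""
    if clock_frequency_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return 1e9 / clock_frequency_hz


# Hypothetical 2 GHz clock.
period = clock_period_ns(2e9)
print(f"{period} ns per clock cycle")
```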
Register: a hardware element in a computer for storing and retrieving data; it is a small storage unit located within the central processing unit that temporarily holds instructions and data. Registers are very fast because they are directly connected to the CPU core and can supply data quickly. Registers play a critical role in computer architecture, providing data input or accepting data output for operations such as arithmetic operations, logical operations and data transfers. Register types include general-purpose registers, special-purpose registers and status registers, among others.
Register pressure: the speed of CPU access to registers is much faster than memory. When optimizing the target code, the compiler can keep the variables in the register instead of the memory as much as possible, so that the running speed of the program can be greatly increased. However, the number of registers is limited. If the number of registers is not sufficient to store all the variables needed in a code block, the variables that are not used temporarily need to be stored back into memory to free up registers for the variables that are needed immediately, and then read out of memory if needed, resulting in a reduction in the speed of program operation.
Program loop: is a structure commonly used in programming for repeatedly executing a particular code block. There are many types of loops in programming, the most common of which include for loops and while loops. These loops allow the program to execute the same code multiple times according to specific conditions, improving the efficiency and flexibility of the code. The use of program loops may simplify the code making it more readable and maintainable.
Branching: also known as jumping (the term is translated both ways). It usually refers to control flow transferring from one instruction to another during program execution. This may be achieved in various ways, such as conditional branching, looping or function calls. Jumping is one of the important mechanisms for controlling program flow in assembly language and low-level programming. When executed on a CPU, a program loop is implemented by a branch instruction (usually a conditional branch instruction, which branches only if a certain condition is satisfied).
Kernel context switching: refers to a process that requires context information to be saved and restored when an operating system switches from one process or thread to another. The context includes the state of the processor registers, the values of the program counters, stack pointers, and other information related to process or thread execution. In the kernel context switching process, the operating system is responsible for saving the context information of the currently executing process and loading the context information of the next process to be executed. This ensures a seamless switching between the processes, enabling them to share the processor's time slices, enabling pseudo-concurrent execution (time division multiplexing). Kernel context switching is one of the key mechanisms for operating system management multitasking, which ensures system stability and efficiency. Different operating systems have different implementations, but the basic principles are similar.
The impact of a branch (jump) instruction on the pipeline is described in more detail below with reference to FIG. 1. In FIG. 1, T1, T2, T3, and so on form a time axis, and each cell represents one clock cycle. For a 5-stage pipeline, an instruction takes 5 clock cycles to complete, but one instruction can be read in every clock cycle, so from the fifth clock cycle onward one instruction completes every clock cycle. If the pipeline always stays full in this way, its performance goal is achieved, that is, the maximum instruction throughput is obtained. In FIG. 1 specifically, after T5 (ignoring microarchitecture-level delay slots for now), 5 instructions are processed in parallel in different pipeline stages in every cycle.
For the program code in FIG. 1, assume that the add instruction is fetched in clock cycle T1. By T2, the add instruction has been fetched by the fetch unit and passed to the decode unit for decoding; at the same time, the fetch unit starts fetching the sub instruction. In cycle T3, the next instruction, the beq instruction, is fetched. From the code it is apparent that there is likely to be a loop, but that is only the view of an observer who can see all of the program code. The processor, in cycle T3, is fetching the next instruction and does not yet know what that instruction is. After the fetch in T3 completes, the encoding of the beq instruction is available and is provided to the decode unit during T4. Meanwhile, the fetch unit fetches the following instruction in sequence, the lw instruction. When the lw instruction has been fetched, decoding of the beq instruction is complete, but it is still unknown whether its branch condition is satisfied, that is, whether the two registers s3 and s4 are equal (the "eq" in beq stands for "equal"). Fetching therefore continues, and the sw instruction is fetched. Only when clock cycle T5 completes does the beq instruction finish execution, that is, the comparison of registers s3 and s4, so that it is known whether the branch is taken. Assume the branch condition is satisfied; then the lw and sw instructions that have entered the pipeline should not actually be executed. They must be cleared, and fetching must restart from the correct address (the branch target, here denoted by the label next), that is, the subtraction instruction (sub) is fetched, followed in turn by the conditional branch instruction (beq).
However, the processor as constructed so far cannot remember that a conditional branch instruction (beq) was just fetched at this location. At this point it is simply fetching instructions, so it continues fetching sequentially: it fetches the lw instruction and then the sw instruction. Only when the sw instruction has been fetched does the previously fetched beq instruction complete; the processor then finds that the branch is taken, must clear the lw and sw instructions, and must fetch again.
It follows that, during execution of this loop, two correct instructions are executed and then two instructions that should not be executed are fetched, over and over. That is, roughly 50% of the performance is wasted while this loop runs. The underlying cause is that branch instructions conflict with the way a pipeline works: a branch changes the flow of instructions, whereas the pipeline wants to fetch instructions sequentially so that it stays full.
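The 50% figure follows from simple counting. The following is a minimal sketch, assuming the FIG. 1 situation of two useful instructions per iteration (sub and beq) and two fetched-then-cleared instructions (lw and sw); the function name is illustrative and not part of the application.

```python
def useful_fraction(body_len: int, flushed_per_iteration: int) -> float:
    """Fraction of fetched instructions that do useful work in one loop iteration."""
    fetched = body_len + flushed_per_iteration
    return body_len / fetched

# FIG. 1 example: sub and beq are useful; lw and sw are fetched, then cleared.
print(useful_fraction(2, 2))  # 0.5
```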
The instructions in the microarchitecture-level delay slot (the lw and sw instructions in this example) must not take effect when the jump condition is satisfied; they can only be revoked, otherwise the program's execution result is corrupted (a control hazard). In general, control hazards are eliminated by stalling the pipeline, which is achieved by inserting pipeline bubbles. Specifically, in FIG. 1, once it is determined in the fetch stage (T3 in FIG. 1) that the instruction is a jump instruction, a bubble is inserted at each of T4 and T5; a bubble is equivalent to an instruction that does nothing (also called nop or noop). That is, with pipeline bubbles inserted, the lw and sw instructions in FIG. 1 never actually enter the pipeline. The advantage is that the effects of instructions executed in the delay slot never need to be rolled back, so the circuit is simple.
Alternatively, lw and sw may be allowed to enter the pipeline without inserting bubbles, and only when the beq condition is determined to be satisfied, at the execution stage of beq (T5), are lw and sw prevented from entering subsequent stages (equivalent to a rollback/undo). In this way, if the jump does not occur (the condition is not satisfied), nothing is wasted in the microarchitecture-level delay slot; only when the jump occurs are clock cycles equal to the microarchitecture-level delay slot length wasted. For a loop, however, the jump condition is satisfied as long as the loop has not yet ended, so the jump almost always occurs, except when execution reaches the end of the loop. This optimization therefore has little effect on loops.
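To see why the second approach helps so little for loops, the cycles lost to the loop-closing branch can be counted under both approaches. This is a rough sketch, assuming the branch is taken on every iteration except the last and that each lost slot costs one cycle; the function and parameter names are illustrative.

```python
def wasted_cycles(iterations: int, delay_slot_len: int, always_stall: bool) -> int:
    """Cycles lost to the loop-closing branch over a whole loop.

    Assumes the branch is taken on every iteration except the last one.
    """
    if always_stall:
        # Bubbles are inserted every time the branch is fetched.
        return iterations * delay_slot_len
    # Revoke-on-taken: only taken branches cost cycles.
    taken = iterations - 1
    return taken * delay_slot_len

print(wasted_cycles(100, 2, always_stall=True))   # 200
print(wasted_cycles(100, 2, always_stall=False))  # 198
```

For a 100-iteration loop with a delay slot of length 2, the saving is a single iteration's worth of cycles.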
In general, the microarchitecture-level delay slot of an unconditional jump is shorter than that of a conditional jump. This is because a conditional jump instruction must reach a later pipeline stage before it can be determined whether the condition is met and hence whether the jump should be performed. An unconditional jump needs no such determination, so an optimized microarchitecture can perform the jump at an earlier pipeline stage. For example, for the 5-stage pipeline in FIG. 1, after optimization an unconditional jump can be performed in the decode stage or even the fetch stage, reducing the microarchitecture-level delay slot length to 1 (indirect jump: the destination address is stored in a register and can be determined in the decode stage) or even 0 (direct jump: the destination address is stored in an immediate and can be determined in the fetch stage). Conditional jumps can also be specially optimized in a similar way, and in a specific implementation the microarchitecture-level delay slot length may be reduced to 1.
Based on this, the MIPS ISA defines a delay slot of length 1 at the instruction set level for almost all jump instructions. That is, the one instruction following a jump instruction is executed whether or not the jump occurs. This neatly avoids the cost of revoking the microarchitecture-level delay slot. However, this approach only suits processors with few pipeline stages. Over time, mainstream processors have had to increase the pipeline depth (also called the number of stages) in order to apply various performance-improving techniques. Many pipelines now reach tens of stages, which makes the microarchitecture-level delay slot much longer, and the length may differ greatly between types of jump instruction. In this case, an instruction-set-level delay slot of length 1 is a drop in the bucket, and it may also complicate the design and optimization of the microarchitecture. Thus, apart from MIPS (which has a very long history), ISAs that define a delay slot at the instruction set level are very rare. The latest revision of MIPS added jump instructions without delay slots and dropped some jump instructions with delay slots.
Fig. 2 is a schematic flow chart of a pipelined instruction execution method according to an embodiment of the present application, where the method may be executed by a pipelined instruction execution device, and the pipelined instruction execution device may be a terminal device or a module for a terminal device, or a server or a module for a server. The application is not limited to the execution body of the method.
The method comprises the following steps:
Step 201, determining whether to use the processor auto-control loop based on the first indication information stored in the dedicated register.
In one possible implementation, a dedicated register is added to a conventional CPU to store the first indication information, the second indication information, and the third indication information. The first indication information indicates whether the processor auto-control loop is used; the second indication information indicates the number of remaining unexecuted instructions in the loop body; the third indication information indicates the number of remaining rounds of the loop.
In one possible implementation, if the value of the first indication information is the total number of instructions in the loop body, it is determined that the processor auto-control loop is used; if the value of the first indication information is its initial value, namely the second value, it is determined that the processor auto-control loop is not used. Illustratively, the second value is 0; other values are also possible, and the application is not limited in this respect.
Step 202, if it is determined that the processor auto-control loop is used, determining whether the second indication information stored in the dedicated register is not greater than the length of the delay slot.
The second indication information indicates the number of remaining unexecuted instructions in the loop body.
In one possible implementation, if the second indication information stored in the dedicated register, that is, the number of remaining unexecuted instructions in the loop body, is greater than the length of the delay slot, the next instruction after the current instruction is executed, where both the current instruction and the next instruction are instructions in the loop body. This case means that execution has not yet reached the microarchitecture-level delay slot, so the next instruction is simply executed. When the second indication information is equal to the length of the delay slot, none of the instructions in the microarchitecture-level delay slot needs to be revoked, which improves pipeline utilization, avoids partial pipeline stalls, and improves loop performance.
Step 203, if the second indication information is equal to the length of the delay slot, reducing the third indication information and setting the second indication information to the total number of instructions in the loop body.
The third indication information indicates the number of remaining rounds of the loop.
In one possible implementation, when the second indication information is equal to the length of the delay slot, the third indication information is reduced and the second indication information is set to the total number of instructions in the loop body. It is then determined whether the third indication information in the dedicated register meets the loop end condition. If the loop end condition is met, the first indication information in the dedicated register is initialized to the second value, the instructions in the delay slot are executed, and execution jumps to the start instruction of the loop body. If the loop end condition is not met, the instructions in the delay slot are executed and execution jumps to the start instruction of the loop body. Reducing the third indication information means reducing the number of remaining rounds by 1 or by a specified step, where the specified step is a positive integer greater than 1.
In one possible implementation, when the second indication information is equal to the length of the delay slot and the loop end condition is not met, the third indication information is reduced and the second indication information is set to the total number of instructions in the loop body; the instructions in the delay slot are then executed and execution jumps to the start instruction of the loop body.
In one possible implementation, when the second indication information is equal to the length of the delay slot and the loop end condition is met, the first indication information in the dedicated register is initialized to the second value; the instructions in the delay slot are then executed and execution jumps to the start instruction of the loop body.
In one possible implementation, if the second indication information is smaller than the length of the delay slot, pipeline bubbles are inserted into the delay slot to ensure that the program executes correctly, where the number of pipeline bubbles is the length of the delay slot minus the number of remaining unexecuted instructions in the loop body indicated by the second indication information.
In one possible implementation, when it is determined that the third indication information meets the loop end condition, the first indication information in the dedicated register is initialized to the second value; the instructions in the delay slot are then executed and execution jumps to the start instruction of the loop body. Illustratively, the first indication information in the dedicated register is initialized to 0. With this scheme, the first indication information is initialized once the loop end condition is met, so the processor auto-control loop can be terminated quickly, the circuit design is simplified, the circuit structure is optimized, performance is improved, and complexity is reduced.
Step 204, when it is determined that the third indication information in the dedicated register does not meet the loop end condition, executing the instructions in the delay slot and jumping to the start instruction of the loop body.
In one possible implementation, jumping to the start instruction of the loop body means jumping to the Nth instruction counted from the current instruction, where N is the total number of instructions in the loop body.
With this scheme: first, a dedicated register is added to the CPU and loop execution is accelerated through dedicated hardware optimization, reducing overhead; second, the dedicated register can store the total number of rounds and the current round, which relieves the storage pressure that temporary data generated during the loop places on memory and on the registers, and largely avoids programs slowing down because of insufficient registers; third, introducing dedicated loop optimization into hardware fixes the program's behavior, reduces unnecessary risk, and makes the written program more robust; fourth, when the second indication information is equal to the length of the delay slot, none of the instructions in the microarchitecture-level delay slot needs to be revoked, which improves pipeline utilization, avoids partial pipeline stalls, and improves loop performance; finally, the branch predictor can be turned off within the loop body, saving power.
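The decision made in steps 201 to 204 can be sketched as a small pure function over the three fields of the dedicated register. This is only an illustrative model, not the circuit of the application: the class name, field names, action strings, and the choice of 0 as the second value are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class LoopRegister:
    """Hypothetical model of the dedicated register (field names are assumptions)."""
    first: int = 0   # 0 = auto-loop off; otherwise total instructions in the loop body
    second: int = 0  # remaining unexecuted instructions in the loop body
    third: int = 0   # remaining loop rounds

def classify(reg: LoopRegister, delay_slot_len: int):
    """Return (action, bubbles) for the current register state."""
    if reg.first == 0:
        return ("execute_next", 0)          # step 201: auto-loop not in use
    if reg.second > delay_slot_len:
        return ("execute_next", 0)          # step 202: delay slot not reached yet
    bubbles = delay_slot_len - reg.second   # 0 when second equals the delay slot length
    return ("update_and_jump", bubbles)     # steps 203-204

print(classify(LoopRegister(first=6, second=4, third=3), 2))  # ('execute_next', 0)
print(classify(LoopRegister(first=6, second=2, third=3), 2))  # ('update_and_jump', 0)
print(classify(LoopRegister(first=1, second=1, third=3), 2))  # ('update_and_jump', 1)
```

The third call shows a loop body shorter than the delay slot, where one pipeline bubble makes up the difference.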
The following describes the case where the number of loop rounds is fixed and the case where it is not fixed, respectively.
First, for the case where the number of loop rounds is fixed, the method includes the following steps, as shown in FIG. 3.
In step 301, the first indication information is initialized to a second value.
In one possible implementation, the CPU initializes the dedicated register when it starts executing the user program. Illustratively, the first indication information is initialized to the second value, the second indication information is initialized to the length of the delay slot, and the third indication information is initialized to the second value. The initial values of the first indication information and the third indication information may also differ, which is not limited in the present application. This scheme simplifies the circuit design, optimizes the circuit structure to improve performance, and reduces complexity.
Step 302, decrementing the second indication information in the dedicated register.
In one possible implementation, if the current instruction is a loop instruction, the first indication information and the second indication information in the dedicated register are set according to the total number of instructions in the loop body indicated by the loop instruction, and the third indication information in the dedicated register is set according to the number of rounds indicated by the loop instruction. With this scheme, on the one hand, keeping the first, second, and third indication information in the dedicated register relieves the storage pressure that temporary data generated during the loop places on memory and on the registers, and largely avoids programs slowing down because of insufficient registers; on the other hand, through the cooperation of the first, second, and third indication information, the branch predictor can be turned off within the loop body, saving power.
In one possible implementation, the second indication information in the dedicated register is decremented for every instruction, whether or not the current instruction is a loop instruction. The application does not limit the step of the decrement operation; the step is a positive integer. This unconditional decrement is similar to the self-increment of the PC (the program counter, which stores the address of the currently executed instruction): each time an instruction is executed, the PC automatically increments to point to the next instruction to be executed. The unconditional decrement can therefore be embedded deep in the pipeline, saving clock cycles.
Step 303, determining whether to use the processor auto-control loop according to the first indication information.
In one possible implementation, if the value of the first indication information is the total number of instructions in the loop body, it is determined that the processor auto-control loop is used, and step 304 is performed; if the value of the first indication information is the initial value, for example 0, the processor auto-control loop is not currently enabled, and the next instruction is executed normally (step 305), the next instruction being an instruction outside the loop body.
Step 304, determining whether the second indication information is not greater than the length of the delay slot.
In one possible implementation, if the second indication information stored in the dedicated register, that is, the number of remaining unexecuted instructions in the loop body, is greater than the length of the delay slot, the next instruction after the current instruction is executed (step 305), where both the current instruction and the next instruction are instructions in the loop body.
In one possible implementation, if the number of remaining unexecuted instructions in the loop body is less than or equal to the length of the delay slot, step 306 is performed.
Step 305, execute the next instruction.
In one possible implementation, execution is resumed from step 302 until all instructions corresponding to the user program have been executed.
Step 306, determining whether the second indication information is smaller than the length of the delay slot.
In one possible implementation method, if the second indication information is smaller than the length of the delay slot, step 307 is performed; if the second indication information is equal to the length of the delay slot, step 308 is performed.
Step 307, inserting pipeline bubbles in the delay slot.
In one possible implementation, if the second indication information is smaller than the length of the delay slot, that is, the loop body is shorter than the microarchitecture-level delay slot, pipeline bubbles are inserted into the delay slot, where the number of pipeline bubbles is the length of the delay slot minus the number of remaining unexecuted instructions in the loop body indicated by the second indication information. Step 308 is then performed. In this scheme, the second indication information being smaller than the length of the delay slot means that execution is about to enter the microarchitecture-level delay slot while the total number of instructions in the loop body is smaller than the length of the delay slot, so pipeline bubbles are inserted into the delay slot. Even so, the instructions of the loop body can still be executed in the microarchitecture-level delay slot and none of the instructions in that delay slot is revoked, which improves pipeline utilization, avoids partial pipeline stalls, and improves loop performance.
Step 308, reducing the third indication information and setting the second indication information to the total number of instructions in the loop body.
Step 309, determining whether the third indication information meets the loop end condition.
In one possible implementation, if the third indication information meets the loop end condition, step 310 is performed; if the third indication information does not meet the loop end condition, the loop has not yet ended, the next round is entered, and step 311 is performed.
Step 310, initializing the first indication information to the second value.
In one possible implementation, one final round still remains to be executed. Because that final round does not need to jump back to the start of the loop when it finishes, it can be executed with the processor auto-control loop disabled, so the first indication information in the dedicated register is initialized to the second value; illustratively, it is initialized to 0. Step 311 is then performed.
Step 311, execute the instruction in the delay slot and jump to the start instruction of the loop body.
In one possible implementation, when jumping to the start instruction of the loop body on a fixed-length instruction set such as MIPS, the jump offset can be obtained directly by multiplying the number of instructions by the instruction length. CPUs are generally slow at multiplication, but because the instruction length of a fixed-length instruction set is generally a power of 2, no multiplication is actually needed; the number of instructions only has to be shifted left.
In one possible implementation, because the microarchitecture level has a delay slot, a number of instructions equal to the length of the delay slot enter the pipeline when jumping to the start instruction of the loop body. However, because it has already been determined whether the second indication information is less than or equal to the length of the delay slot, none of the instructions in the delay slot needs to be revoked, which improves performance.
In one possible implementation, the branch predictor may be turned off in the loop body when performing steps 301 to 311.
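Steps 301 to 311 can be traced with a small instruction-level simulation. The sketch below is not cycle-accurate and rests on several assumptions that the text leaves open: the program is encoded as hypothetical `("loop", body_len, rounds)` and `("op", name)` tuples; the loop instruction overwrites the register after the unconditional decrement; the counter is not decremented for the instructions drained from the delay slot; the loop end condition is "at most one round remains"; the second value is 0; and instructions are 4 bytes long, so the backward jump offset is the instruction count shifted left by 2.

```python
def run(program, delay_slot_len, instr_len_log2=2):
    """Trace which instructions execute under the FIG. 3 flow (not cycle-accurate).

    program: list of ("loop", body_len, rounds) or ("op", name) tuples.
    Returns the executed names in order, with "bubble" for pipeline bubbles.
    """
    first, second, third = 0, delay_slot_len, 0    # step 301
    pc = 0                                         # byte address
    trace = []

    def fetch():
        nonlocal pc
        instr = program[pc >> instr_len_log2]
        pc += 1 << instr_len_log2
        return instr

    end = len(program) << instr_len_log2
    while pc < end:
        instr = fetch()
        second -= 1                                # step 302: unconditional decrement
        if instr[0] == "loop":
            _, body_len, rounds = instr
            if rounds < 2:
                raise ValueError("this sketch assumes at least two rounds")
            first = second = body_len
            third = rounds
        else:
            trace.append(instr[1])
        if first == 0 or second > delay_slot_len:  # steps 303-304
            continue                               # step 305
        remaining, body_len = second, first
        bubbles = delay_slot_len - remaining       # steps 306-307
        third -= 1                                 # step 308
        second = body_len
        if third <= 1:                             # step 309 (assumed end condition)
            first = 0                              # step 310
        for _ in range(remaining):                 # step 311: drain the delay slot
            trace.append(fetch()[1])
        trace.extend(["bubble"] * bubbles)
        pc -= body_len << instr_len_log2           # jump back N instructions via a shift
    return trace

program = [("op", "pre"), ("loop", 3, 2),
           ("op", "a"), ("op", "b"), ("op", "c"), ("op", "post")]
print(run(program, delay_slot_len=1))
# ['pre', 'a', 'b', 'c', 'a', 'b', 'c', 'post']
```

In the example, the jump is triggered after `b`, when one instruction remains and the delay slot length is 1, so `c` runs in the delay slot and nothing is revoked. How bubbles are placed when the body is shorter than the delay slot is one possible reading of the flow.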
Second, for the case where the number of loop rounds is not fixed, the steps are similar to steps 301 to 311 and are not repeated here. The main difference between the two cases lies in how the third indication information is determined. When the number of rounds is fixed, the value of the third indication information is the number of loop rounds. When the number of rounds is not fixed, the value of the third indication information is not necessarily changed only by the reduction in step 308, and the first indication information is not necessarily initialized to the second value only in step 310; that is, at any step of FIG. 3 the CPU may, according to external conditions, modify the value of the third indication information or initialize the value of the first indication information, thereby affecting the flow above. If the value of the third indication information is modified, the number of rounds still to be executed changes: increasing the value extends the loop and decreasing it shortens the loop. If the value of the first indication information is initialized, the loop may be aborted immediately (the round currently in execution is still completed).
However, in order to ensure that the processor auto-control loop also operates normally before external conditions occur, it is necessary to assign an initial value to the third indication information before the processor auto-control loop is started. Here, the determination of the value corresponding to the third instruction information may be performed as follows.
In one possible implementation, if the loop instruction does not indicate the number of rounds, the third indication information is initialized to a first value according to the loop instruction, where the first value is a positive integer. Initializing the third indication information to the first value according to the loop instruction includes: if the loop instruction indicates a minimum number of executions, initializing the third indication information to that minimum; if the loop instruction indicates a maximum number of executions, initializing the third indication information to that maximum; and if the loop instruction indicates both a minimum and a maximum number of executions, initializing the third indication information to any value between the minimum and the maximum. With this scheme, the third indication information can be accurately determined even when the loop instruction does not indicate the number of rounds; it is then stored in the dedicated register, which relieves the storage pressure that temporary data generated during the loop places on memory and on the registers and largely avoids programs slowing down because of insufficient registers.
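The three cases above can be written as a short selection function. This is a sketch under assumptions: the function and parameter names are illustrative, the midpoint is just one arbitrary choice among the permitted values when both bounds are given, and the `default` used when neither bound is given is not specified by the text.

```python
from typing import Optional

def initial_rounds(at_least: Optional[int], at_most: Optional[int],
                   default: int = 1) -> int:
    """Pick the first value for the third indication information."""
    if at_least is not None and at_most is not None:
        # Any value between the bounds is allowed; the midpoint is an arbitrary pick.
        return (at_least + at_most) // 2
    if at_least is not None:
        return at_least
    if at_most is not None:
        return at_most
    return default  # assumption: the text does not cover this case

print(initial_rounds(4, None))   # 4
print(initial_rounds(None, 10))  # 10
print(initial_rounds(2, 5))      # 3
```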
In another possible implementation, the number of remaining rounds (the third indication information) in the dedicated register is initialized to the first value, and an execution period and an update period are set. Within the execution period, the number of remaining rounds in the dedicated register is re-initialized to the first value once every update period, until the execution period ends or execution of the loop body completes. For example, with an execution period of 10 seconds and an update period of 1 second, the number of remaining rounds in the dedicated register is re-initialized to the first value every 1 second until the execution period ends, after which the periodic update is no longer performed.
In one possible implementation, the value of the third indication information is dynamically increased or decreased according to the external condition (corresponding to extending or shortening the loop execution). The external conditions include one or more of the following: the state of a clock timer, processor overload, and a state change of an external device.
In one possible implementation, when it is determined according to the external condition that the processor auto-control loop can end immediately, the first indication information in the dedicated register is initialized to the second value. Initializing the first indication information to the second value turns off the processor auto-control loop immediately, so that no special operation takes place when execution reaches the end of the loop body: execution continues onward instead of entering the next round. The processor auto-control loop can thus be ended at any time according to external conditions.
Ending the processor auto-control loop immediately according to an external condition is applicable both when the number of loop rounds is not fixed and when it is fixed; the application is not limited in this respect.
In one possible implementation, the dedicated register needs to be saved and restored when the kernel performs a context switch. This process must handle the complexity of protecting and restoring process state, ensuring that no critical information is lost during the switch.
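A minimal sketch of saving and restoring the dedicated register across a context switch follows. It models only the three fields of the dedicated register; the class and function names are hypothetical, and a real kernel would save them alongside the general-purpose registers, program counter, and stack pointer.

```python
from dataclasses import dataclass, replace

@dataclass
class LoopRegister:
    first: int = 0   # auto-loop flag / loop body size
    second: int = 0  # remaining unexecuted instructions in the loop body
    third: int = 0   # remaining loop rounds

@dataclass
class TaskContext:
    name: str
    loop_reg: LoopRegister  # saved copy of the dedicated register

def context_switch(cpu_reg: LoopRegister, outgoing: TaskContext,
                   incoming: TaskContext) -> LoopRegister:
    """Save the live register into the outgoing task and load the incoming one."""
    outgoing.loop_reg = replace(cpu_reg)   # copy, so later changes do not alias
    return replace(incoming.loop_reg)

task_a = TaskContext("A", LoopRegister())
task_b = TaskContext("B", LoopRegister(first=0, second=2, third=0))
cpu = LoopRegister(first=6, second=4, third=9)   # task A is in the middle of a loop
cpu = context_switch(cpu, task_a, task_b)        # switch A -> B
cpu = context_switch(cpu, task_b, task_a)        # switch B -> A
print(cpu)  # LoopRegister(first=6, second=4, third=9)
```

After switching away and back, task A resumes with the same loop state, so its loop continues from where it was interrupted.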
Implementing the auto-loop instruction block involves the design of both the microarchitecture and the instruction set architecture of the processor. The instruction block may be used for loop control in various workloads, such as numerical computation and data processing. The control unit in the processor is responsible for interpreting and executing the instructions to realize the required functions, which can significantly reduce the cost of instruction jumps, shorten the running time of the program, and improve program efficiency.
Because the application provides a dedicated register for storing the total number of rounds and the current round, the storage pressure that temporary data generated during the loop places on memory and on the registers is relieved, and programs slowing down because of insufficient registers is largely avoided.
In terms of fault tolerance, every loop is a potential source of errors. Nested loops or improperly used conditions may produce logic errors that make the program behave unexpectedly and trigger a series of further errors. Introducing dedicated loop optimization into the hardware fixes the program behavior, reduces unnecessary risk, and makes the written program more robust.
Unlike the delay slot at the microarchitecture level, the delay slot length defined at the instruction set level is the number of instructions following a jump instruction that are guaranteed not to be revoked. A microarchitecture may, however, have a delay slot longer than the one defined by the instruction set, and the instructions in the excess part can only be revoked. In the present application, none of the instructions in the microarchitecture-level delay slot has to be revoked. This design improves pipeline utilization, avoids partial pipeline stalls, and benefits loop performance. In other words, the gap between the instruction-set-level delay slot and the microarchitecture-level delay slot is eliminated.
Further, the need for branch prediction is eliminated. A low-power processor, which typically has no branch predictor, can therefore achieve very high branch performance once the present application is applied. A high-performance processor with a branch predictor can turn the predictor off inside the loop body once the present application is applied, which saves power.
Based on the same technical concept, fig. 4 schematically illustrates a pipeline-based instruction execution apparatus 400 according to an embodiment of the present application. As shown in fig. 4, the apparatus includes a determining unit 401, a calculating unit 402, and an execution unit 403. The determining unit 401 is configured to determine, based on the first indication information stored in the dedicated register, whether to use the processor self-controlled loop and, if the processor self-controlled loop is used, to determine whether the second indication information stored in the dedicated register is not greater than the length of the delay slot; the second indication information indicates the number of remaining unexecuted instructions in the loop body. The calculating unit 402 is configured to, if the second indication information is equal to the length of the delay slot, decrease the third indication information and set the second indication information to the total number of instructions in the loop body; the third indication information indicates the number of remaining loops of the loop body. The execution unit 403 is configured to, upon determining that the third indication information in the dedicated register does not meet the loop end condition, execute the instructions in the delay slot and jump to the start instruction of the loop body.
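The division of work among the three units can be sketched as a single decision function. This is a software illustration under stated assumptions, not the circuit itself: the second value is taken to be 0, the loop end condition is assumed to be "no rounds remain", and the class, function, and action names are hypothetical.

```python
from dataclasses import dataclass

SECOND_VALUE = 0  # assumption: first indication value meaning "loop not in use"


@dataclass
class LoopRegister:
    first: int   # total instructions in the loop body, or SECOND_VALUE
    second: int  # remaining unexecuted instructions in the loop body
    third: int   # remaining loop rounds


def loop_end_condition(reg):
    """Assumed end condition: no rounds remain."""
    return reg.third <= 0


def decide(reg, delay_slot_len):
    """Return the action chosen for the current instruction."""
    # Determining unit: is the self-controlled loop in use at all?
    if reg.first == SECOND_VALUE:
        return "sequential"
    if reg.second > delay_slot_len:
        return "next_in_body"
    if reg.second < delay_slot_len:
        return "pad_with_bubbles"
    # Calculating unit: second == delay slot length, so update the counters.
    reg.third -= 1
    reg.second = reg.first
    # Execution unit: keep looping unless the end condition is met.
    if not loop_end_condition(reg):
        return "run_delay_slot_then_jump_to_loop_start"
    return "end_condition_met"


reg = LoopRegister(first=5, second=2, third=3)
action = decide(reg, delay_slot_len=2)
print(action, reg)
```

The sketch only names the action for each case; what the hardware does once the end condition is met is described separately in the text.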
In a possible implementation, the calculating unit 402 is configured to, if the current instruction is a loop instruction, set the first indication information and the second indication information in the dedicated register according to the total number of instructions in the loop body indicated by the loop instruction, and set the third indication information in the dedicated register according to the number of loops indicated by the loop instruction.
In a possible implementation method, the calculating unit 402 is configured to initialize the third indication information to a first value according to the loop instruction if the loop instruction does not indicate the number of loops; wherein the first value is a positive integer.
In a possible implementation, the calculating unit 402 is configured to: initialize the third indication information to the minimum number of executions if the loop instruction indicates a minimum number of executions; or initialize the third indication information to the maximum number of executions if the loop instruction indicates a maximum number of executions; or initialize the third indication information to any value between the minimum number of executions and the maximum number of executions if the loop instruction indicates both.
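The initialization rules above can be sketched as follows. The function and parameter names (`initial_round_count`, `setup_loop`, `count`, `at_least`, `at_most`) are hypothetical, and picking the lower bound when both bounds are given is an arbitrary choice among the permitted values. The case where the loop instruction carries neither a count nor a bound is not covered by the text, so the sketch rejects it.

```python
from typing import Optional


def initial_round_count(count: Optional[int] = None,
                        at_least: Optional[int] = None,
                        at_most: Optional[int] = None) -> int:
    """Choose the initial value of the third indication information."""
    if count is not None:
        return count                      # the loop instruction gives the count
    if at_least is not None and at_most is not None:
        if at_least > at_most:
            raise ValueError("minimum exceeds maximum")
        return at_least                   # any value in [at_least, at_most] is allowed
    if at_least is not None:
        return at_least
    if at_most is not None:
        return at_most
    raise ValueError("loop instruction gives neither a count nor a bound")


def setup_loop(total_instructions: int, **round_info) -> dict:
    """Model the register contents right after a loop instruction."""
    return {
        "first": total_instructions,    # total instructions in the loop body
        "second": total_instructions,   # nothing in the body executed yet
        "third": initial_round_count(**round_info),
    }


regs = setup_loop(6, at_least=2, at_most=8)
print(regs)
```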
In a possible implementation method, the calculating unit 402 is configured to dynamically increase or decrease the value corresponding to the third indication information according to an external condition.
In a possible implementation manner, the calculating unit 402 is configured to initialize the first indication information in the dedicated register to the second value when it is determined that the processor auto-control loop can be immediately ended according to the external condition.
In a possible implementation, the execution unit 403 is configured to execute the next instruction after the current instruction if the second indication information is greater than the length of the delay slot, where the current instruction and its next instruction are both instructions in the loop body.
In a possible implementation, the execution unit 403 is configured to insert pipeline bubbles in the delay slot if the second indication information is smaller than the length of the delay slot, where the number of pipeline bubbles is the length of the delay slot minus the number of remaining unexecuted instructions in the loop body indicated by the second indication information.
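The bubble count is a simple subtraction, sketched below. The helper names and the `"BUBBLE"` marker are hypothetical, and placing the bubbles after the real instructions is an assumption of this example rather than something the text specifies.

```python
def bubbles_needed(delay_slot_len: int, remaining: int) -> int:
    """Bubbles = delay slot length minus remaining unexecuted instructions."""
    if remaining >= delay_slot_len:
        return 0  # the remaining instructions already fill the delay slot
    return delay_slot_len - remaining


def fill_delay_slot(remaining_instructions, delay_slot_len):
    """Pad the remaining loop-body instructions up to the delay slot length."""
    pad = bubbles_needed(delay_slot_len, len(remaining_instructions))
    return list(remaining_instructions) + ["BUBBLE"] * pad


slot = fill_delay_slot(["add"], delay_slot_len=3)
print(slot)
```

For a delay slot of length 3 and a single remaining instruction, two bubbles are inserted.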
In a possible implementation, the calculating unit 402 is configured to, upon determining that the third indication information meets the loop end condition, initialize the first indication information in the dedicated register to the second value; the instructions in the delay slot are then executed and execution jumps to the start instruction of the loop body.
In a possible implementation, the calculating unit 402 is configured to decrement the second indication information in the dedicated register.
In a possible implementation, the calculating unit 402 is configured to initialize the first indication information to the second value when the user program has just started executing.
Based on the same technical concept, an embodiment of the present application provides a pipeline-based instruction execution apparatus 500, which may be, for example, a computing device. As shown in fig. 5, the pipeline-based instruction execution apparatus 500 includes at least one processor 501 and a memory 502 connected to the at least one processor. The embodiment of the present application does not limit the specific connection medium between the processor 501 and the memory 502; in fig. 5 they are connected by a bus, as an example. Buses may be divided into address buses, data buses, control buses, and so on.
In an embodiment of the present application, the memory 502 stores instructions executable by the at least one processor 501, and the at least one processor 501 can perform the pipeline-based instruction execution method described above by executing the instructions stored in the memory 502.
The processor 501 is the control center of the pipeline-based instruction execution apparatus 500. It may use various interfaces and lines to connect the parts of the computer device, and it performs resource setting by running or executing the instructions stored in the memory 502 and calling the data stored in the memory 502. Optionally, the processor 501 may include one or more determination units, and the processor 501 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 501. In some embodiments, the processor 501 and the memory 502 may be implemented on the same chip; in other embodiments, they may be implemented on separate chips.
The processor 501 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application-specific integrated circuit (ASIC), a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of a method disclosed in connection with the embodiments of the present application may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 502, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 502 may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disk, and the like. The memory 502 may also be any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 502 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, used for storing program instructions and/or data.
An embodiment of the present application also provides a computer-readable storage medium storing a computer-executable program, the computer-executable program being used to cause a computer to perform the pipeline-based instruction execution method described in any of the above implementations.
An embodiment of the present application provides a computer program product comprising a computer program executable by a computer device; when the program runs on the computer device, it causes the computer device to perform the pipeline-based instruction execution method described in any of the above implementations.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (12)

1. A pipeline-based instruction execution method, comprising:
based on first indication information stored in a dedicated register, if the value corresponding to the first indication information is the total number of instructions of a loop body, determining whether second indication information stored in the dedicated register is not greater than the length of a delay slot; the second indication information is used for indicating the number of remaining unexecuted instructions in the loop body;
if the second indication information is equal to the length of the delay slot, decreasing third indication information and setting the second indication information to the total number of instructions in the loop body; the third indication information is used for indicating the number of remaining loops of the loop body;
and upon determining that the third indication information in the dedicated register does not meet a loop end condition, executing the instructions in the delay slot and jumping to the start instruction of the loop body.
2. The method of claim 1, wherein before the determining based on the first indication information stored in the dedicated register, the method further comprises:
if the current instruction is a loop instruction, setting the first indication information and the second indication information in the dedicated register according to the total number of instructions in the loop body indicated by the loop instruction, and setting the third indication information in the dedicated register according to the number of loops indicated by the loop instruction.
3. The method of claim 2, further comprising:
if the loop instruction does not indicate the number of loops, initializing the third indication information to a first value according to the loop instruction; wherein the first value is a positive integer.
4. The method of claim 3, wherein initializing the third indication information to a first value according to the loop instruction comprises:
if the loop instruction indicates a minimum number of executions, initializing the third indication information to the minimum number of executions; or
if the loop instruction indicates a maximum number of executions, initializing the third indication information to the maximum number of executions; or
if the loop instruction indicates both a minimum number of executions and a maximum number of executions, initializing the third indication information to any value between the minimum number of executions and the maximum number of executions.
5. The method of claim 1, further comprising:
dynamically increasing or decreasing the value corresponding to the third indication information according to an external condition.
6. The method of claim 1, further comprising:
when it is determined according to an external condition that the loop is to end, initializing the first indication information in the dedicated register to a second value; the second value is used to indicate the end of the loop.
7. The method of claim 1, further comprising:
if the second indication information is greater than the length of the delay slot, executing the next instruction after the current instruction, wherein the current instruction is an instruction in the loop body and the next instruction after the current instruction is an instruction in the loop body.
8. The method of claim 1, further comprising:
if the second indication information is smaller than the length of the delay slot, inserting pipeline bubbles into the delay slot, wherein the number of pipeline bubbles is the length of the delay slot minus the number of remaining unexecuted instructions in the loop body indicated by the second indication information.
9. The method of claim 1, further comprising:
upon determining that the third indication information meets the loop end condition, initializing the first indication information in the dedicated register to a second value;
executing the instructions in the delay slot and jumping to the start instruction of the loop body.
10. The method of claim 1, wherein before the determining based on the first indication information stored in the dedicated register, the method further comprises:
decrementing the second indication information in the dedicated register.
11. The method of any one of claims 1 to 10, further comprising:
when the user program has just started executing, initializing the first indication information to the second value.
12. A pipeline-based instruction execution apparatus, comprising a determining unit, a calculating unit, and an execution unit, wherein:
the determining unit is configured to, based on first indication information stored in a dedicated register, if the value corresponding to the first indication information is the total number of instructions of a loop body, determine whether second indication information stored in the dedicated register is not greater than the length of a delay slot; the second indication information is used for indicating the number of remaining unexecuted instructions in the loop body;
the calculating unit is configured to, if the second indication information is equal to the length of the delay slot, decrease third indication information and set the second indication information to the total number of instructions in the loop body; the third indication information is used for indicating the number of remaining loops of the loop body;
and the execution unit is configured to, upon determining that the third indication information in the dedicated register does not meet a loop end condition, execute the instructions in the delay slot and jump to the start instruction of the loop body.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410076144.9A CN117850881B (en) 2024-01-18 2024-01-18 Instruction execution method and device based on pipelining

Publications (2)

Publication Number Publication Date
CN117850881A CN117850881A (en) 2024-04-09
CN117850881B true CN117850881B (en) 2024-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant