CN114116010B - Architecture optimization method and device for processor cycle body - Google Patents

Architecture optimization method and device for processor cycle body

Info

Publication number
CN114116010B
Authority
CN
China
Prior art keywords
instruction
loop
current instruction
short
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210096815.9A
Other languages
Chinese (zh)
Other versions
CN114116010A (en
Inventor
廖述京
陈钦树
欧艳凤
朱晓明
黄旭松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Communications and Networks Institute
Original Assignee
Guangdong Communications and Networks Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Communications and Networks Institute filed Critical Guangdong Communications and Networks Institute
Priority to CN202210096815.9A priority Critical patent/CN114116010B/en
Publication of CN114116010A publication Critical patent/CN114116010A/en
Application granted granted Critical
Publication of CN114116010B publication Critical patent/CN114116010B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003: Arrangements for executing specific machine instructions
    • G06F9/3005: Arrangements for executing specific machine instructions to perform operations for flow control
    • G06F9/30065: Loop control instructions; iterative instructions, e.g. LOOP, REPEAT
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38: Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836: Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842: Speculative instruction execution

Abstract

The present disclosure provides an architecture optimization method and device for a processor loop body. The method comprises the following steps: determining whether the current instruction is a short loop body instruction and, if so, caching the current instruction in a short loop body cache; and looking up the PC value of the current instruction in a table and, if a valid record in the table matches, determining that the current instruction is a conditional branch instruction that controls a loop body and prefetching subsequent instructions starting from the corresponding jump_pc in the table. The method and device can fetch the instructions of short loop bodies more efficiently and quickly at lower power consumption and reduce the probability of bubbles in the pipeline, thereby avoiding, as far as possible, the loss of core performance caused by slow instruction fetch; loop body prediction is more accurate, the conditional branch instruction that controls a loop body can be detected quickly, the number of pipeline flushes is reduced, and processor performance is improved.

Description

Architecture optimization method and device for processor cycle body
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for optimizing an architecture of a processor loop body.
Background
In a program, there are usually a large number of loop statements. In C code, they typically take the form of for, while, and do ... while statements. The contents of the loop body must be fetched and executed repeatedly; the execution results differ between iterations only because the register contents differ.
First, in terms of branch prediction, the strategy currently proposed in academia for loop bodies is "take backward jumps, do not take forward jumps", where a backward jump is one whose target address is smaller than the current address and a forward jump is one whose target address is larger than the current address. However, this strategy does not address how a loop body is detected. The key issue is how to distinguish conditional branch instructions that belong to a loop body from those that do not, quickly, efficiently, and stably.
Second, work on front-end instruction fetch generally focuses on how to access the icache quickly and on instruction prefetching. Little research addresses reducing the number of cache accesses made by a short loop body (a loop body containing few instructions) in order to save power and speed up the fetching of short loop body instructions.
Disclosure of Invention
It is an object of the present disclosure to provide an architecture optimization method and device for a processor loop body that solve one or more of the above-mentioned problems of the prior art.
According to one aspect of the present disclosure, there is provided an architecture optimization method for a processor loop body, comprising the following steps:
acquiring the current fetched instruction;
determining whether the current instruction is a short loop body instruction; if so, caching the current instruction in the short loop body cache (short loop buffer), and if not, not caching it;
looking up the PC value of the current instruction in the loop table and determining whether a valid record matches it, wherein the information recorded in the loop table comprises the PC value cur_br_pc of a conditional branch instruction that controls a loop body, the predicted jump address jump_pc of that conditional branch instruction, and the validity flag valid of the corresponding entry;
if a valid record in the loop table matches, determining that the current instruction is a conditional branch instruction that controls a loop body, prefetching subsequent instructions starting from the corresponding jump_pc in the table, the subsequent instructions being obtained from the icache or the short loop buffer, and decoding and executing the current instruction;
if no valid record in the loop table matches, only decoding and executing the current instruction.
In a possible embodiment, determining whether the current instruction is a short loop body instruction comprises:
acquiring the historical execution result of the current instruction;
determining, according to the historical execution result of the current instruction, whether the current instruction is a loop body instruction;
if the current instruction is a loop body instruction, determining whether the length of the current instruction meets the short loop body length requirement, and if so, determining that the current instruction is a short loop body instruction.
In a possible embodiment, only decoding and executing the current instruction when no valid record in the loop table matches comprises:
decoding the current instruction and determining, according to the decoding result, whether the current instruction is a conditional branch instruction; if not, performing no operation on the loop table;
if the current instruction is a conditional branch instruction, determining, according to the execution result of the current instruction, whether it jumps in the direction of a smaller PC value;
if not, performing no operation on the loop table;
if so, determining that the current instruction is a loop body branch instruction and updating the loop table: writing the cur_br_pc and jump_pc corresponding to the current instruction into the loop table and setting the validity of the entry to valid = true.
In a possible embodiment, determining that the current instruction is a conditional branch instruction that controls a loop body and prefetching subsequent instructions starting from the corresponding jump_pc in the table comprises:
determining whether the current instruction hits the short loop buffer;
if the short loop buffer is hit, fetching the required instruction from the short loop buffer;
if the short loop buffer is not hit, fetching the required instruction from the icache.
According to another aspect of the present disclosure, there is provided an architecture optimization device for a processor loop body, configured to implement any one of the above architecture optimization methods for a processor loop body, comprising:
a short loop body judging module, configured to determine whether the current instruction is a short loop body instruction;
a short loop body cache, configured to cache the current instruction when the current instruction is a short loop body instruction;
a loop body conditional branch prediction module, configured to predict whether the current instruction is a conditional branch instruction that controls a loop body by looking up the PC value of the current instruction in the loop table and determining whether a valid record matches it, wherein the information recorded in the loop table comprises the PC value cur_br_pc of a conditional branch instruction that controls a loop body, the predicted jump address jump_pc of that conditional branch instruction, and the validity flag valid of the corresponding entry;
and an instruction prefetching module, configured to prefetch subsequent instructions starting from the jump_pc found by the table lookup when the current instruction is a conditional branch instruction that controls a loop body, the subsequent instructions being obtained from the icache or the short loop buffer.
In a possible embodiment, the short loop body judging module comprises:
a loop body judging submodule, configured to acquire the historical execution result of the current instruction and determine, according to that result, whether the current instruction is a loop body instruction;
and a short loop body judging submodule, configured to determine, when the current instruction is a loop body instruction, whether the length of the current instruction meets the short loop body length requirement, and if so, to determine that the current instruction is a short loop body instruction.
In a possible embodiment, the device further comprises a loop table updating module, configured to determine, when no valid record in the loop table matches, whether to update the loop table according to the decoding result and the execution result.
In a possible embodiment, the loop table updating module comprises:
a conditional branch instruction judging submodule, configured to determine, according to the decoding result, whether the current instruction is a conditional branch instruction; if it is not, no operation is performed on the loop table;
a loop body branch instruction judging submodule, configured to determine, when the current instruction is a conditional branch instruction and according to its execution result, whether it jumps in the direction of a smaller PC value; if not, no operation is performed on the loop table;
and a table updating submodule, configured to update the loop table when the current instruction is a loop body branch instruction by writing the cur_br_pc and jump_pc corresponding to the current instruction into the loop table and setting the validity of the entry to valid = true.
In a possible embodiment, the instruction prefetching module further comprises:
an SLB hit module, configured to determine whether the current instruction hits the short loop buffer;
if the short loop buffer is hit, the required instruction is fetched from the short loop buffer;
if the short loop buffer is not hit, the required instruction is fetched from the icache.
With the architecture optimization method and device for a processor loop body, providing a short loop body cache allows the instructions of short loop bodies to be fetched more efficiently and quickly at lower power consumption and reduces the probability of bubbles in the pipeline, thereby avoiding, as far as possible, the loss of core performance caused by slow instruction fetch. Providing a table that records the PC values of conditional branch instructions together with their predicted jump addresses, and looking instructions up in it, yields more accurate loop body prediction: the conditional branch instruction that controls a loop body can be detected quickly, the number of pipeline flushes is reduced, and processor performance is improved.
In addition, unless otherwise specified, the technical solutions of the present disclosure can be implemented by conventional means in the art.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an architecture optimization method for a processor loop body according to an embodiment of the present disclosure.
Fig. 2 is an example of translating a for loop and an if structure in C into assembly.
Fig. 3 is an example of the information recorded in the loop table.
Fig. 4 is an example of the loop table after initialization.
Fig. 5 is an example of adding entry contents to an initialized loop table.
Fig. 6 is an example of replacing loop table entry contents.
Fig. 7 is a schematic structural diagram of an architecture optimization device for a processor loop body according to an embodiment of the present disclosure.
Detailed Description
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, system, article, or apparatus.
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
Referring to Fig. 1, an architecture optimization method for a processor loop body according to an embodiment of the present disclosure includes the following steps:
step 1: acquiring the current fetched instruction;
step 2: determining whether the current instruction is a short loop body instruction; if so, executing step 3, and if not, executing step 4;
step 3: caching the current instruction in the short loop body cache (short loop buffer), and executing step 4;
step 4: looking up the PC value of the current instruction in the loop table and determining whether a valid record matches it, wherein the information recorded in the loop table comprises the PC value cur_br_pc of a conditional branch instruction that controls a loop body, the predicted jump address jump_pc of that conditional branch instruction, and the validity flag valid of the corresponding entry; if a valid record in the loop table matches, executing step 5; if no valid record in the loop table matches, executing step 6;
step 5: determining that the current instruction is a conditional branch instruction that controls a loop body, prefetching subsequent instructions starting from the corresponding jump_pc in the table, the subsequent instructions being obtained from the icache or the short loop buffer, and executing step 6;
step 6: decoding and executing the current instruction.
In an optional embodiment, in step 2, determining whether the current instruction is a short loop body instruction includes:
step 2.1: acquiring the historical execution result of the instruction;
step 2.2: determining, according to the historical execution result of the instruction, whether the current instruction is a loop body instruction;
step 2.3: if the current instruction is a loop body instruction, determining whether the length of the current instruction meets the short loop body length requirement, and if so, determining that the current instruction is a short loop body instruction.
Because a loop body executes regularly, its instructions are fetched and executed repeatedly. The icache is generally a relatively small memory, for example 32 KB or 64 KB, that caches recently accessed instruction data by exploiting the principle of locality. The smaller a memory is, the faster it can be accessed. Continuous high-frequency access to the icache by a loop body consumes considerable power. Therefore, a short loop buffer (hereinafter abbreviated as SLB) is provided. Through loop body detection and short loop body determination, the short loop body is cached in the SLB in step 3, so that the corresponding instructions can be fetched directly from the SLB during subsequent instruction prefetching. This reduces the time spent on icache lookups, lowers power consumption, speeds up the fetching of the short loop body, and improves processor performance.
Specifically, in the instruction execution stage, whether an executed instruction belongs to a loop body can be determined by detecting whether a conditional branch instruction jumps in the direction of a smaller PC value (that is, jumps backward). The PC range of the instruction code segment of that level of the loop body can then be determined as [jump_pc, cur_pc], where cur_pc is the PC value of the current instruction and jump_pc is its jump address; for a backward jump, jump_pc is smaller than cur_pc. It is therefore possible to determine, from the historical execution results of instructions, whether an instruction fetched subsequently belongs to a loop body.
Specifically, assuming that the PCs of the first and last instructions of the short loop body are start_pc and end_pc, respectively, the length of the short loop body is end_pc - start_pc. Since one PC value corresponds to 1 B of data and the SLB is also indexed by PC, the length of the short loop body should be less than or equal to the size of the SLB.
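The length check can be illustrated with a minimal Python sketch. The function name is_short_loop and the 256 B value of SLB_SIZE are assumptions chosen for illustration; the text only requires that the loop length not exceed the SLB size.

```python
SLB_SIZE = 256  # bytes; hypothetical value, not taken from the patent


def is_short_loop(start_pc: int, end_pc: int, slb_size: int = SLB_SIZE) -> bool:
    """Return True if a loop spanning [start_pc, end_pc] fits in the SLB.

    One PC value addresses 1 B, so the loop length in bytes is simply
    end_pc - start_pc.
    """
    length = end_pc - start_pc
    return 0 <= length <= slb_size


fits = is_short_loop(0x100, 0x140)      # 64 B loop
too_long = is_short_loop(0x100, 0x300)  # 512 B loop
```

Under these assumed values, the 64 B loop qualifies as a short loop body and the 512 B loop does not.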
Fig. 2 shows an example of translating a loop structure and an if structure in C into assembly. After being compiled into an executable file, both the loop body and the if structure contain a conditional branch instruction, so it is necessary to determine whether the current instruction is a conditional branch instruction of a loop body.
At the assembly level, the conditional branch instruction of a loop body and the conditional branch instruction of an if structure do not differ significantly. Both compare the values in two registers and jump according to the result. However, compiling and disassembling for loops and if structures shows that a conditional branch that jumps toward a smaller PC value (a backward jump) when its condition is satisfied can only be a branch instruction of a loop body, whereas a conditional branch that jumps toward a larger PC value (a forward jump) when its condition is satisfied can only be a branch instruction of an if structure.
Referring to Fig. 2, in the for loop, taking pc = 12 as an example, when the condition is satisfied the pc value becomes smaller and the branch jumps backward; when the condition is not satisfied, pc = 12 + 4, so the pc value increases and execution moves forward. In the if structure, taking pc = 0 as an example, when the condition is not satisfied, pc = 0 + 4, so the pc value increases and execution moves forward; when the condition is satisfied, pc = 8, so the pc value also increases and the branch jumps forward.
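The backward-jump rule can be sketched in Python as follows. The helper name is_loop_branch is hypothetical, and the backward target 4 used for the for-loop branch at pc = 12 is an assumed value, since the actual target appears only in Fig. 2.

```python
def is_loop_branch(cur_pc: int, jump_pc: int, taken: bool) -> bool:
    """Classify an executed conditional branch.

    A taken branch whose target is below its own PC (a backward jump)
    is treated as the branch that controls a loop body; everything
    else is treated as a non-loop branch such as an if structure.
    """
    return bool(taken and jump_pc < cur_pc)


for_loop_taken = is_loop_branch(cur_pc=12, jump_pc=4, taken=True)       # assumed target 4
for_loop_not_taken = is_loop_branch(cur_pc=12, jump_pc=4, taken=False)  # falls through to 16
if_taken = is_loop_branch(cur_pc=0, jump_pc=8, taken=True)              # forward jump to 8
```

Only the first case is classified as a loop body branch; the fall-through and the forward jump of the if structure are not.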
However, since the conditional branch instruction has not yet been executed in the fetch stage, it is not known for certain whether the branch will actually jump forward or backward. The direction of the subsequent jump and the jump address therefore need to be predicted so that the required instructions can be fetched ahead of time.
Thus, in step 4, the loop table is used to record the PC value of the conditional branch instruction that controls a loop body, the predicted jump address of that conditional branch instruction, and whether the entry is valid. Fig. 3 shows the information recorded in the loop table. If the PC of the current instruction matches a valid record in the loop table, it can be determined that the current instruction is a conditional branch instruction that controls a loop body, and it can be predicted that fetching will subsequently continue from jump_pc, so the instructions can be fetched in advance from the icache or the short loop body cache (short loop buffer). In this way the instructions of the loop segment can be predicted with a high probability of being correct, and the performance loss caused by the back end waiting for instruction fetch is reduced.
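A minimal Python sketch of the lookup in step 4 is given below. The field names cur_br_pc, jump_pc and valid come from the text; the LoopTableEntry class, the lookup function, the linear search and the sample PC values are assumptions made for illustration and say nothing about how the hardware table is actually organised.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LoopTableEntry:
    cur_br_pc: int = 0   # PC of the conditional branch that controls a loop body
    jump_pc: int = 0     # predicted jump address of that branch
    valid: bool = False  # whether this entry holds usable information


def lookup(table: List[LoopTableEntry], pc: int) -> Optional[int]:
    """Return the predicted jump_pc if a valid entry matches pc, else None."""
    for entry in table:
        if entry.valid and entry.cur_br_pc == pc:
            return entry.jump_pc
    return None


table = [LoopTableEntry(12, 4, True), LoopTableEntry(40, 20, False)]
hit_target = lookup(table, 12)    # valid match: prefetch starts at 4
stale_target = lookup(table, 40)  # entry present but invalid
miss_target = lookup(table, 16)   # no entry at all
```

A non-None result corresponds to step 5 (prefetch from jump_pc); None corresponds to going straight to step 6.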
In an alternative embodiment, in step 5, determining that the current instruction is a conditional branch instruction that controls a loop body and prefetching subsequent instructions starting from the corresponding jump_pc in the table comprises:
step 5.1: determining whether the current instruction hits the short loop buffer; if so, executing step 5.2, and if not, executing step 5.3;
step 5.2: fetching the required instruction from the short loop buffer;
step 5.3: fetching the required instruction from the icache.
Specifically, the SLB acts as a cache and supports write operations and read operations. A write operation updates the SLB; a read operation fetches instructions from the SLB.
The size of the SLB (SLB_SIZE) directly determines the maximum number of instructions that can be cached. Generally, one address corresponds to 1 B of data, that is, one pc value addresses 1 B of data. For rv64g, however, instructions can only be 4 B long, so pc is aligned to 4 B, that is, pc can only take values of the form 4N; otherwise the core should raise an instruction address misaligned exception. For rv64gc, instructions can be 4 B or 2 B long, so pc is aligned to at least 2 B, that is, values of the form 2N.
To store the instructions of a two-level short loop body, the size of the SLB should be greater than 128 B; to retain the benefits of fast fetching and low power consumption, it should be less than 1024 B. Superscalar processors typically need to fetch multiple instructions at once; a 2-issue processor, for example, needs to fetch 2 instructions from the cache at the same time. If the two instructions are standard 32-bit instructions, one fetch address requires fetching 2 × 32 bits = 64 bits = 8 B of instruction data.
Since the minimum instruction granularity is 2 B (a compressed instruction), each bank of the SLB can be designed so that one address corresponds to 2 B of data. This embodiment takes a fetch width of 8 B as an example to illustrate how the SLB write and read operations are implemented. Since 8 B of data must be fetched at a time and one SLB bank can provide only 2 B, 4 banks are required.
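The sizing arithmetic above can be checked with a few lines of Python. The constant names are illustrative only; the values are the ones used in this embodiment.

```python
ISSUE_WIDTH = 2        # instructions fetched per cycle (2-issue example)
INSTR_BITS = 32        # standard instruction length in bits
BANK_WIDTH_BYTES = 2   # one bank entry holds one 2 B (compressed) instruction

fetch_bytes = ISSUE_WIDTH * INSTR_BITS // 8   # bytes needed per fetch address
num_banks = fetch_bytes // BANK_WIDTH_BYTES   # banks needed to supply them

print(fetch_bytes, num_banks)
```

With these values the fetch width is 8 B and 4 banks are needed, matching the figures in the text.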
To control reading and writing of the SLB contents, three key registers are provided, with the following functions:
slb_start: indicates the pc at which the instructions currently cached in the SLB start;
slb_end: indicates the pc at which the instructions currently cached in the SLB end;
slb_valid: indicates whether the SLB currently caches valid instructions, that is, whether it holds short loop body instructions.
The SLB write operation is explained below for two cases by way of example.
In the first case, the SLB is invalid or its contents are to be overwritten. Taking the aforementioned example in which one fetch address requires fetching 2 × 32 bits = 64 bits = 8 B of instruction data, if slb_valid == false and the backward-jumping conditional branch instruction satisfies br_pc - jump_pc < SLB_SIZE (where br_pc = fetch_pc if the current instruction is a loop body conditional branch instruction), or if slb_valid & ((br_pc > slb_end) | (jump_pc < slb_start)), the instruction data d[63:0] is cached in the SLB at the same time as the instruction is fetched from the icache.
At this time, the address used to access each bank of the SLB and the data written to each bank must be determined in four cases:
if fetch_pc == 4N:
{bank3_addr,bank2_addr,bank1_addr,bank0_addr}={N,N,N,N},
{bank3_data,bank2_data,bank1_data,bank0_data}={d[63:48],d[47:32],d[31:16],d[15:0]}.
if fetch_pc == 4N+1:
{bank3_addr,bank2_addr,bank1_addr,bank0_addr}={N,N,N,N+1},
{bank3_data,bank2_data,bank1_data,bank0_data}={d[47:32],d[31:16],d[15:0],d[63:48]}.
if fetch_pc == 4N+2:
{bank3_addr,bank2_addr,bank1_addr,bank0_addr}={N,N,N+1,N+1},
{bank3_data,bank2_data,bank1_data,bank0_data}={d[31:16],d[15:0],d[63:48],d[47:32]}.
if fetch_pc == 4N+3:
{bank3_addr,bank2_addr,bank1_addr,bank0_addr}={N,N+1,N+1,N+1},
{bank3_data,bank2_data,bank1_data,bank0_data}={d[15:0],d[63:48],d[47:32],d[31:16]}.
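The four write cases follow one pattern, which the Python sketch below makes explicit. It assumes that the index h passed to the function counts 2 B units, so that h = 4N + k selects row N and starting bank k; this interpretation, the function name slb_write and the use of dictionaries to model the banks are assumptions made for illustration, not details stated in the patent.

```python
NUM_BANKS = 4        # 4 banks x 2 B = 8 B per fetch, as in the text
HALF_MASK = 0xFFFF   # one bank entry holds 2 B


def slb_write(banks, h, d):
    """Write the 64-bit fetch data d starting at 2 B-unit index h = 4N + k.

    The i-th 2 B slice of d goes to bank (k + i) mod 4. Banks below the
    starting bank k have wrapped around, so they use row N + 1.
    """
    n, k = divmod(h, NUM_BANKS)
    for i in range(NUM_BANKS):
        bank = (k + i) % NUM_BANKS
        row = n + 1 if bank < k else n
        banks[bank][row] = (d >> (16 * i)) & HALF_MASK


banks = [dict() for _ in range(NUM_BANKS)]
# Case 4N+1 with N = 1: d[15:0] -> bank1, d[31:16] -> bank2,
# d[47:32] -> bank3 (all row 1), d[63:48] -> bank0 at row 2.
slb_write(banks, 4 * 1 + 1, 0x4444_3333_2222_1111)
```

After the call, each bank holds one 2 B slice at the row given by the 4N+1 case above.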
In the second case, for a multi-level loop, when the outer loop has already saved its instruction segment in the SLB, the inner loop need not save its code segment to the SLB again.
If slb_valid & (cur_pc ≤ slb_end) & (jump_pc ≥ slb_start), no action is taken: the inner loop instructions are already contained in the outer loop code segment, so there is no need to fetch them again to fill the SLB.
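The two write cases can be combined into one decision, sketched below in Python. The function name should_fill_slb, the argument order and the sample addresses are hypothetical; the conditions themselves are the ones given for the first case, and the second case is simply the situation in which neither of them holds.

```python
def should_fill_slb(slb_valid: bool, slb_start: int, slb_end: int,
                    br_pc: int, jump_pc: int, slb_size: int) -> bool:
    """Decide whether fetched data should also be written into the SLB."""
    if not slb_valid:
        # Empty SLB: fill it if the backward-jumping loop is short enough.
        return br_pc - jump_pc < slb_size
    # Valid SLB: overwrite only if the loop lies outside the cached range.
    return br_pc > slb_end or jump_pc < slb_start


# Assumed example values: SLB_SIZE = 256 B, cached range [0x100, 0x140].
first_fill = should_fill_slb(False, 0, 0, 0x140, 0x100, 256)
inner_loop = should_fill_slb(True, 0x100, 0x140, 0x130, 0x110, 256)
other_loop = should_fill_slb(True, 0x100, 0x140, 0x240, 0x200, 256)
```

Here the inner loop, which lies inside the cached outer loop, triggers no write, while the first fill and the unrelated loop do.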
The SLB read operation is described below by way of example.
Reading the SLB requires determining whether the SLB is hit.
Specifically, whether the current instruction hits the SLB can be determined from the PC value cur_pc of the loop body conditional branch instruction and its jump address jump_pc; that is, the hit condition is slb_valid & (cur_pc ≤ slb_end) & (jump_pc ≥ slb_start).
On an SLB hit, the SLB is indexed with addr = cur_pc - slb_start, and the data read from the 4 banks are concatenated to obtain the required 8 B of instruction data.
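The hit condition and index computation can be written out as a short Python sketch. The function names slb_hit and slb_index and the sample register values are illustrative assumptions; the expressions mirror the ones in the text.

```python
def slb_hit(slb_valid: bool, slb_start: int, slb_end: int,
            cur_pc: int, jump_pc: int) -> bool:
    """Hit when the SLB is valid and the loop lies inside the cached range."""
    return bool(slb_valid and cur_pc <= slb_end and jump_pc >= slb_start)


def slb_index(cur_pc: int, slb_start: int) -> int:
    """Offset used to index the SLB on a hit: addr = cur_pc - slb_start."""
    return cur_pc - slb_start


# Assumed register contents: the SLB caches PCs 0x100 through 0x140.
hit = slb_hit(True, 0x100, 0x140, cur_pc=0x130, jump_pc=0x110)
miss = slb_hit(True, 0x100, 0x140, cur_pc=0x240, jump_pc=0x200)
addr = slb_index(0x130, 0x100)
```

On a miss the required instructions would instead be fetched from the icache, as in step 5.3.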
The four 2 B pieces of data that are read are concatenated in order; the specific concatenation scheme depends on the value of addr:
if addr = 8N, then:
the address for accessing each bank is:
{bank3_addr,bank2_addr,bank1_addr,bank0_addr}={N,N,N,N},
the data read from the 4 banks are concatenated as follows:
{bank3_data,bank2_data,bank1_data,bank0_data};
if addr = 8N+1, then:
the address for accessing each bank is:
{bank3_addr,bank2_addr,bank1_addr,bank0_addr}={N,N,N,N+1},
the data read from the 4 banks are concatenated as follows:
{bank0_data,bank3_data,bank2_data,bank1_data};
if addr = 8N+2, then:
the address for accessing each bank is:
{bank3_addr,bank2_addr,bank1_addr,bank0_addr}={N,N,N+1,N+1},
the data read from the 4 banks are concatenated as follows:
{bank1_data,bank0_data,bank3_data,bank2_data};
if addr = 8N+3, then:
the address for accessing each bank is:
{bank3_addr,bank2_addr,bank1_addr,bank0_addr}={N,N+1,N+1,N+1},
the data read from the 4 banks are concatenated as follows:
{bank2_data,bank1_data,bank0_data,bank3_data}.
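The concatenation pattern of the four read cases can be sketched in Python. Note the assumption: the sketch indexes the SLB in 2 B units, h = 4N + k, where k = 0..3 plays the role of the offsets +0 to +3 in the cases above, whereas the text writes those cases as addr = 8N + k. The function name slb_read and the toy bank contents are hypothetical.

```python
NUM_BANKS = 4  # 4 banks x 2 B = 8 B per fetch


def slb_read(banks, h):
    """Read 8 B starting at 2 B-unit index h = 4N + k.

    Bank (k + i) mod 4 supplies bits [16*i+15 : 16*i] of the result.
    Banks below the starting bank k are read at row N + 1.
    """
    n, k = divmod(h, NUM_BANKS)
    word = 0
    for i in range(NUM_BANKS):
        bank = (k + i) % NUM_BANKS
        row = n + 1 if bank < k else n
        word |= banks[bank][row] << (16 * i)
    return word


# Toy contents: the 2 B unit with index j sits in bank j % 4, row j // 4,
# and simply stores the value j.
banks = [[4 * row + b for row in range(4)] for b in range(NUM_BANKS)]
word = slb_read(banks, 6)  # k = 2: {bank1, bank0, bank3, bank2}
```

For k = 2 the result places bank1 in the top 16 bits, followed by bank0, bank3 and bank2, which is the order listed for the +2 case.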
In an alternative embodiment, decoding and executing the current instruction when no valid record in the loop table matches comprises:
step 6.1: decoding the current instruction and determining, according to the decoding result, whether the current instruction is a conditional branch instruction; if not, executing step 6.2; if so, executing step 6.3;
step 6.2: performing no operation on the loop table;
step 6.3: determining, according to the execution result of the current instruction, whether it jumps in the direction of a smaller PC value; if not, executing step 6.2; if so, executing step 6.4;
step 6.4: determining that the current instruction is a loop body branch instruction and updating the loop table: writing the cur_br_pc and jump_pc corresponding to the current instruction into the loop table and setting the validity of the entry to valid = true.
Therefore, whether the current instruction belongs to the loop body conditional branch instruction or not is predicted through the table loop table; when the table loop table has no item which can be matched with the current instruction, whether the current instruction is a loop body conditional branch instruction is judged by detecting whether the execution result of the conditional branch instruction jumps backwards or not.
When the table loop table does not have an entry which can be matched with the current instruction, but the current instruction is determined to belong to the conditional branch instruction of the control loop body according to the execution result of the current instruction, the table loop table needs to be updated.
Branch prediction itself predicts the subsequent jump direction and jump address of a branch instruction from its historical jump behavior. Over a period of time, a program's instruction accesses exhibit locality; that is, the most recently executed instructions have a relatively higher probability of being executed again. Therefore, the most recently executed jump instructions and their predicted jump addresses should be kept in the loop table; that is, when a loop table entry has to be replaced, the oldest entry should be replaced.
Specifically, two indexes may be added to the loop table: oldest_entry_index, which indicates the oldest entry, and next_entry_index, which indicates the table index at which entry content will next be added or replaced. FIG. 4 of the specification shows an example of the loop table after initialization. At this point all entries in the loop table are invalid, i.e., there is no jump direction or jump address prediction for any conditional branch, and next_entry_index = oldest_entry_index = 0.
When new entry content needs to be added, oldest_entry_index is kept unchanged; the information set fed back by the instruction execution stage, i.e., the cur_br_pc1 of the loop body conditional branch instruction and its jump address jump_pc1, is written to the entry with index 0, valid is set to true, and next_entry_index is updated: next_entry_index = next_entry_index + 1 = 1. This add-entry operation may be repeated until next_entry_index = N-1, where N is the depth of the loop table. FIG. 5 of the specification shows an example of adding entry content to an initialized loop table.
Because the loop table has a limited depth, when every entry already holds the jump address of a valid loop body conditional branch instruction and new information needs to be written, a loop table entry must be overwritten. As in the entry replacement example shown in FIG. 6 of the specification, when valid == true for all entries of the loop table, the information to be written replaces the entry at oldest_entry_index, after which oldest_entry_index++ and next_entry_index++. This ensures that the loop table always holds the most recent branch jump addresses.
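The add and replace behavior described above can be modeled in software as follows. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the table is searched linearly, and both indexes wrap around modulo the table depth, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopEntry:
    cur_br_pc: int = 0    # PC value of the loop-controlling conditional branch
    jump_pc: int = 0      # predicted jump address of that branch
    valid: bool = False   # validity flag of the entry


class LoopTable:
    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.entries = [LoopEntry() for _ in range(depth)]
        self.oldest_entry_index = 0
        self.next_entry_index = 0

    def lookup(self, pc: int) -> Optional[int]:
        """Return the predicted jump_pc for pc, or None when no valid entry matches."""
        for entry in self.entries:
            if entry.valid and entry.cur_br_pc == pc:
                return entry.jump_pc
        return None

    def record(self, cur_br_pc: int, jump_pc: int) -> None:
        """Add an entry, or overwrite the oldest one when every entry is valid."""
        if all(entry.valid for entry in self.entries):
            slot = self.oldest_entry_index
            # Assumption: the index wraps around at the table depth.
            self.oldest_entry_index = (self.oldest_entry_index + 1) % self.depth
        else:
            slot = self.next_entry_index
        self.entries[slot] = LoopEntry(cur_br_pc, jump_pc, True)
        self.next_entry_index = (self.next_entry_index + 1) % self.depth
```

With a depth of 2, recording three branches in turn overwrites the first one, so a later lookup of its PC value no longer matches while the two most recent branches remain predictable.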
In the architecture optimization method for a processor loop body provided above, the short loop body cache allows short loop body branch instructions to be fetched more efficiently, more quickly, and with lower power consumption, which reduces the probability of pipeline bubbles and avoids, as far as possible, the loss of pipelined core performance caused by slow instruction fetch; the table recording the PC values of conditional branch instructions and their predicted jump addresses is looked up to achieve more accurate loop body prediction, so that conditional branch instructions controlling loop bodies can be detected quickly, the number of pipeline flushes is reduced, and processor performance is improved. The method can be widely applied to superscalar processors with higher performance requirements.
Example 2:
In this embodiment, referring to FIG. 7 of the specification, an architecture optimization device for a processor loop body is provided for implementing any of the above embodiments of the architecture optimization method for a processor loop body. The device includes at least:
a short loop body judging module, configured to judge whether the current instruction is a short loop body instruction;
a short loop body cache, configured to cache the current instruction when the current instruction is a short loop body instruction;
a loop body conditional branch prediction module, configured to predict whether the current instruction is a conditional branch instruction controlling a loop body by looking up the PC value of the current instruction in the loop table and judging whether a valid record in the loop table matches it, wherein the information recorded in the loop table includes the PC value cur_br_pc of a conditional branch instruction controlling a loop body, the predicted jump address jump_pc of that conditional branch instruction, and the validity flag valid of the corresponding entry;
and an instruction prefetch module, configured to prefetch subsequent instructions according to the jump_pc obtained from the table lookup when the current instruction is a conditional branch instruction controlling a loop body, wherein the subsequent instructions are obtained from the icache or the short loop buffer.
In an optional embodiment, the short loop body judging module includes:
a loop body judging submodule, configured to acquire the historical execution result of the current instruction and judge from it whether the current instruction is a loop body instruction;
and a short loop body judging submodule, configured to judge, when the current instruction is a loop body instruction, whether the length of the current instruction meets the short loop body length requirement, and to determine that the current instruction is a short loop body instruction if it does.
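One way to read the two submodules above is sketched below. Everything numeric here is an assumption made for illustration: a fixed 4-byte instruction width, a short loop buffer capacity of 16 instructions, and a length measured from jump_pc through cur_br_pc. The text itself does not define the length requirement in this passage, and the function names are hypothetical.

```python
INSTR_BYTES = 4      # assumed fixed instruction width in bytes
SLB_CAPACITY = 16    # assumed short loop buffer capacity, in instructions


def loop_body_length(cur_br_pc: int, jump_pc: int) -> int:
    """Number of instructions from the loop entry (jump_pc) through the closing branch."""
    return (cur_br_pc - jump_pc) // INSTR_BYTES + 1


def is_short_loop(cur_br_pc: int, jump_pc: int, jumped_backward_before: bool) -> bool:
    # Loop body judgment: the historical execution result must show a backward jump.
    if not jumped_backward_before:
        return False
    # Short loop body judgment: the whole body must fit the assumed length limit.
    return loop_body_length(cur_br_pc, jump_pc) <= SLB_CAPACITY
```

Under these assumptions, a loop closing at 0x120 and jumping back to 0x100 spans 9 instructions and qualifies as short, whereas one jumping back from 0x200 to 0x100 spans 65 instructions and does not.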
In an optional embodiment, the device further includes a loop table updating module, configured to judge, when no valid record in the loop table matches, whether to update the loop table according to the decoding result and the execution result.
In an optional embodiment, the loop table updating module includes:
a conditional branch instruction judging submodule, configured to judge from the decoding result whether the current instruction is a conditional branch instruction, and not to operate on the loop table if it is not;
a loop body branch instruction judging submodule, configured to judge, when the current instruction is a conditional branch instruction, from the execution result of the current instruction whether its jump direction is toward a smaller PC value, and not to operate on the loop table if it is not;
and a table updating submodule, configured to update the loop table when the current instruction is a loop body branch instruction by writing the cur_br_pc and jump_pc corresponding to the current instruction into the loop table and setting the validity flag valid of the entry to true.
In an optional embodiment, the instruction prefetch module further includes:
an SLB hit module, configured to judge whether the current instruction hits the short loop buffer;
if the short loop buffer is hit, the required instruction is fetched from the short loop buffer;
if the short loop buffer is not hit, the required instruction is fetched from the icache.
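The source selection performed by the SLB hit module can be sketched as follows. This is a simplified model, not the hardware design: the short loop buffer and the icache are represented as dictionaries keyed by address, the hit check is keyed on jump_pc, and the function name is hypothetical.

```python
from typing import Dict, Tuple


def prefetch_from_jump_pc(jump_pc: int,
                          short_loop_buffer: Dict[int, str],
                          icache: Dict[int, str]) -> Tuple[str, str]:
    """Return (source, instruction) for the instruction at jump_pc."""
    # Hit: the required instruction is taken from the short loop buffer.
    if jump_pc in short_loop_buffer:
        return "short_loop_buffer", short_loop_buffer[jump_pc]
    # Miss: fall back to the icache (assumed here to hold the instruction).
    return "icache", icache[jump_pc]
```

In this model the icache is consulted only on a short loop buffer miss, which mirrors the priority order given in the text.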
In the architecture optimization method and device for a processor loop body provided above, the short loop body cache allows short loop body branch instructions to be fetched more efficiently, more quickly, and with lower power consumption, which reduces the probability of pipeline bubbles and avoids, as far as possible, the loss of pipelined core performance caused by slow instruction fetch; the table recording the PC values of conditional branch instructions and their predicted jump addresses is looked up to achieve more accurate loop body prediction, so that conditional branch instructions controlling loop bodies can be detected quickly, the number of pipeline flushes is reduced, and processor performance is improved.
The order of the embodiments in this specification is for description only and does not indicate that any embodiment is better or worse than another. Specific embodiments have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the device embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiment.
Those skilled in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is only exemplary of the present disclosure and is not intended to limit it; all changes and equivalents that fall within the spirit and scope of the present disclosure, as set out in the appended claims, are intended to be covered by it.

Claims (6)

1. An architecture optimization method for a processor loop body, comprising the steps of:
acquiring the currently fetched instruction;
judging whether the current instruction is a short loop body instruction; if so, caching the current instruction in the short loop buffer of a short loop body cache; if not, not caching the current instruction;
looking up the PC value of the current instruction in a loop table and judging whether a valid record in the loop table matches the PC value of the current instruction, wherein the information recorded in the loop table comprises the PC value cur_br_pc of a conditional branch instruction controlling a loop body, the predicted jump address jump_pc of that conditional branch instruction, and the validity flag valid of the corresponding entry;
if a valid record in the loop table matches, determining that the current instruction is a conditional branch instruction controlling a loop body, prefetching subsequent instructions starting from the corresponding jump_pc in the table, and decoding and executing the current instruction, wherein the subsequent instructions are obtained from the icache or the short loop buffer;
if no valid record in the loop table matches, only decoding and executing the current instruction;
wherein, if no valid record in the loop table matches, only decoding and executing the current instruction comprises:
decoding the current instruction and judging from the decoding result whether the current instruction is a conditional branch instruction; if not, not operating on the loop table;
if the current instruction is a conditional branch instruction, judging from the execution result of the current instruction whether its jump direction is toward a smaller PC value;
if not, not operating on the loop table;
if so, determining that the current instruction is a loop body branch instruction and updating the loop table by writing the cur_br_pc and jump_pc corresponding to the current instruction into the loop table and setting the validity flag valid of the entry to true.
2. The method of claim 1, wherein judging whether the current instruction is a short loop body instruction comprises:
acquiring the historical execution result of the current instruction;
judging from the historical execution result of the current instruction whether the current instruction is a loop body instruction;
if the current instruction is a loop body instruction, judging whether the length of the current instruction meets the short loop body length requirement, and if it does, determining that the current instruction is a short loop body instruction.
3. The method of claim 1, wherein determining that the current instruction is a conditional branch instruction controlling a loop body and prefetching subsequent instructions starting from the corresponding jump_pc in the table comprises:
judging whether the current instruction hits the short loop buffer;
if the short loop buffer is hit, fetching the required instruction from the short loop buffer;
if the short loop buffer is not hit, fetching the required instruction from the icache.
4. An architecture optimization device for a processor loop body, for implementing the architecture optimization method for a processor loop body of any one of claims 1 to 3, characterized by comprising:
a short loop body judging module, configured to judge whether the current instruction is a short loop body instruction;
a short loop body cache, configured to cache the current instruction when the current instruction is a short loop body instruction;
a loop body conditional branch prediction module, configured to predict whether the current instruction is a conditional branch instruction controlling a loop body by looking up the PC value of the current instruction in a loop table and judging whether a valid record in the loop table matches it, wherein the information recorded in the loop table comprises the PC value cur_br_pc of a conditional branch instruction controlling a loop body, the predicted jump address jump_pc of that conditional branch instruction, and the validity flag valid of the corresponding entry;
an instruction prefetch module, configured to prefetch subsequent instructions according to the jump_pc obtained from the table lookup when the current instruction is a conditional branch instruction controlling a loop body, wherein the subsequent instructions are obtained from the icache or the short loop buffer;
a loop table updating module, configured to judge, when no valid record in the loop table matches, whether to update the loop table according to the decoding result and the execution result;
wherein the loop table updating module comprises:
a conditional branch instruction judging submodule, configured to judge from the decoding result whether the current instruction is a conditional branch instruction, and not to operate on the loop table if it is not;
a loop body branch instruction judging submodule, configured to judge, when the current instruction is a conditional branch instruction, from the execution result of the current instruction whether its jump direction is toward a smaller PC value, and not to operate on the loop table if it is not;
and a table updating submodule, configured to update the loop table when the current instruction is a loop body branch instruction by writing the cur_br_pc and jump_pc corresponding to the current instruction into the loop table and setting the validity flag valid of the entry to true.
5. The device of claim 4, wherein the short loop body judging module comprises:
a loop body judging submodule, configured to acquire the historical execution result of the current instruction and judge from it whether the current instruction is a loop body instruction;
and a short loop body judging submodule, configured to judge, when the current instruction is a loop body instruction, whether the length of the current instruction meets the short loop body length requirement, and to determine that the current instruction is a short loop body instruction if it does.
6. The architecture optimization device for a processor loop body of claim 4, wherein the instruction prefetch module further comprises:
an SLB hit module, configured to judge whether the current instruction hits the short loop buffer;
if the short loop buffer is hit, the required instruction is fetched from the short loop buffer;
if the short loop buffer is not hit, the required instruction is fetched from the icache.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210096815.9A CN114116010B (en) 2022-01-27 2022-01-27 Architecture optimization method and device for processor cycle body

Publications (2)

Publication Number Publication Date
CN114116010A (en) 2022-03-01
CN114116010B (en) 2022-05-03

Family

ID=80361972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210096815.9A Active CN114116010B (en) 2022-01-27 2022-01-27 Architecture optimization method and device for processor cycle body

Country Status (1)

Country Link
CN (1) CN114116010B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901133A (en) * 2009-06-01 2010-12-01 富士通株式会社 Messaging device and branch prediction method
CN106681695A (en) * 2015-11-09 2017-05-17 想象技术有限公司 Fetch ahead branch target buffer
CN106775591A (en) * 2016-11-21 2017-05-31 江苏宏云技术有限公司 A kind of hardware loop processing method and system of processor
CN111258654A (en) * 2019-12-20 2020-06-09 宁波轸谷科技有限公司 Instruction branch prediction method
CN112230992A (en) * 2019-07-15 2021-01-15 杭州中天微系统有限公司 Instruction processing device comprising branch prediction loop, processor and processing method thereof
CN113760371A (en) * 2020-06-01 2021-12-07 晶心科技股份有限公司 Method for branch prediction and microprocessor and data processing system thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7194606B2 (en) * 2004-09-28 2007-03-20 Hewlett-Packard Development Company, L.P. Method and apparatus for using predicates in a processing device
US9971393B2 (en) * 2015-12-16 2018-05-15 International Business Machines Corporation Dynamic workload frequency optimization

Also Published As

Publication number Publication date
CN114116010A (en) 2022-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant