CN116302114A - Compiler instruction scheduling optimization method for supporting instruction macro fusion CPU - Google Patents
Compiler instruction scheduling optimization method for supporting instruction macro fusion CPU
- Publication number
- CN116302114A (application CN202310161288.XA)
- Authority
- CN
- China
- Prior art keywords
- instruction
- fusion
- total
- scheduling
- instructions
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a compiler instruction scheduling optimization method for a CPU that supports instruction macro-fusion, comprising the following steps: step 1, adding the total latency value of each group of fusible instructions to the compiler's instruction description information; step 2, replacing each group of fusible instructions as a whole with a corresponding self-defined fusion total instruction, and having the scheduler perform instruction scheduling on the replaced instruction sequence; and step 3, after instruction scheduling is finished, replacing each self-defined fusion total instruction back with its corresponding group of fusible instructions. Because the fusible instructions take part in scheduling in the form of a fusion total instruction, the two instructions cannot be accidentally separated in a way that would prevent fusion, and the latency actually seen by the hardware is reflected, so the software scheduling algorithm can correctly eliminate pipeline bubbles. The effect of the optimization is confined to the scheduling stage, no other part of the compiler or the hardware needs to change, compatibility is strong, the algorithm of the existing scheduler does not need to be modified, and the scheduling result takes the impact on instruction macro-fusion into account.
Description
Technical Field
The invention relates to CPU compiler technology, and in particular to an instruction scheduling optimization method for the compiler back end.
Background
1. Instruction fusion
Instruction fusion is a dynamic process inside a processor in which two instructions are combined into a single operation or micro-operation (uop) sequence. Instructions stored in the processor's instruction queue (IQ) may be "fused" after being read out of the IQ and before being sent to the instruction decoder, or after being decoded by the instruction decoder.
Typically, instruction fusion that occurs before instruction decoding is referred to as "macro-fusion", while instruction fusion that occurs after instruction decoding (i.e., on uops) is referred to as "micro-fusion".
Macro-fusion effectively increases instruction throughput, reduces the latency of the fused instructions, raises IPC (instructions per cycle, the average number of instructions executed per clock cycle and an important measure of processor performance), and improves processor efficiency. This is especially valuable for RISC-V, where macro-fusion compensates for the simplicity of the instruction set: because the RISC-V reduced instruction set lacks the complex instructions found in other instruction sets, such as the load-pair (double GPR load) instructions of the ARM architecture, basic RISC-V instructions need to be macro-fused (for example, a lui/addi pair that materializes a constant), so that a RISC-V CPU can reach the same performance as CPUs of complex-instruction-set architectures when performing the same complex functions.
It is therefore desirable to find as many opportunities as possible to achieve instruction fusion.
2. Compiler support for instruction macro-fusion
Instruction macro-fusion requires both a hardware-level implementation and software-level support.
Software support mainly means that the compiler must identify fusible instruction pairs and arrange them according to the hardware's fusion requirements. For example, when the CPU requires an instruction pair to be adjacent in order to fuse, the compiler must place the fusible pair strictly adjacent, creating the opportunity for fusion. When the CPU supports out-of-order execution, fusion may not require strict adjacency, but the compiler must still place the instructions within a certain window to create the fusion opportunity.
Optimization in the compiler's scheduling phase typically uses list scheduling combined with heuristics. This only drives the scheduling result toward an optimal solution as far as possible, and scenarios such as instruction macro-fusion may not be covered. Compilers such as LLVM therefore provide an opportunity to redefine dependencies before scheduling completes, where software can carry out the processing related to instruction macro-fusion.
For example, suppose the current microarchitecture supports macro-fusion of instruction a and instruction b and requires that instruction b execute immediately after instruction a, while the DAG generated by the compiler for scheduling is as shown in Fig. 1.
From this DAG, the emitted instruction sequence could be either instruction a | instruction b | instruction c | instruction d, or instruction a | instruction c | instruction b | instruction d. To force the sequence instruction a | instruction b | instruction c | instruction d, the usual approach is to "add a mutation" between instruction b and instruction c, i.e., to force an added dependency edge that guarantees the required node order, as shown in Fig. 2.
With the DAG relationship of Fig. 2, the instructions are forced into the order instruction a | instruction b | instruction c | instruction d.
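The effect of "adding a mutation" can be pictured with a few lines of hypothetical code (not the compiler's actual API; the graph is the one of Fig. 1): once an artificial edge b -> c is inserted, any topological ordering of the DAG must place instruction b before instruction c, so the pair stays in the required order.

```cpp
// Hypothetical sketch: force "instruction b directly after instruction a"
// by adding an artificial edge b -> c to the DAG of Fig. 1 before ordering.
#include <cstdio>
#include <map>
#include <queue>
#include <string>
#include <vector>

int main() {
    // DAG of Fig. 1: b and c depend on a, d depends on b and c.
    std::map<std::string, std::vector<std::string>> succ = {
        {"a", {"b", "c"}}, {"b", {"d"}}, {"c", {"d"}}, {"d", {}}};

    // "Add a mutation": an artificial edge b -> c that is not a real data
    // dependency, only an ordering constraint created for macro-fusion.
    succ["b"].push_back("c");

    // Kahn's algorithm; with the extra edge the only valid order is a b c d.
    std::map<std::string, int> indeg = {{"a", 0}, {"b", 0}, {"c", 0}, {"d", 0}};
    for (const auto& [n, ss] : succ)
        for (const auto& s : ss) indeg[s]++;
    std::queue<std::string> ready;
    for (const auto& [n, d] : indeg)
        if (d == 0) ready.push(n);
    while (!ready.empty()) {
        std::string n = ready.front();
        ready.pop();
        std::printf("%s ", n.c_str());
        for (const auto& s : succ[n])
            if (--indeg[s] == 0) ready.push(s);
    }
    std::printf("\n");   // prints: a b c d
    return 0;
}
```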
3. Instruction scheduling principle
Instruction scheduling is a code performance optimization technique provided by compilers that exploits instruction-level parallelism so that programs execute efficiently on a central processor with an instruction pipeline. Scheduling is generally divided into static scheduling and dynamic scheduling, according to the stage at which it occurs.
However, both static scheduling and dynamic scheduling reorder the execution of instructions so as to:
(1) reduce bubbles in the instruction pipeline;
(2) increase IPC (instructions per cycle), the average number of instructions executed per clock cycle and an important measure of processor performance;
(3) relieve register pressure (a well-designed scheduler can, to some extent, shorten register live ranges).
If two dependent instructions are adjacent, for example instruction a immediately followed by instruction b, with instruction b depending on instruction a and instruction a having a latency greater than 1, then instruction b must wait out instruction a's latency before it can execute. When the CPU actually runs, a bubble therefore appears between the two instructions, as shown in Fig. 3.
When adjacent instructions have no dependency, the pipeline stays full and no bubble occurs; if instruction b depends on instruction a, a bubble is generated between instructions a and b.
In the prior art, for a specific CPU the compiler has each instruction's resource usage, latency, and similar information, so it has the opportunity to eliminate these bubbles as far as possible through scheduling while keeping the program's behavior unchanged.
In the example above, suppose instruction c has no dependency on instruction a or instruction b, and swapping the execution order of instruction c and instruction b does not change the overall behavior; instruction c can then be moved into the position where the bubble occurs to fill it as far as possible. If the total latency of the filler instructions is greater than or equal to instruction a's latency, the bubble can be completely eliminated, as shown in Fig. 4.
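For concreteness, the following minimal sketch (a hypothetical single-issue, in-order model with illustrative latencies, not the behavior of any specific CPU) counts the bubbles described above and shows how moving the independent instruction c between a and b hides part of a's latency.

```cpp
// Hypothetical single-issue, in-order pipeline model used only to count
// bubbles: an instruction can issue one cycle after the previous one, but
// no earlier than the cycle at which all of its operands become ready.
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct Instr {
    std::string name;
    int latency;                        // cycles until the result is ready
    std::vector<std::string> deps;      // producers whose results are read
};

static int bubbles(const std::vector<Instr>& seq) {
    std::map<std::string, int> ready;   // cycle at which each result is ready
    int cycle = 0, stalls = 0;
    for (const auto& i : seq) {
        int earliest = cycle + 1;
        for (const auto& d : i.deps) earliest = std::max(earliest, ready[d]);
        stalls += earliest - (cycle + 1);   // extra wait cycles = bubbles
        cycle = earliest;
        ready[i.name] = cycle + i.latency;
    }
    return stalls;
}

int main() {
    Instr a{"a", 3, {}}, b{"b", 1, {"a"}}, c{"c", 1, {}};
    std::printf("a b c : %d bubbles\n", bubbles({a, b, c}));   // 2 (b waits on a)
    std::printf("a c b : %d bubbles\n", bubbles({a, c, b}));   // 1 (c hides one)
    return 0;
}
```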
The technical problem is that the compiler performs instruction scheduling based on the individual latency information of each actual instruction and does not consider the change in latency after instruction macro-fusion; problems can therefore arise after macro-fusion, and the performance gain that macro-fusion brings is cancelled out.
For example, when instruction b depends on instruction a and instruction a's latency is 3, a -> b introduces a 2-cycle bubble, as shown in Fig. 5.
The compiler may schedule two other instructions x and y, each with latency 1, forming the sequence a -> x -> y -> b and eliminating the bubble between a and b, as shown in Fig. 6.
However, when the hardware fuses x and y, the total latency of the fused pair drops to 1. This saves 1 cycle, but after fusion the sequence a -> (x, y) -> b leaves b facing a 1-cycle bubble, so the overall performance before and after fusing x and y is unchanged, as shown in Fig. 7.
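A back-of-the-envelope check of these numbers (illustrative values; on a single-issue, in-order model a latency-L producer leaves L-1 stall cycles for an adjacent consumer, and each unit-latency instruction placed between them hides one of those cycles):

```cpp
// Cycle arithmetic for the a -> x -> y -> b example (illustrative values).
#include <algorithm>
#include <cstdio>

int main() {
    int latA = 3;                        // b depends on a; a's result is ready 3 cycles later
    int stallsAdjacent = latA - 1;       // a -> b alone: 2 bubbles
    int slotsUnfused = 2;                // scheduler's view: x and y sit between a and b
    int slotsFused = 1;                  // hardware's view: the fused (x, y) is one op
    std::printf("a -> b adjacent        : %d bubbles\n", stallsAdjacent);
    std::printf("scheduler's view (x,y) : %d bubbles\n",
                std::max(0, stallsAdjacent - slotsUnfused));   // 0
    std::printf("hardware after fusion  : %d bubbles\n",
                std::max(0, stallsAdjacent - slotsFused));     // 1
    return 0;
}
```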
The root cause of this scheduling deficiency is that the scheduler schedules based on the latency information of individual instructions, without considering the impact of instruction macro-fusion.
The related prior art can be seen in the following patent documents: CN115357230A, CN105378683A, CN104050077A, CN104050026A, CN104049945A, CN103870243A.
Disclosure of Invention
To solve the above technical problem, the invention provides a compiler instruction scheduling optimization method for a CPU that supports instruction macro-fusion. The adopted technical scheme extends the scheduler's knowledge of instructions to cover the total latency value after instruction macro-fusion, so that the effect of fusion is taken into account during scheduling and no bubble appears after macro-fusion at a position that originally had none. The method is as follows:
a compiler instruction scheduling optimization method for supporting an instruction macro fusion CPU comprises the following steps:
and 3, after the instruction scheduling is finished, replacing each self-defined fusion total instruction one by one to each corresponding fusion instruction group.
The invention has the following beneficial technical effects:
1. Before scheduling, each group of fusible instructions is replaced by a self-defined fusion total instruction, which takes part in scheduling as a single unit. The two instructions therefore cannot be accidentally separated in a way that would prevent fusion, and the latency actually seen by the hardware is reflected, so the software scheduling algorithm can eliminate bubbles correctly.
2. After scheduling, each fusion total instruction is replaced back with its corresponding group of fusible instructions, i.e., the actual original CPU instructions, so the effect of the optimization is confined to the scheduling stage; no other part of the compiler or the hardware needs to change, and compatibility is strong.
3. Through instruction replacement, the scheduling result takes the effect on instruction macro-fusion into account without modifying the algorithm of the existing scheduler.
Drawings
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a schematic diagram of the compiler scheduling DAG before a mutation is added;
FIG. 2 is a schematic diagram of the compiler scheduling DAG after a mutation is added;
FIG. 3 is a diagram illustrating a bubble caused by a dependency between adjacent instructions;
FIG. 4 is a diagram illustrating elimination of a bubble by scheduling instruction c;
FIG. 5 is a diagram showing instruction a and instruction b generating a 2-cycle bubble;
FIG. 6 is a diagram illustrating elimination of a bubble by scheduling instruction x and instruction y;
FIG. 7 is a diagram showing the special scenario in which the cycle count is not reduced before and after the fusion of instruction x and instruction y;
FIG. 8 is a diagram showing a new bubble generated after the fusion of instruction x and instruction y;
FIG. 9 is a diagram showing that, once the scheduler knows the latency value after instruction x and instruction y are fused, it eliminates the bubble by scheduling instruction z;
FIG. 10 is a schematic diagram of the DAG for instructions 1 to 4.
Detailed Description
The technical terms used in the invention are defined as follows:
1. Latency: the number of clock cycles required to fully execute an instruction; the unit is a cycle.
2. Bubble: the hardware blocks (stalls) the execution of a following instruction, delaying it relative to the preceding instruction; this is called pipeline blocking, and the delay it causes is called a bubble.
3. Dependency: to complete a function, several instructions sometimes need to interact, and some instructions have a dependency relationship, i.e., one instruction can complete only by relying on the execution results of other instructions.
4. DAG: in graph theory, a directed graph is a directed acyclic graph (DAG, Directed Acyclic Graph) if it is impossible to start at any vertex and return to that vertex by following a sequence of edges.
The DAG discussed here is the DAG that the compiler constructs in the scheduling stage, after instruction selection, from the instructions of each basic block of the program and the dependency relationships between them; its purpose is local optimization. Consider the following basic block:
Instruction 1: a = b + c
Instruction 2: b = a - d
Instruction 3: c = b + c
Instruction 4: d = a - c
The corresponding DAG is shown in FIG. 10.
In other words, a DAG here is a directed acyclic graph representing the data dependencies between the instructions of a basic block. Based on those data dependencies, it can be written as DAG = (V, E), where V is the set of nodes corresponding to all instructions in the program and E is the set of data-dependency edges between instructions (a small construction sketch is given after these definitions).
5. Adding a mutation to the DAG: as described above, the DAG takes the basic block's instructions as nodes and the dependencies between instructions as edges; adding a mutation means adding an edge to the DAG, i.e., forcing a dependency relationship between two instructions.
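The DAG = (V, E) construction for the basic block above can be sketched as follows (hypothetical code; only read-after-write data dependences are tracked, and anti-/output dependences are ignored for brevity).

```cpp
// Hypothetical sketch: build the RAW-dependence edges E of DAG = (V, E)
// for the four-instruction basic block above.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct Instr {
    int id;
    char dst;           // variable written by the instruction
    std::string srcs;   // variables read by the instruction
};

int main() {
    std::vector<Instr> block = {
        {1, 'a', "bc"},   // instruction 1: a = b + c
        {2, 'b', "ad"},   // instruction 2: b = a - d
        {3, 'c', "bc"},   // instruction 3: c = b + c
        {4, 'd', "ac"},   // instruction 4: d = a - c
    };
    std::map<char, int> lastWriter;      // variable -> defining instruction
    for (const auto& ins : block) {
        for (char s : ins.srcs) {
            auto it = lastWriter.find(s);
            if (it != lastWriter.end())  // value produced inside the block
                std::printf("edge: %d -> %d (through %c)\n", it->second, ins.id, s);
        }
        lastWriter[ins.dst] = ins.id;
    }
    // Prints the edges 1->2 (a), 2->3 (b), 1->4 (a), 3->4 (c).
    return 0;
}
```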
The invention discloses a compiler instruction scheduling optimization method for a CPU supporting instruction macro-fusion, which comprises the following steps:
and step 1, adding the total latency value information of each group of fused instructions into the instruction description information of the compiler. a. Custom fusion general instruction: classifying the macro fusion situation of all instructions in instruction description information of a compiler, defining each macro fusion situation of the instructions as a fusion total instruction, representing the total of a group of macro fusion instructions, defining the name of each fusion total instruction at the rear end of the compiler, but not distributing machine codes for the fusion total instruction, distributing a special computing resource for each fusion total instruction, and defining the special computing resources; b. the correct latency value is set: in the scheduling model file, the latency value of the special computing resource fused by all the instruction macros is set as the total latency value of a group of fused instructions, and the special computing resource is distributed to each corresponding fused total instruction, namely, the mapping of the total latency value fused by the instruction macros to the corresponding fused total instruction is realized, and the total latency value information fused by the instruction macros is supplemented into the scheduler in this way.
Step 2: replace each group of fusible instructions as a whole with the corresponding self-defined fusion total instruction, and have the scheduler schedule the replaced instruction sequence. After the instruction sequence has been generated, it is scanned and each group of fusible instructions is replaced as a whole by its corresponding self-defined fusion total instruction; the scheduler then schedules the replaced sequence, i.e., it schedules according to the total latency value after instruction macro-fusion, and the scheduling stage is carried out on that basis. Once each group of fusible instructions has been replaced, the scheduler sees the correct post-fusion latency values and, working with its existing algorithm, can correctly eliminate the bubbles that instruction macro-fusion would otherwise introduce.
Step 3: after instruction scheduling is finished, scan the instruction sequence again and replace each self-defined fusion total instruction, one by one, with its corresponding group of fusible instructions, i.e., the real original CPU instructions.
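Steps 2 and 3 can be sketched as two small passes around the existing scheduler (hypothetical code and names; the real replacement works on the compiler's internal instruction sequence): fusible adjacent groups are collapsed into their fusion total instruction before scheduling and expanded back into the original CPU instructions afterwards.

```cpp
// Hypothetical replacement (step 2) and restoration (step 3) passes.
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Seq = std::vector<std::string>;

// Fusible adjacent pair -> pseudo fusion total instruction.
static const std::map<std::pair<std::string, std::string>, std::string> kFusePairs =
    {{{"X", "Y"}, "FUSED_XY"}};
// Pseudo fusion total instruction -> original group of CPU instructions.
static const std::map<std::string, Seq> kExpand = {{"FUSED_XY", {"X", "Y"}}};

Seq replaceFusiblePairs(const Seq& in) {           // before scheduling
    Seq out;
    for (size_t i = 0; i < in.size(); ++i) {
        if (i + 1 < in.size()) {
            auto it = kFusePairs.find({in[i], in[i + 1]});
            if (it != kFusePairs.end()) { out.push_back(it->second); ++i; continue; }
        }
        out.push_back(in[i]);
    }
    return out;
}

Seq expandFusedInstrs(const Seq& in) {             // after scheduling
    Seq out;
    for (const auto& op : in) {
        auto it = kExpand.find(op);
        if (it != kExpand.end()) out.insert(out.end(), it->second.begin(), it->second.end());
        else out.push_back(op);
    }
    return out;
}

static void print(const char* tag, const Seq& s) {
    std::printf("%s:", tag);
    for (const auto& op : s) std::printf(" %s", op.c_str());
    std::printf("\n");
}

int main() {
    Seq generated = {"A", "X", "Y", "Z", "B"};
    Seq forScheduler = replaceFusiblePairs(generated);   // A FUSED_XY Z B
    // ... the existing scheduler runs here on forScheduler, unmodified ...
    Seq emitted = expandFusedInstrs(forScheduler);       // A X Y Z B
    print("generated    ", generated);
    print("for scheduler", forScheduler);
    print("emitted      ", emitted);
    return 0;
}
```

The scheduler between the two passes runs unmodified; it simply sees the fusion total instruction as one more instruction with its own latency.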
For example, when instruction b depends on instruction a and instruction a's latency is 3, the sequence a -> b introduces a 2-cycle bubble; a further instruction sequence x -> y -> z is available, each instruction with latency 1.
Without considering instruction macro-fusion, the compiler may generate the sequence a -> x -> y -> b -> z: from the compiler's point of view there are 2 instructions between a and b, hiding 2 cycles of latency, which is enough to eliminate the bubble between a and b. However, when x and y are fused, a bubble still exists between a and b when the hardware actually executes, as shown in Fig. 8.
With the technical scheme of the invention, a new fusion total instruction xy is first defined with latency 1, and the instruction sequence x -> y -> z is rewritten as xy -> z. After scheduling, the compiler produces the sequence a -> xy -> z -> b, eliminating the bubble between a and b, as shown in Fig. 9. After scheduling is finished, the fusion total instruction is replaced back with the actual original CPU instructions, giving the sequence a -> x -> y -> z -> b.
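Under the same simple single-issue, in-order model as the earlier sketch (illustrative latencies; the hardware is assumed to fuse an adjacent x, y pair into one latency-1 operation), the two final instruction sequences of this example can be compared directly:

```cpp
// Execution-time bubble count: fusion-unaware vs fusion-aware schedule.
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

struct Op { std::string name; int latency; std::vector<std::string> deps; };

// Model of what the hardware executes: adjacent "x","y" become one fused op.
static std::vector<Op> fuseAdjacentXY(const std::vector<Op>& seq) {
    std::vector<Op> out;
    for (size_t i = 0; i < seq.size(); ++i) {
        if (i + 1 < seq.size() && seq[i].name == "x" && seq[i + 1].name == "y") {
            out.push_back({"xy", 1, seq[i].deps});   // fused total latency = 1
            ++i;
        } else {
            out.push_back(seq[i]);
        }
    }
    return out;
}

static int bubbles(const std::vector<Op>& seq) {     // single-issue, in-order
    std::map<std::string, int> ready;
    int cycle = 0, stalls = 0;
    for (const auto& op : seq) {
        int earliest = cycle + 1;
        for (const auto& d : op.deps) earliest = std::max(earliest, ready[d]);
        stalls += earliest - (cycle + 1);
        cycle = earliest;
        ready[op.name] = cycle + op.latency;
    }
    return stalls;
}

int main() {
    Op a{"a", 3, {}}, b{"b", 1, {"a"}}, x{"x", 1, {}}, y{"y", 1, {}}, z{"z", 1, {}};
    std::vector<Op> fusionUnaware = {a, x, y, b, z};   // scheduler ignored fusion
    std::vector<Op> fusionAware   = {a, x, y, z, b};   // expanded a -> xy -> z -> b
    std::printf("a x y b z : %d bubble(s) at execution\n",
                bubbles(fuseAdjacentXY(fusionUnaware)));   // 1, as in Fig. 8
    std::printf("a x y z b : %d bubble(s) at execution\n",
                bubbles(fuseAdjacentXY(fusionAware)));     // 0, as in Fig. 9
    return 0;
}
```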
Relationship between the total latency value after fusion and the sum of the latency values before fusion:
1. Sum of latency values before fusion: the sum of the instructions' individual latency values, ignoring the fact that instruction fusion occurs.
2. Total latency value after fusion: the total latency required to execute the fused group on hardware that supports the fusion, when instruction fusion occurs.
3. Total latency value after fusion < sum of latency values before fusion: the point of instruction fusion is that when two specific instructions are adjacent in a certain order they can be executed in parallel. If executing the two instructions separately takes (x1 + x2) cycles, then after the group is fused executing them takes y cycles with y < x1 + x2; that is, the total latency value after fusion is smaller than the sum of the latency values before fusion.
4. Significance of setting a total latency value after fusion: once instruction fusion is implemented in hardware, the tool chain cannot set a new fused latency value for every fusible instruction pair the way it does for an ordinary instruction, because the tool chain (taking LLVM as an example) can only attach a latency value to a single instruction, i.e., each instruction corresponds to exactly one latency value. Setting the fused total latency value directly on the individual instructions would conflict with their own latency values, since the instructions that make up a fusible pair also appear outside the fusible scenario, either alone or in an order that does not satisfy the fusion rule, in which case the scheduler should schedule them according to their individual latency values.
The above is only a preferred embodiment of the invention, and the scope of the invention is not limited thereto. Any feature that, based on the invention, uses substantially the same means to realize substantially the same function with substantially the same effect, including a replacement that a person of ordinary skill in the art could conceive without creative effort, also falls within the protection scope of the invention.
Claims (4)
1. A compiler instruction scheduling optimization method for a CPU supporting instruction macro-fusion, comprising the following steps:
step 1, adding the total latency value of each group of fusible instructions to the compiler's instruction description information;
step 2, replacing each group of fusible instructions as a whole with a corresponding self-defined fusion total instruction, and having the scheduler perform instruction scheduling on the replaced instruction sequence; and
step 3, after instruction scheduling is finished, replacing each self-defined fusion total instruction back with its corresponding group of fusible instructions.
2. The compiler instruction scheduling optimization method for a CPU supporting instruction macro-fusion according to claim 1, wherein step 1 comprises:
a. defining custom fusion total instructions
classifying all macro-fusion cases of instructions in the compiler's instruction description information, defining each macro-fusion case as a fusion total instruction that stands for the whole group of macro-fused instructions, defining a name for each fusion total instruction in the compiler back end without assigning it a machine encoding, and allocating and defining a dedicated compute resource for each fusion total instruction;
b. setting the correct latency value
in the scheduling model file, setting the latency value of each dedicated compute resource for instruction macro-fusion to the total latency value of the corresponding group of fused instructions, and assigning the dedicated resource to the corresponding fusion total instruction, thereby mapping the post-fusion total latency value onto the corresponding fusion total instruction and supplying the scheduler with the total latency information for instruction macro-fusion.
3. The compiler instruction scheduling optimization method for a CPU supporting instruction macro-fusion according to claim 1, wherein step 2 comprises: after the instruction sequence has been generated, scanning the instruction sequence, replacing each group of fusible instructions as a whole with its corresponding self-defined fusion total instruction, and having the scheduler schedule the replaced sequence, i.e., schedule according to the total latency value after instruction macro-fusion, thereby carrying out the scheduling stage.
4. The compiler instruction scheduling optimization method for a CPU supporting instruction macro-fusion according to claim 1, wherein step 3 comprises: after instruction scheduling is finished, scanning the instruction sequence again and replacing each self-defined fusion total instruction, one by one, with its corresponding group of fusible instructions, i.e., the real original CPU instructions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310161288.XA CN116302114B (en) | 2023-02-24 | 2023-02-24 | Compiler instruction scheduling optimization method for supporting instruction macro fusion CPU |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310161288.XA CN116302114B (en) | 2023-02-24 | 2023-02-24 | Compiler instruction scheduling optimization method for supporting instruction macro fusion CPU |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116302114A true CN116302114A (en) | 2023-06-23 |
CN116302114B CN116302114B (en) | 2024-01-23 |
Family
ID=86814263
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310161288.XA Active CN116302114B (en) | 2023-02-24 | 2023-02-24 | Compiler instruction scheduling optimization method for supporting instruction macro fusion CPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116302114B (en) |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05265769A (en) * | 1992-03-18 | 1993-10-15 | Fujitsu Ltd | Instruction scheduling processing method for compiler |
CN1670699A (en) * | 2004-03-19 | 2005-09-21 | 中国科学院计算技术研究所 | A micro-dispatching method supporting directed cyclic graph |
CN101866281A (en) * | 2010-06-13 | 2010-10-20 | 清华大学 | Multi-cycle instruction execution method and device |
CN102200924A (en) * | 2011-05-17 | 2011-09-28 | 北京北大众志微系统科技有限责任公司 | Modulus-scheduling-based compiling method and device for realizing circular instruction scheduling |
CN102799418A (en) * | 2012-08-07 | 2012-11-28 | 清华大学 | Processor architecture and instruction execution method integrating sequence and VLIW (Very Long Instruction Word) |
CN105190579A (en) * | 2013-03-15 | 2015-12-23 | 索夫特机械公司 | A method for implementing a line speed interconnect structure |
CN105190541A (en) * | 2013-03-15 | 2015-12-23 | 索夫特机械公司 | A method for executing blocks of instructions using a microprocessor architecture having a register view, source view, instruction view, and a plurality of register templates |
WO2014154917A1 (en) * | 2013-03-27 | 2014-10-02 | Intel Corporation | Mechanism for facilitating the dynamic and efficient merging of computer instructions in software programs |
CN110692039A (en) * | 2017-05-26 | 2020-01-14 | 微软技术许可有限责任公司 | Microprocessor instruction pre-dispatch prior to block commit |
CN113196244A (en) * | 2018-12-10 | 2021-07-30 | 斯法夫股份有限公司 | Macro operation fusion |
CN110543121A (en) * | 2019-08-30 | 2019-12-06 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Instruction synchronous distribution control device of full-digital phased array system |
CN112527393A (en) * | 2019-09-18 | 2021-03-19 | 无锡江南计算技术研究所 | Instruction scheduling optimization device and method for master-slave fusion architecture processor |
CN112527304A (en) * | 2019-09-19 | 2021-03-19 | 无锡江南计算技术研究所 | Self-adaptive node fusion compiling optimization method based on heterogeneous platform |
CN111930428A (en) * | 2020-09-27 | 2020-11-13 | 南京芯瞳半导体技术有限公司 | Method and device for fusing conditional branch instructions and computer storage medium |
CN115576608A (en) * | 2022-09-29 | 2023-01-06 | 平头哥(上海)半导体技术有限公司 | Processor core, processor, chip, control equipment and instruction fusion method |
Non-Patent Citations (2)
Title |
---|
MARIA LAURA GATTO: "Biomechanical performances of PCL/HA micro- and macro-porous lattice scaffolds fabricated via laser powder bed fusion for bone tissue engineering", MATERIALS SCIENCE AND ENGINEERING: C, vol. 128 * |
YU XIAOJIANG; LUO XIN: "Instruction latency scheduling based on the RISC-V GCC compiler", Electronic Technology & Software Engineering, no. 08 *
Also Published As
Publication number | Publication date |
---|---|
CN116302114B (en) | 2024-01-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP4042604B2 (en) | Program parallelization apparatus, program parallelization method, and program parallelization program | |
EP2965198B1 (en) | Reducing excessive compilation times | |
US6817013B2 (en) | Program optimization method, and compiler using the same | |
EP2677424B1 (en) | OpenCL compilation | |
US9996325B2 (en) | Dynamic reconfigurable compiler | |
JP5411587B2 (en) | Multi-thread execution device and multi-thread execution method | |
US6760906B1 (en) | Method and system for processing program for parallel processing purposes, storage medium having stored thereon program getting program processing executed for parallel processing purposes, and storage medium having stored thereon instruction set to be executed in parallel | |
US20120198427A1 (en) | Ensuring Register Availability for Dynamic Binary Optimization | |
US20060277529A1 (en) | Compiler apparatus | |
US20050289530A1 (en) | Scheduling of instructions in program compilation | |
JPH01108638A (en) | Parallelized compilation system | |
US20130305021A1 (en) | Method for convergence analysis based on thread variance analysis | |
EP3908920B1 (en) | Optimizing hardware fifo instructions | |
US7712091B2 (en) | Method for predicate promotion in a software loop | |
CN116302114B (en) | Compiler instruction scheduling optimization method for supporting instruction macro fusion CPU | |
EP2677423A2 (en) | OpenCL compilation | |
JP2011138494A (en) | Relational modeling for performance analysis of multi-core processor using virtual task | |
JP2008276547A (en) | Program processing method and information processor | |
US8806466B2 (en) | Program generation device, program production method, and program | |
US8768678B1 (en) | Scheduling processes in simulation of a circuit design based on simulation costs and runtime states of HDL processes | |
US20060200648A1 (en) | High-level language processor apparatus and method | |
Tasca et al. | Enhanced architecture for programmable logic controllers targeting performance improvements | |
KR20140122564A (en) | Apparatus and method for calculating physical address of register in processor | |
JP6528769B2 (en) | INFORMATION PROCESSING APPARATUS, PROCESSING METHOD, AND PROGRAM | |
JP6776914B2 (en) | Parallelization method, parallelization tool |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |