CN116974572A - Memory access address calculation optimization method and device based on loop stripping

Info

Publication number: CN116974572A
Application number: CN202310825932.9A
Authority: CN
Inventors: 王耀华; 刘昕睿; 郭阳; 扈啸; 李哲; 文梅; 陈照云; 时洋; 张天
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-07-06
Filing date: 2023-07-06
Publication date: 2023-10-31

Abstract

The application discloses a memory access address calculation optimization method and device based on loop stripping. The application strips the memory access address calculation out of the core loop at the linear assembly level, so that the loop body can be implemented with fewer statements. This reduces redundant address calculation in the core loop body, lowers the total number of executed instructions and the execution time cost of the core loop, and improves program efficiency; the method is applicable to all loops.

Description

Memory access address calculation optimization method and device based on loop stripping

Technical Field

The application relates to the technical field of program compilation for linear assembly language, and in particular to a memory access address calculation optimization method and device based on loop stripping.

Background

Linear assembly language is a programming language that sits between assembly language and high-level programming languages: its syntax is simpler than that of assembly language, and it is more efficient than a high-level language. Linear assembly language has the following three advantages: (1) registers do not need to be allocated manually; (2) the assignment of functional units, the scheduling of instruction beats, and the filling of delay slots do not need to be considered; (3) parallel scheduling of the code does not need to be designed by hand, because the linear assembly compiler performs it automatically, which improves coding efficiency. At present, the common linear assembly optimization methods are loop unrolling and software pipelining. Loop unrolling unrolls the loop, which reduces the total number of branch instructions and increases instruction parallelism. Software pipelining shortens the critical path of the loop body and improves code execution efficiency. However, existing linear assembly optimization methods still produce code that is not concise enough and achieve a limited optimization effect.

Disclosure of Invention

The technical problem to be solved by the application is as follows: in view of the above problems in the prior art, the application provides a memory access address calculation optimization method and device based on loop stripping. By analyzing and reconstructing the memory access address calculation, the application strips that calculation out of the core loop at the linear assembly level, so that the loop body can be implemented with fewer statements. This reduces redundant address calculation in the core loop body, lowers the total number of executed instructions and the execution time cost of the core loop, and improves program efficiency; the method is applicable to all loops.

To solve the above technical problem, the application adopts the following technical solution:

A memory access address calculation optimization method based on loop stripping comprises the following steps:

S101: determining the base address calculation expression and the offset calculation expression in the loop body of the original linear assembly code to be optimized, and analyzing the loop body to obtain the dependent variables used in the base address calculation expression and the offset calculation expression, together with the initial value and step of each dependent variable;

S102: if the initial value and step of each dependent variable are either determined values that can be recorded or expressions that do not contain a dependent variable of the present loop, reconstructing the base address calculation expression and the offset calculation expression into a form in which multiple factors are added and subtracted, where each factor contains at most one dependent variable, and proceeding to the next step; otherwise, ending and exiting;

S103: when all factors in the reconstructed offset calculation expression contain only a single dependent variable, first substituting the initial value of the dependent variable into the base address calculation expression and the step of the dependent variable into the offset calculation expression; then hoisting the base address calculation expression and the offset calculation expression out of the loop body to before the loop body and representing each with an intermediate variable; and finally replacing the base address calculation expression and the offset calculation expression in the loop body with the corresponding intermediate variables, so that the base address and the offset no longer need to be calculated in the loop body.

Optionally, when all factors in the reconstructed offset calculation expression contain only a single dependent variable, step S103 further comprises: first adding the original base address calculation expression and offset calculation expression determined in step S103; then splitting the sum into a form in which multiple factors are added and subtracted; and finally merging the factors that do not contain the dependent variable of the present loop, hoisting the merged result to before the loop body, representing it with an intermediate variable, and replacing the corresponding part in the loop body with that intermediate variable.

Optionally, hoisting to before the loop body means moving the code from the loop body to the last code block before the loop body.

Optionally, in step S103, using intermediate variables to represent the base address calculation expression and the offset calculation expression means using the intermediate variable AR to represent the result of the base address calculation expression and the intermediate variable OR to represent the result of the offset calculation expression.

Optionally, after the base address calculation expression and the offset calculation expression in the loop body are replaced with the corresponding intermediate variables in step S103, the method further includes replacing the memory access address addr of the variable with the form AR++[OR], where addr is the original memory access address of the variable, AR is the intermediate variable representing the base address calculation expression, and OR is the intermediate variable representing the offset calculation expression.

Optionally, step S101 is preceded by a step of semantically lowering the C-like language code corresponding to the original linear assembly code to be optimized to the linear assembly language level, so as to obtain the original linear assembly code to be optimized.

Optionally, step S103 is followed by a step of converting the optimized linear assembly code into assembly code and compiling the assembly code to obtain an executable program.

In addition, the application also provides a memory access address calculation optimization device based on loop stripping, comprising:

an assembly code analysis program unit, configured to determine the base address calculation expression and the offset calculation expression in the loop body of the original linear assembly code to be optimized, and to analyze the loop body to obtain the dependent variables used in the base address calculation expression and the offset calculation expression, together with the initial value and step of each dependent variable;

a dependent variable judging program unit, configured to, if the initial value and step of each dependent variable are either determined values that can be recorded or expressions that do not contain a dependent variable of the present loop, reconstruct the base address calculation expression and the offset calculation expression into a form in which multiple factors are added and subtracted, where each factor contains at most one dependent variable, and pass execution to the expression replacement program unit; and otherwise to end and exit;

an expression replacement program unit, configured to, when all factors in the reconstructed offset calculation expression contain only a single dependent variable, first substitute the initial value of the dependent variable into the base address calculation expression and the step of the dependent variable into the offset calculation expression; then hoist the base address calculation expression and the offset calculation expression out of the loop body to before the loop body and represent each with an intermediate variable; and finally replace the base address calculation expression and the offset calculation expression in the loop body with the corresponding intermediate variables, so that the base address and the offset no longer need to be calculated in the loop body.

In addition, the application also provides a memory access address calculation optimization device based on loop stripping, which comprises a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the memory access address calculation optimization method based on loop stripping.

Furthermore, the application provides a computer-readable storage medium in which a computer program is stored, the computer program being programmed or configured to be executed by a microprocessor to perform the memory access address calculation optimization method based on loop stripping.

Compared with the prior art, the application has the following advantages: by analyzing and reconstructing the memory access address calculation, the application strips that calculation out of the core loop at the linear assembly level, so that the loop body can be implemented with fewer statements. This shortens the core loop body and the corresponding total execution length and greatly reduces redundant address calculation in the core loop body. The resulting code runs faster than ordinary loop linear assembly, with a lower total number of executed instructions and lower execution time cost for the core loop, which improves program efficiency. The method is applicable to all loops and has good reliability.

Drawings

FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present application.

FIG. 2 is a schematic diagram of the complete upstream and downstream flow of the method according to an embodiment of the present application.

Detailed Description

As shown in FIG. 1, the memory access address calculation optimization method based on loop stripping in this embodiment includes:

S101: determine the base address calculation expression and the offset calculation expression in the loop body of the original linear assembly code to be optimized, and analyze the loop body to obtain the dependent variables used in those expressions, together with the initial value and step of each dependent variable. Here, a dependent variable is a variable whose value is affected by the loop iteration count, and the base address expression and the offset expression are the expressions that compute the base address and the offset, respectively. For example, in C a value is read by indexing a[i+1]; when this is lowered to the linear assembly level, a is the base address expression and i+1 is the offset expression, a+i+1 gives the access address, and the value of a[i+1] is finally loaded from that address. The base address calculation expression and the offset calculation expression are defined in the loop body of the original linear assembly code and can be obtained from those definitions. For example, in this embodiment, the base address calculation expression extracted from the loop body of a certain original linear assembly code is:

Conv2d_NCHW_ft_out_transform_write_cach,

and the offset calculation expression is:

16*a_inner+7168*j_inner+64*k_inner,

where a_inner is the dependent variable of the current loop.
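As a minimal sketch of what step S101 records for this example, the expressions can be held in a small data structure. The patent gives no code for this step, so the `AccessExpr` class, the term-map layout, and the helper functions below are illustrative assumptions; the initial value 0 and step 1 are the values this embodiment states for a_inner.

```python
from dataclasses import dataclass


@dataclass
class AccessExpr:
    """Hypothetical record of one memory access found in a loop body."""
    base: str            # symbolic base address (name taken from the example)
    offset_terms: dict   # variable name -> integer coefficient
    dependent: dict      # dependent variable -> (initial value, step)


# Offset 16*a_inner + 7168*j_inner + 64*k_inner from the example above.
access = AccessExpr(
    base="Conv2d_NCHW_ft_out_transform_write_cach",
    offset_terms={"a_inner": 16, "j_inner": 7168, "k_inner": 64},
    dependent={"a_inner": (0, 1)},  # initial value 0, step 1 in this embodiment
)


def address(expr, base_value, env):
    """Evaluate base + offset for concrete variable values."""
    return base_value + sum(c * env[v] for v, c in expr.offset_terms.items())


def split_terms(expr):
    """Separate offset terms that change per iteration from loop-invariant ones."""
    varying = {v: c for v, c in expr.offset_terms.items() if v in expr.dependent}
    invariant = {v: c for v, c in expr.offset_terms.items() if v not in expr.dependent}
    return varying, invariant
```

Here `split_terms(access)` separates the single a_inner term from the j_inner and k_inner terms, which is the information the later steps rely on.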

S102: if the initial value and step of each dependent variable are either determined values that can be recorded or expressions that do not contain a dependent variable of the present loop, reconstruct the base address calculation expression and the offset calculation expression into a form in which multiple factors are added and subtracted, where each factor contains at most one dependent variable (that is, one or zero; every use of a dependent variable is counted separately, so in a*a, for example, a is used twice and the count is two), and proceed to the next step; otherwise, end and exit.
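The factor condition in S102 can be sketched as a simple count. This is an illustration only: representing each factor as a list of the variable names it multiplies, and the function names themselves, are assumptions rather than anything the patent specifies.

```python
def max_dependent_uses(factors, dependent_vars):
    """Largest number of dependent-variable uses found in any single factor."""
    return max(
        (sum(1 for name in factor if name in dependent_vars) for factor in factors),
        default=0,
    )


def is_fully_strippable(factors, dependent_vars):
    """True when every factor contains at most one use of a dependent variable."""
    return max_dependent_uses(factors, dependent_vars) <= 1


# 16*a_inner + 7168*j_inner + 64*k_inner: one variable per factor.
linear = [["a_inner"], ["j_inner"], ["k_inner"]]

# a*a + j: the factor a*a uses the dependent variable a twice.
quadratic = [["a", "a"], ["j"]]
```

With these inputs, `is_fully_strippable(linear, {"a_inner"})` holds, while `is_fully_strippable(quadratic, {"a"})` does not, because a*a counts as two uses.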

For the example base address and offset calculation expressions above, the reconstructed base address expression is:

Conv2d_NCHW_ft_out_transform_write_cach+7168*j_inner+64*k_inner+16*a_inner,

and the offset expression is:

16*a_inner

In this embodiment, the original linear assembly code defines the initial value of a_inner as 0 and its step as 1 (it increases by 1 with each loop iteration). Substituting the initial value into the base address expression and the step into the offset expression, as in step S103, the final base address expression is:

Conv2d_NCHW_ft_out_transform_write_cach+7168*j_inner+64*k_inner+16*0,

and the offset expression is 16*1.

The calculations of the memory access base address and offset are now independent of the current loop. The base address calculation expression and the offset calculation expression are then hoisted out of the loop body to before the loop body, where each is represented by an intermediate variable; finally, the base address calculation expression and the offset calculation expression in the loop body are replaced with the corresponding intermediate variables, so that the base address and the offset no longer need to be calculated inside the loop body. In this embodiment, for example, the intermediate variables AR_EX_76 and OR_EX_76 represent the base address calculation expression and the offset calculation expression, respectively, and they replace those expressions in the loop body.
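The effect of this substitution can be checked with a small simulation. The sketch below is not the patent's linear assembly; it assumes a numeric value for the base symbol (`BASE` is hypothetical) and assumes that the hoisted base address is advanced by the hoisted offset after each access, which is how the AR++[OR] form used in this embodiment is read here.

```python
BASE = 0x1000  # hypothetical numeric value for Conv2d_NCHW_ft_out_transform_write_cach


def addresses_naive(j_inner, k_inner, trip_count):
    """Recompute base + offset inside the loop on every iteration."""
    out = []
    for a_inner in range(trip_count):  # initial value 0, step 1
        offset = 16 * a_inner + 7168 * j_inner + 64 * k_inner
        out.append(BASE + offset)
    return out


def addresses_stripped(j_inner, k_inner, trip_count):
    """Compute the two intermediate variables once, before the loop."""
    ar_ex_76 = BASE + 7168 * j_inner + 64 * k_inner + 16 * 0  # initial value substituted
    or_ex_76 = 16 * 1                                         # step substituted
    out = []
    for _ in range(trip_count):
        out.append(ar_ex_76)   # access at the current address
        ar_ex_76 += or_ex_76   # advance by the constant offset (assumed post-increment)
    return out
```

Both functions yield the same address sequence, but the stripped version performs no multiplication inside the loop; only one addition per iteration remains.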

The method of this embodiment optimizes memory access address calculation at the linear assembly level. If every polynomial factor that contains a dependent variable contains no more than one, optimal optimization can be performed and the memory access address calculation is stripped out of the loop body entirely. Otherwise, only primary optimization can be performed: the polynomial terms that are independent of the dependent variables are analyzed, recombined, and stripped out of the loop body. Both optimal and primary optimization aim to reduce the code in the current loop body as far as possible, lowering the execution time cost and the memory access time and improving the overall execution efficiency of the program.

Needless to say, the steps above do not depend on whether or how optimization is performed when the condition in step S103 (that all factors in the reconstructed offset calculation expression contain only a single dependent variable) does not hold. Referring to FIG. 1, as an optional implementation for that case, step S103 further includes: first adding the original base address calculation expression and offset calculation expression determined in step S103; then splitting the sum into a form in which multiple factors are added and subtracted; and finally merging the factors that do not contain the dependent variable of the present loop, hoisting the merged result to before the loop body, representing it with an intermediate variable, and replacing the corresponding part in the loop body with that intermediate variable.
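A minimal sketch of this partial case follows. The address expression with a 4*a*a factor is a made-up example (it does not appear in the patent), chosen because a is used twice in one factor; `BASE` is likewise a hypothetical numeric base address.

```python
BASE = 0x2000  # hypothetical base address value


def addresses_original(j, k, trip_count):
    """Full expression BASE + 4*a*a + 7168*j + 64*k evaluated inside the loop."""
    return [BASE + 4 * a * a + 7168 * j + 64 * k for a in range(trip_count)]


def addresses_partially_stripped(j, k, trip_count):
    """Merge the factors without the dependent variable and hoist them."""
    invariant = BASE + 7168 * j + 64 * k   # intermediate variable, computed once
    out = []
    for a in range(trip_count):
        out.append(invariant + 4 * a * a)  # only the a-dependent factor stays in the loop
    return out
```

The loop still computes the a-dependent factor on every iteration, so the saving is smaller than in the fully stripped case, but the loop-invariant multiplications and additions are performed only once.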

In this embodiment, hoisting to before the loop body means moving the code from the loop body to the last code block before the loop body, which makes it easy to locate and read.

In this embodiment, using intermediate variables to represent the base address calculation expression and the offset calculation expression in step S103 means using the intermediate variable AR to represent the result of the base address calculation expression and the intermediate variable OR to represent the result of the offset calculation expression. After the expressions in the loop body are replaced with the corresponding intermediate variables in step S103, the method further includes replacing the memory access address addr of the variable with the form AR++[OR], where addr is the original memory access address of the variable, AR is the intermediate variable representing the base address calculation expression, and OR is the intermediate variable representing the offset calculation expression; AR and OR can be regarded as constants within the loop. For example, var_11_59 is replaced with the form AR_EX_76++[OR_EX_76], where the intermediate variables AR_EX_76 and OR_EX_76 represent the base address calculation expression and the offset calculation expression.

As shown in FIG. 2, step S101 in this embodiment is preceded by a step of semantically lowering the C-like language code corresponding to the original linear assembly code to be optimized to the linear assembly language level, so as to obtain the original linear assembly code to be optimized: the C-like language code is first converted into FTIR, and the FTIR is then converted into the original linear assembly code.

As shown in FIG. 2, step S103 in this embodiment is followed by a step of converting the optimized linear assembly code into assembly code, compiling the assembly code to obtain an executable program, and then executing that program.

It should be noted that optimization with the method of this embodiment depends on two conditions: 1) in the index obtained after splitting, no factor that contains a dependent variable contains more than one; 2) the initial value and step of each dependent variable must be a determined value or an expression that can be recorded, and the expression must not contain a dependent variable of the present loop. Otherwise, only the loop-independent part of the calculation can be stripped out, and the loop-dependent calculation must still be performed inside the loop.
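The second condition can be sketched as a set check. The function name and the convention of describing each expression by the set of variable names it uses are assumptions made for illustration; a constant is simply an expression that uses no variables.

```python
def bounds_are_recordable(init_vars, step_vars, loop_dependent_vars):
    """Condition 2: the initial value and step use no dependent variable of this loop.

    init_vars and step_vars are the variable names appearing in the initial
    value and step expressions; an empty collection means a plain constant.
    """
    used = set(init_vars) | set(step_vars)
    return used.isdisjoint(loop_dependent_vars)


# Initial value 0 and step 1, as in this embodiment: both are constants.
constant_case = bounds_are_recordable([], [], {"a_inner"})

# Initial value taken from an outer-loop variable: still acceptable.
outer_case = bounds_are_recordable(["j_inner"], [], {"a_inner"})

# Step that itself depends on a_inner: the condition fails.
self_dependent_case = bounds_are_recordable([], ["a_inner"], {"a_inner"})
```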

To verify the effect of the method of this embodiment, the number of code lines and the execution beats of a certain linear assembly code were compared before and after optimization with the method. The experimental results are shown in Table 1.

Table 1: Comparison of code line count and execution beats before and after optimization with the method of this embodiment.

	Number of lines of code	Execution beats
Before optimization	43	40
After optimization	15	24

Referring to Table 1, the method of this embodiment reduces the number of code lines from 43 to 15, a reduction of about two thirds, and the execution beats from 40 to 24, which can significantly improve the execution efficiency of the code.
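For reference, the relative reductions implied by Table 1 can be worked out directly. The helper below is only an arithmetic illustration using the four values from the table.

```python
def reduction(before, after):
    """Fractional reduction relative to the value before optimization."""
    return (before - after) / before


lines_reduction = reduction(43, 15)  # code lines, from Table 1
beats_reduction = reduction(40, 24)  # execution beats, from Table 1

print(f"lines: {lines_reduction:.1%}, beats: {beats_reduction:.1%}")
# prints "lines: 65.1%, beats: 40.0%"
```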

In summary, the method of this embodiment analyzes and reconstructs the memory access address calculation and achieves optimization through a series of steps: address calculation analysis and recombination, loop stripping, value substitution, and code replacement. All memory access address calculation at the linear assembly level can thereby be stripped out of the core loop, leaving only the essential instructions of the algorithm inside it. This avoids redundant address calculation in the core loop body, shortens the core loop body, and correspondingly shortens the total execution length; after the calculation, the base address and offset values only need to be assigned to the corresponding registers. The resulting code runs faster than ordinary loop linear assembly, with a lower total number of executed instructions and lower execution time cost for the core loop, which improves code execution efficiency with good effect and reliability.

In addition, this embodiment also provides a memory access address calculation optimization device based on loop stripping, which comprises:

In addition, this embodiment also provides a memory access address calculation optimization device based on loop stripping, which comprises a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the memory access address calculation optimization method based on loop stripping. This embodiment further provides a computer-readable storage medium in which a computer program is stored, the computer program being programmed or configured to be executed by a microprocessor to perform the memory access address calculation optimization method based on loop stripping.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. 
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present application, and the protection scope of the present application is not limited to the above examples, and all technical solutions belonging to the concept of the present application belong to the protection scope of the present application. It should be noted that modifications and adaptations to the present application may occur to one skilled in the art without departing from the principles of the present application and are intended to be within the scope of the present application.

Claims

1. A memory access address calculation optimization method based on loop stripping, characterized by comprising the following steps:

2. The optimization method according to claim 1, wherein step S103 further comprises, when all factors in the reconstructed offset calculation expression contain only a single dependent variable: first adding the original base address calculation expression and offset calculation expression determined in step S103; then splitting the sum into a form in which multiple factors are added and subtracted; and finally merging the factors that do not contain the dependent variable of the present loop, hoisting the merged result to before the loop body, representing it with an intermediate variable, and replacing the corresponding part in the loop body with that intermediate variable.

3. The optimization method of claim 2, wherein hoisting to before the loop body means hoisting from the loop body to the last code block before the loop body.

4. The method according to claim 3, wherein using intermediate variables to represent the base address calculation expression and the offset calculation expression in step S103 means that the result of the base address calculation expression is represented by the intermediate variable AR and the result of the offset calculation expression is represented by the intermediate variable OR.

5. The optimization method of claim 4, wherein after the base address calculation expression and the offset calculation expression in the loop body are replaced with the corresponding intermediate variables in step S103, the method further comprises replacing the memory access address addr of the variable with the form AR++[OR], where addr is the original memory access address of the variable, AR is the intermediate variable representing the base address calculation expression, and OR is the intermediate variable representing the offset calculation expression.

6. The optimization method of claim 1, further comprising, before step S101, a step of semantically lowering the C-like language code corresponding to the original linear assembly code to be optimized to the linear assembly language level, so as to obtain the original linear assembly code to be optimized.

7. The memory access address calculation optimization method based on loop stripping as recited in claim 6, further comprising, after step S103, a step of converting the optimized linear assembly code into assembly code and compiling the assembly code to obtain an executable program.

8. A memory access address calculation optimization device based on loop stripping, characterized by comprising:

9. A memory access address calculation optimization device based on loop stripping, comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the memory access address calculation optimization method based on loop stripping of any one of claims 1 to 7.

10. A computer-readable storage medium having a computer program stored therein, wherein the computer program is programmed or configured to be executed by a microprocessor to perform the memory access address calculation optimization method based on loop stripping of any one of claims 1 to 7.