CN115469931B - Instruction optimization method, device, system, equipment and medium of loop program

Info

Publication number: CN115469931B
Application number: CN202211359396.XA
Authority: CN
Inventors: 周浩; 张亚林; 官孝峰; 胡正平; 姚建国
Original assignee: Beijing Suiyuan Intelligent Technology Co ltd
Current assignee: Beijing Suiyuan Intelligent Technology Co ltd
Priority date: 2022-11-02
Filing date: 2022-11-02
Publication date: 2023-03-24
Anticipated expiration: 2042-11-02
Also published as: CN115469931A

Abstract

The invention discloses an instruction optimization method, apparatus, system, device and medium for a loop program. The method comprises: matching an input loop program against a loop pattern adapted to an underlying operation module to obtain a target loop program; performing loop blocking and loop exchange processing on the loop conditions in the target loop program according to the data format corresponding to the underlying operation module to obtain a post-processing loop program; and building a corresponding executable file from the post-processing loop program according to the storage rule corresponding to the underlying operation module, and transferring the executable file to the underlying operation module for execution. With this technical solution, the input loop program can be converted into an executable file suited to the underlying operation module, which improves the running performance of the loop program and shortens the program development cycle.

Description

Instruction optimization method, device, system, equipment and medium of loop program

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a system, a device, and a medium for optimizing instructions of a loop program.

Background

Matrix- and tensor-oriented operations, such as vector multiplication, vector-matrix multiplication, matrix multiplication and convolution, usually involve a large amount of computation, are executed frequently, and run for a long time. Improving the performance of such operations is therefore important.

In the prior art, a programmer must use matrix operation instructions to implement matrix- and tensor-oriented operations. However, a matrix operation instruction usually imposes specific constraints on the data type, the matrix shape, the loading of input data and the storing of output results. To reach the desired computational performance, an experienced programmer is therefore typically needed to hand-craft assembly code or use compiler-supported primitives, so that the optimized instructions take full advantage of the underlying hardware and accelerate the operation.

However, with the prior-art approach the program development cycle is usually long, code reuse and portability are limited, and the developer must be aware of the underlying hardware instructions in advance. When an application is developed in a high-level programming language without awareness of the underlying hardware instructions, and the compiler cannot transparently convert the high-level program into code that uses matrix operation instructions during compilation, the matrix operation functional unit is not used efficiently, the utilization of the underlying hardware drops, and the running performance of the program cannot be effectively improved. How to optimize program instructions while shortening the program development cycle and improving both the utilization of the underlying operation module and the running performance of the program is therefore a problem that urgently needs to be solved.

Disclosure of Invention

The invention provides an instruction optimization method, apparatus, system, device and medium for a loop program, which optimize program instructions while improving the utilization of the underlying operation module, improving the running performance of the loop program and shortening the program development cycle.

According to an aspect of the present invention, there is provided an instruction optimization method of a loop program, including:

matching the input cyclic program with the cyclic mode matched with the bottom layer operation module to obtain a target cyclic program;

according to the data format corresponding to the bottom layer operation module, carrying out cyclic blocking and cyclic exchange processing on the cyclic conditions in the target cyclic program to obtain a post-processing cyclic program;

and establishing a corresponding executable file according to a storage rule corresponding to the bottom layer operation module and a post-processing circulation program, and transmitting the executable file into the bottom layer operation module for execution.

According to another aspect of the present invention, there is provided an instruction optimization apparatus of a loop program, including:

the target cyclic program acquisition module is used for matching the input cyclic program with the cyclic mode matched with the bottom layer operation module to obtain a target cyclic program;

the post-processing cycle program generating module is used for performing cycle blocking and cycle exchange processing on the cycle conditions in the target cycle program according to the data format corresponding to the bottom layer operation module to obtain a post-processing cycle program;

and the executable file generation module is used for establishing a corresponding executable file according to the storage rule corresponding to the bottom layer operation module and the post-processing circulating program, and transmitting the executable file into the bottom layer operation module for execution.

According to another aspect of the present invention, there is provided an instruction optimization system of a loop program, including:

the compiler is used for matching the input loop program against the loop pattern adapted to the underlying operation module to obtain a target loop program; performing loop blocking and loop exchange processing on the loop conditions in the target loop program according to the data format corresponding to the underlying operation module to obtain a post-processing loop program; and building a corresponding executable file from the post-processing loop program according to the storage rule corresponding to the underlying operation module, and transferring the executable file to the underlying operation module for execution;

and the bottom layer operation module is used for receiving and executing the executable file transmitted by the compiler.

According to another aspect of the present invention, there is provided an electronic apparatus including:

the system comprises a compiler, a bottom layer operation module and at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a method of instruction optimization of a cyclic program according to any of the embodiments of the invention.

According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to execute a method of optimizing instructions of a loop program according to any one of the embodiments of the present invention.

According to the technical solution of the embodiments of the invention, the input loop program is matched against the loop pattern adapted to the underlying operation module to obtain a target loop program; loop blocking and loop exchange processing are then performed on the loop conditions in the target loop program according to the data format corresponding to the underlying operation module to obtain a post-processing loop program; finally, a corresponding executable file is built from the post-processing loop program according to the storage rule corresponding to the underlying operation module and is transferred to the underlying operation module for execution. This addresses the long development cycle and the limited code reuse and portability of the prior-art instruction optimization process, and allows program instructions to be optimized while improving the utilization of the underlying operation module and the running performance of the program and shortening the program development cycle.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of an instruction optimization method for a loop program according to a first embodiment of the present invention;

FIG. 2a is a flowchart of an instruction optimization method for a loop program according to a second embodiment of the present invention;

FIG. 2b is a flowchart of an alternative method for optimizing instructions of a loop program according to the second embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an instruction optimization apparatus for a loop program according to a third embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an instruction optimization system of a loop program according to a fourth embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device implementing the instruction optimization method of the loop program according to the embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," "target," and the like in the description and claims of the present invention and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Embodiment One

FIG. 1 is a flowchart of an instruction optimization method for a loop program according to the first embodiment of the present invention. This embodiment is applicable to the case where a computation-logic loop program written in a high-level programming language is automatically converted into equivalent machine code that uses vector-matrix multiplication instructions. The method can be executed by an instruction optimization apparatus for a loop program, which can be implemented in hardware and/or software and can be configured in an electronic device. As shown in FIG. 1, the method includes:

S110, matching the input loop program against the loop pattern adapted to the underlying operation module to obtain a target loop program.

The input loop program may refer to source code of computation logic whose instructions need to be optimized. For example, it may be computation logic written in a high-level programming language such as C or C++, for instance for-loop code.

The underlying operation module may refer to a module that runs the computation logic corresponding to the loop program; illustratively, it may be a central processing unit (CPU). The loop pattern may refer to the format conditions that computation logic must satisfy in order to be executed by the underlying operation module. For example, it may include requirements on the number of loop iterations in the computation logic and on the format of the loop structure. The target loop program may refer to a loop program that satisfies the loop pattern adapted to the underlying operation module.

Therefore, by matching the input loop program against the loop pattern adapted to the underlying operation module, a target loop program that satisfies this loop pattern can be selected without the developer having to be aware of the instruction types supported by the underlying operation module, which provides an effective basis for the subsequent operations.

S120, performing loop blocking and loop exchange processing on the loop conditions in the target loop program according to the data format corresponding to the underlying operation module to obtain a post-processing loop program.

The data format may refer to the data size supported by the underlying operation module. For example, if the computation logic of the underlying operation module is matrix multiplication, the corresponding data format may be the size of the matrix.

Loop blocking may refer to grouping the iterations of a given loop condition according to the data format corresponding to the underlying operation module, so that multiple loop iterations are merged and processed by the underlying hardware in a single operation. Loop exchange processing may refer to exchanging the order of the given loop conditions. The post-processing loop program may be the loop program obtained after loop blocking and loop exchange processing.

Therefore, by performing loop exchange processing on the loop conditions in the target loop program according to the data format corresponding to the underlying operation module, the locality of data access can be improved, so that more arrays are accessed contiguously in memory.

S130, building a corresponding executable file from the post-processing loop program according to the storage rule corresponding to the underlying operation module, and transferring the executable file to the underlying operation module for execution.

The storage rule may refer to a preset rule in the underlying operation module that associates each register with the data it stores. Note that register names typically differ between platforms. The executable file may refer to tensor code that the underlying operation module is capable of executing.

According to the technical solution of this embodiment of the invention, the target loop program is obtained by matching the input loop program against the loop pattern adapted to the underlying operation module; loop blocking and loop exchange processing are then performed on the loop conditions in the target loop program according to the data format corresponding to the underlying operation module to obtain a post-processing loop program; finally, a corresponding executable file is built from the post-processing loop program according to the storage rule corresponding to the underlying operation module and is transferred to the underlying operation module for execution. This addresses the long development cycle and the limited code reuse and portability of the prior-art instruction optimization process, and allows program instructions to be optimized while improving the utilization and execution performance of the underlying operation module and shortening the program development cycle.

Embodiment Two

FIG. 2a is a flowchart of an instruction optimization method for a loop program according to the second embodiment of the present invention. This embodiment refines the foregoing embodiment; specifically, it refines the operation of matching the input loop program against the loop pattern adapted to the underlying operation module to obtain a target loop program, which includes: matching the input loop program against the loop pattern adapted to the underlying operation module, and judging whether the input loop program satisfies the loop pattern; if yes, taking the input loop program as the target loop program; if not, converting the loop body logic contained in the input loop program according to a preset logic conversion principle to obtain a loop program to be selected, matching the loop program to be selected against the loop pattern adapted to the underlying operation module, and, if the loop program to be selected satisfies the loop pattern, taking the loop program to be selected as the target loop program. As shown in FIG. 2a, the method includes:

S210, acquiring the operation instruction corresponding to the underlying operation module, where the operation instruction includes the operation instruction function, the data format and the number of loop nesting levels corresponding to the underlying operation module.

The operation instruction function may refer to the function implemented by the operation logic corresponding to the underlying operation module; illustratively, it may be a vector-matrix multiplication instruction function. The number of loop nesting levels may refer to the number of nested loop levels in a loop program. For example, for a for-loop program, the number of loop nesting levels may be the number of nested for loops.

Specifically, suppose the operation instruction function corresponding to the underlying operation module is a vector-matrix multiplication instruction, the data format is that the vector Vi has length N, the matrix M has shape [N, M] and the vector Vo has length M, and the number of loop nesting levels is 2. Then the for-loop pattern corresponding to the underlying operation module can be as follows:
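By way of illustration only, a two-level loop pattern of this kind might be sketched as below. The sketch is written in Python rather than as a C-style for loop, and the function name `vecmat_pattern`, the parameter name `Mx` (used for the matrix so that it does not clash with the length M) and the sample data are assumptions made for this example.

```python
def vecmat_pattern(Vi, Mx, Vo):
    """Scalar two-level loop equivalent to one vector-matrix multiplication.

    Vi: input vector of length N
    Mx: matrix of shape [N, M], given as a list of N rows
    Vo: output vector of length M, accumulated in place
    """
    N = len(Vi)
    M = len(Vo)
    for j in range(M):        # outer loop over the M output elements
        for k in range(N):    # inner loop over the N-element reduction
            Vo[j] = Vo[j] + Vi[k] * Mx[k][j]
    return Vo


result = vecmat_pattern([1, 2], [[1, 0, 2], [0, 1, 3]], [0, 0, 0])
```

The loop body has the multiply-add form that the vector-matrix multiplication instruction is assumed to implement for a whole N by M block at once.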

S220, establishing a loop structure judgment condition according to the number of loop nesting levels and the standard loop structure format.

The standard loop structure format may refer to the canonical format that a loop structure is expected to follow. The loop structure judgment condition may refer to a condition for judging the format of the loop structure.

Specifically, the loop structure judgment condition may include the following: the number of nested loops in the loop structure is not less than the number of loop nesting levels corresponding to the underlying operation module; the loop structure is a perfectly nested loop, that is, only the innermost loop contains a loop body; and the loop structure contains no branch jumps, that is, no keywords such as if, else if or return.
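The three checks above can be sketched as follows. This is a minimal illustration under stated assumptions, not the compiler's actual implementation: it inspects Python for loops through the standard `ast` module as a stand-in for the C-style loops the method targets, and the function name `check_loop_structure` is hypothetical.

```python
import ast


def check_loop_structure(source, required_depth):
    """Return True if `source` is a perfectly nested for-loop nest of at
    least `required_depth` levels that contains no branch jumps."""
    tree = ast.parse(source)
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.For):
        return False
    # Branch jumps are rejected anywhere inside the loop nest.
    banned = (ast.If, ast.Return, ast.Break, ast.Continue)
    if any(isinstance(node, banned) for node in ast.walk(tree)):
        return False
    depth, loop = 1, tree.body[0]
    # Descend while the body consists of exactly one nested for loop.
    while len(loop.body) == 1 and isinstance(loop.body[0], ast.For):
        loop = loop.body[0]
        depth += 1
    # Perfect nesting: the innermost body must not hide a further loop.
    hidden_loop = any(
        isinstance(node, (ast.For, ast.While))
        for stmt in loop.body
        for node in ast.walk(stmt)
    )
    return depth >= required_depth and not hidden_loop


perfect = (
    "for j in range(4):\n"
    "    for k in range(4):\n"
    "        A[j] = A[j] + B[k] * C[k][j]\n"
)
imperfect = (
    "for j in range(4):\n"
    "    A[j] = 0\n"
    "    for k in range(4):\n"
    "        A[j] = A[j] + B[k] * C[k][j]\n"
)
branching = (
    "for j in range(4):\n"
    "    for k in range(4):\n"
    "        if B[k] > 0:\n"
    "            A[j] = A[j] + B[k] * C[k][j]\n"
)
```

Only the first of the three sample programs passes for a required depth of 2; the second has a statement outside the innermost loop, and the third contains a branch.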

S230, establishing an operation logic judgment condition according to the operation instruction function.

The operation logic judgment condition may be a condition for judging the operation logic contained in the loop structure, for example whether the operation logic contains only multiply-add operations. Specifically, the operation logic judgment condition may be established according to the format A = A + B × C. Note that the condition is satisfied only when the loop structure contains operation logic in the format A = A + B × C and no other operation logic. The number of statements in the format A = A + B × C in the loop structure is not limited in the embodiments of the present invention.

S240, establishing a loop count judgment condition according to the data format.

The loop count judgment condition may be a condition for judging the number of iterations of the loops in the loop structure. For example, it may require that the number of iterations of each loop be an integer multiple of the corresponding data format.

Specifically, if the data format corresponding to loop variable j in the loop structure is M and the data format corresponding to loop variable k is N, the loop count judgment condition is satisfied when the number of iterations of the loop over j is an integer multiple of M and the number of iterations of the loop over k is an integer multiple of N.
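As a minimal sketch of this check, assuming each loop has the form `for (v = start; v < stop; v += step)` with a positive step, the iteration count can be computed and tested for divisibility. The function names and the dictionary layout are assumptions made for this example.

```python
def trip_count(start, stop, step):
    """Number of iterations of `for (v = start; v < stop; v += step)`."""
    if step <= 0 or stop <= start:
        return 0
    return (stop - start + step - 1) // step


def satisfies_loop_count(loops, formats):
    """`loops` maps a loop variable to (start, stop, step); `formats` maps
    the same variable to the data format size it must be a multiple of."""
    for var, size in formats.items():
        count = trip_count(*loops[var])
        if count == 0 or count % size != 0:
            return False
    return True


# j must iterate a multiple of M = 16 times, k a multiple of N = 8 times.
ok = satisfies_loop_count({"j": (0, 64, 1), "k": (0, 32, 1)}, {"j": 16, "k": 8})
bad = satisfies_loop_count({"j": (0, 64, 1), "k": (0, 30, 1)}, {"j": 16, "k": 8})
```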

It is worth noting that, in addition to establishing the loop count judgment condition according to the data format, the memory access behavior in the loop structure can also be judged. For example, if the operation logic in the loop structure is A = A + B × C with A, B and C in array form, the memory access behavior judgment condition may specifically be: the memory spaces accessed by A, B and C do not overlap, that is, there is no data dependence between them; if A, B and C are not constants, their array subscripts are affine polynomials of the loop variables, that is, for loop variables i, j and k, each array subscript can be expressed as a + bi + cj + dk; if A and C are both array accesses, their contiguous data access is controlled by one and the same loop variable j; if B is an array access, its contiguous data access is controlled by a single loop variable k, and k is different from j; if A and B are both array accesses, the array subscript of A does not contain loop variable k and the array subscript of B does not contain loop variable j. Establishing the memory access behavior judgment condition therefore helps to simplify the subsequent loop transformations and, at the same time, ensures that the input data is stored contiguously in memory, so that data can be read and written back efficiently.
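Part of this condition can be sketched as a check on the coefficients of the affine subscripts. The sketch below assumes row-major storage and represents each access by the coefficient of every loop variable in its flattened element offset, so that a coefficient of 1 means the variable drives contiguous access; non-overlap of the arrays is taken as given. The function name and the representation are assumptions made for this example.

```python
def check_access_pattern(A, B, C):
    """A, B and C map loop-variable names to their coefficient in the
    flattened (row-major) element offset of the respective array access."""

    def coeff(access, var):
        return access.get(var, 0)

    a_and_c_contiguous_in_j = coeff(A, "j") == 1 and coeff(C, "j") == 1
    b_contiguous_in_k = coeff(B, "k") == 1
    a_free_of_k = coeff(A, "k") == 0
    b_free_of_j = coeff(B, "j") == 0
    return (a_and_c_contiguous_in_j and b_contiguous_in_k
            and a_free_of_k and b_free_of_j)


M = 16
# A[j], B[k] and C[k][j] with C of shape [N, M]: offset of C is k * M + j.
accepted = check_access_pattern({"j": 1}, {"k": 1}, {"k": M, "j": 1})
# A transposed C, indexed C[j][k], is contiguous in k instead of j.
rejected = check_access_pattern({"j": 1}, {"k": 1}, {"j": M, "k": 1})
```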

S250, combining the loop structure judgment condition, the operation logic judgment condition and the loop count judgment condition to generate the loop pattern adapted to the underlying operation module.

S260, matching the input loop program against the loop pattern adapted to the underlying operation module, and judging whether the input loop program satisfies the loop pattern; if yes, executing S270; if not, executing S280.

S270, taking the input loop program as the target loop program, and executing S290.

Specifically, suppose the operation instruction function corresponding to the underlying operation module is a vector-matrix multiplication instruction, the data format is that array B has length N, array C has shape [N, M] and vector A has length M, and the number of loop nesting levels is two. Then a target loop program that satisfies the loop pattern adapted to the underlying operation module can be as follows:

S280, converting the loop body logic contained in the input loop program according to a preset logic conversion principle to obtain a loop program to be selected, matching the loop program to be selected against the loop pattern adapted to the underlying operation module, and, if the loop program to be selected satisfies the loop pattern, taking it as the target loop program.

The logic conversion principle may refer to a rule for logically converting the loop body logic in the loop structure. The loop body logic may refer to the operation logic contained in the input loop program. The loop program to be selected may refer to the loop program obtained after this preliminary logic conversion.

Specifically, the logic conversion principle may be: if the loop body logic is A = B × C, it is converted to A = 0 + B × C; if the loop body logic is A = A + C, it is converted to A = A + 1 × C; if the loop body logic is A = B × C - A, it is converted to A = (-A) + B × C; if the loop body logic is A = A + B / C, it is converted to A = A + B × (1 / C); if the loop body logic is A = B / C, it is converted to A = 0 + B × (1 / C); if the loop body logic is A = C - A, it is converted to A = (-A) + 1 × C.

Therefore, by adding a small number of extra operations or operands to convert the loop body logic contained in the input loop program, high-performance target loop programs can be generated for a wider range of input programs, thereby accelerating them.
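The conversion rules can be sketched as a table that pairs each original loop body expression with its operands in the multiply-add form "addend + left × right". This is an illustrative sketch only: the table name `RULES` and the helper `multiply_add` are hypothetical, and the third rule is read here as A = B × C - A, which is an assumption about the intended form.

```python
# Each entry: (original expression, operands of "addend + left * right").
RULES = {
    "A = B * C":     (lambda A, B, C: B * C,     lambda A, B, C: (0, B, C)),
    "A = A + C":     (lambda A, B, C: A + C,     lambda A, B, C: (A, 1, C)),
    "A = B * C - A": (lambda A, B, C: B * C - A, lambda A, B, C: (-A, B, C)),
    "A = A + B / C": (lambda A, B, C: A + B / C, lambda A, B, C: (A, B, 1 / C)),
    "A = B / C":     (lambda A, B, C: B / C,     lambda A, B, C: (0, B, 1 / C)),
    "A = C - A":     (lambda A, B, C: C - A,     lambda A, B, C: (-A, 1, C)),
}


def multiply_add(addend, left, right):
    """The single operation form the target instruction is assumed to support."""
    return addend + left * right


def rule_is_equivalent(name, A, B, C, tol=1e-12):
    """Check numerically that a rewritten rule matches its original form."""
    original, rewritten = RULES[name]
    return abs(original(A, B, C) - multiply_add(*rewritten(A, B, C))) <= tol


all_equivalent = all(rule_is_equivalent(name, 3.0, 4.0, 2.0) for name in RULES)
```

The extra operands introduced by the rules (0, 1, the negated A and the reciprocal of C) are the additional operations for which code must also be generated.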

S290, multiplying the loop step corresponding to the inner loop variable in the loop condition by the first data format corresponding to the underlying operation module to generate a first loop condition.

The inner loop variable may refer to the innermost loop variable in the loop conditions. The first data format may refer to the data format in the underlying operation module that corresponds to the inner loop variable. The first loop condition may refer to the loop condition generated after the loop step of the inner loop variable is processed according to the first data format.

Specifically, if the target loop program is as follows:

Then the first data format may be N, and the corresponding first loop condition may be:
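A minimal sketch of this step, in Python rather than C and with hypothetical function names, is given below. It assumes a target loop of the form A[j] = A[j] + B[k] × C[k][j] with j as the outer and k as the inner loop variable, and a loop count K that is an integer multiple of N.

```python
def target_loop(A, B, C, J, K):
    """Assumed target loop program: both loop steps are 1."""
    for j in range(0, J, 1):
        for k in range(0, K, 1):
            A[j] = A[j] + B[k] * C[k][j]
    return A


def inner_blocked(A, B, C, J, K, N):
    """First loop condition: the inner loop step 1 is multiplied by N."""
    for j in range(0, J, 1):
        for k in range(0, K, N):
            # One block of N consecutive k values, which the hardware is
            # assumed to consume in a single operation.
            A[j] = A[j] + sum(B[k + t] * C[k + t][j] for t in range(N))
    return A


J, K, N = 4, 6, 3
B = [1, 2, 3, 4, 5, 6]
C = [[k + j for j in range(J)] for k in range(K)]
reference = target_loop([0] * J, B, C, J, K)
blocked = inner_blocked([0] * J, B, C, J, K, N)
```

With integer data the two versions produce identical results, since blocking only regroups the additions.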


In an optional embodiment, before the loop step corresponding to the inner loop variable in the loop condition is multiplied by the first data format corresponding to the underlying operation module to generate the first loop condition, the method further includes: judging whether the loop step corresponding to the inner loop variable in the loop condition is the standard step value, and, if it is a non-standard step value, normalizing the loop step corresponding to the inner loop variable.

The standard step value may refer to the standard value of a single step; for example, it may be 1. A non-standard step value may be any value other than the standard step value. Specifically, before the loop step corresponding to the inner loop variable is multiplied by the first data format to generate the first loop condition, the loop step is checked; if it is not 1, it is normalized so that its value becomes 1. Checking and normalizing the loop step of the inner loop variable in this way provides an effective basis for the subsequent operations.
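One common way to normalize a loop step, shown here as a sketch under the assumption of a positive step and a loop of the form `for (v = start; v < stop; v += step)`, is to iterate over a new counter with step 1 and recover the original variable from it. The function name `normalize_step` is hypothetical, and the source does not state which normalization the method actually uses.

```python
def normalize_step(start, stop, step):
    """Rewrite `for (v = start; v < stop; v += step)` as a loop over
    v2 = 0, 1, ..., count - 1 with v recovered as start + v2 * step."""
    count = max(0, (stop - start + step - 1) // step)

    def to_original(v2):
        return start + v2 * step

    return count, to_original


count, to_original = normalize_step(2, 11, 3)
visited = [to_original(v2) for v2 in range(0, count, 1)]
```

After normalization the subscripts in the loop body are rewritten in terms of the new counter, which keeps them affine in the loop variable.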

S2100, multiplying the loop step corresponding to the outer loop variable in the loop condition by the second data format corresponding to the underlying operation module to generate a second loop condition.

The outer loop variable may refer to the second-innermost loop variable in the loop conditions, that is, the loop variable immediately enclosing the inner loop variable. The second data format may refer to the data format in the underlying operation module that corresponds to the outer loop variable. The second loop condition is the loop condition generated after the loop step of the outer loop variable is processed according to the second data format.

Specifically, if the target loop program is as follows:

Then the second data format may be M, and the corresponding second loop condition may be:
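As an illustrative sketch with both loop conditions applied, assuming the loop body A[j] = A[j] + B[k] × C[k][j] and loop counts J and K that are integer multiples of M and N respectively, each iteration of the blocked nest covers one M by N block. The function names are hypothetical.

```python
def scalar_reference(A, B, C, J, K):
    """Assumed target loop program before blocking."""
    for j in range(J):
        for k in range(K):
            A[j] = A[j] + B[k] * C[k][j]
    return A


def fully_blocked(A, B, C, J, K, M, N):
    for j in range(0, J, M):        # second loop condition: step 1 * M
        for k in range(0, K, N):    # first loop condition: step 1 * N
            # One M x N block, intended for a single hardware operation.
            for jj in range(M):
                A[j + jj] = A[j + jj] + sum(
                    B[k + t] * C[k + t][j + jj] for t in range(N)
                )
    return A


J, K, M, N = 4, 6, 2, 3
B = [1, 2, 3, 4, 5, 6]
C = [[k + j for j in range(J)] for k in range(K)]
expected = scalar_reference([0] * J, B, C, J, K)
blocked = fully_blocked([0] * J, B, C, J, K, M, N)
```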


In an optional embodiment, before the loop step corresponding to the outer loop variable in the loop condition is multiplied by the second data format corresponding to the underlying operation module to generate the second loop condition, the method further includes: judging whether the loop step corresponding to the outer loop variable in the loop condition is the standard step value, and, if it is a non-standard step value, normalizing the loop step corresponding to the outer loop variable.

Specifically, before the loop step corresponding to the outer loop variable is multiplied by the second data format to generate the second loop condition, the loop step is checked; if it is not 1, it is normalized so that its value becomes 1. Checking and normalizing the loop step of the outer loop variable in this way provides an effective basis for the subsequent operations.

S2110, transforming the positions of the first loop condition and the second loop condition according to the existence condition of the inner loop variable and the outer loop variable to obtain the post-processing loop program.

The existence condition of the inner loop variable and the outer loop variable may refer to whether the data accesses governed by these two loop variables are contiguous. When few arrays are accessed contiguously under the inner and outer loop variables, for example fewer than half of all arrays, the positions of the two loop conditions can be exchanged to increase the number of contiguously accessed arrays. The post-processing loop program may be the loop program obtained after the positions of the first loop condition and the second loop condition are transformed.

Specifically, if the target loop program is as follows:

As can be seen from the above, if only array B is accessed contiguously in the target loop program while array A and array C are accessed non-contiguously, then, to increase the number of contiguously accessed arrays, the positions of the first loop condition and the second loop condition can be exchanged, and the resulting post-processing loop program may have the following form:

It should be noted that, if many arrays are already accessed contiguously by the inner loop variable and the outer loop variable in the target loop program, for example more than half of the total number of arrays, the loop program obtained after generating the first loop condition and the second loop condition may be used directly as the post-processing loop program, and loop interchange does not need to be performed.
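The interchange decision can be sketched as follows. This is only an illustration under stated assumptions: the loop body is taken to be `A[i] += B[j] * C[j][i]` with row-major storage (chosen to be consistent with the vector matrix multiplication example, not taken from the listing of this embodiment), an array counts as contiguously accessed when its fastest-varying subscript is the innermost loop variable, and the function names are hypothetical.

```python
def contiguous_arrays(last_index_of, inner_var):
    """Arrays whose fastest-varying subscript is the innermost loop variable."""
    return sorted(name for name, var in last_index_of.items() if var == inner_var)


def choose_loop_order(last_index_of, outer_var, inner_var):
    """Return (outer, inner); interchange when fewer than half of the arrays are contiguous."""
    contiguous = contiguous_arrays(last_index_of, inner_var)
    if 2 * len(contiguous) < len(last_index_of):
        return (inner_var, outer_var)  # loops interchanged
    return (outer_var, inner_var)      # already good enough, keep the order


def accumulate(A, B, C, order):
    """Compute A[i] += B[j] * C[j][i] with the given (outer, inner) loop order."""
    out = list(A)
    if order == ("i", "j"):
        for i in range(len(A)):
            for j in range(len(B)):
                out[i] += B[j] * C[j][i]
    else:
        for j in range(len(B)):
            for i in range(len(A)):
                out[i] += B[j] * C[j][i]
    return out
```

For the assumed body, `{"A": "i", "B": "j", "C": "i"}` with `j` innermost leaves only B contiguous, so the sketch interchanges the loops; with `i` innermost, A and C are contiguous and the order is kept. Both orders produce the same result because each iteration only accumulates into `A[i]`.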

S2120, obtaining the register name corresponding to each array in the post-processing loop program according to the storage rule corresponding to the bottom-layer operation module.

The registers corresponding to the arrays are independent of each other.

S2130, establishing a corresponding executable file according to the post-processing loop program and the register corresponding to each array, and transmitting the executable file to the bottom-layer operation module for execution.

Specifically, take the case where the operation instruction function is a vector matrix multiplication instruction function as an example. Each code statement in the inner loop body of the post-processing loop program, e.g. A = A + B × C, can then be processed as follows: read M contiguous data elements of array A from memory into a vector register Rvo; read N contiguous data elements of array B from memory into a vector register Rvi; read N groups of M contiguous data elements of array C from memory into a matrix register Rm; perform the calculation with the vector matrix multiplication instruction, using Rvo, Rvi and Rm as operands; and write the calculation result held in the vector register Rvo back to memory. In this way, code that uses the vector matrix multiplication instruction can be generated.
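The load, compute and write-back sequence above can be simulated with a short sketch. It assumes that Python lists stand in for the registers Rvo, Rvi and Rm, that `vmm` is a hypothetical stand-in for the hardware vector matrix multiplication instruction, that array C is stored row-major with one row per element of B, and that the array lengths are whole multiples of M and N.

```python
def vmm(rvo, rvi, rm):
    """Hypothetical stand-in for the instruction: rvo[m] += sum over n of rvi[n] * rm[n][m]."""
    return [rvo[m] + sum(rvi[n] * rm[n][m] for n in range(len(rvi)))
            for m in range(len(rvo))]


def run_tiled(A, B, C, M, N):
    """Compute A = A + B x C, issuing one vmm per M-by-N tile.

    len(A) and len(B) are assumed to be multiples of M and N respectively;
    C has len(B) rows of len(A) elements.
    """
    A = list(A)  # work on a copy, standing in for memory
    for j0 in range(0, len(B), N):        # blocked loop with step N
        for i0 in range(0, len(A), M):    # blocked loop with step M
            rvo = A[i0:i0 + M]                                  # M contiguous elements of A
            rvi = B[j0:j0 + N]                                  # N contiguous elements of B
            rm = [C[j][i0:i0 + M] for j in range(j0, j0 + N)]   # N groups of M elements of C
            rvo = vmm(rvo, rvi, rm)                             # one instruction per tile
            A[i0:i0 + M] = rvo                                  # write Rvo back to memory
    return A
```

In generated code the three loads, the instruction and the store would each map to a platform-specific instruction operating on the register names obtained in S2120; the sketch only mirrors their order.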

It should be noted that register names differ between hardware platforms. In addition, if the post-processing loop program was obtained through the preset logic transformation principle, corresponding code can be generated for the newly added extra operations or operands to ensure the completeness of the code.

According to the technical scheme of this embodiment of the invention, a target loop program is obtained by matching the input loop program against the loop pattern matched with the bottom-layer operation module; loop blocking and loop interchange are then performed on the loop conditions in the target loop program according to the data format corresponding to the bottom-layer operation module, to obtain a post-processing loop program; finally, a corresponding executable file is established from the post-processing loop program according to the storage rule corresponding to the bottom-layer operation module and is transmitted to the bottom-layer operation module for execution. This solves the problems that the program instruction optimization process takes a long time and that code reuse and portability are limited, and allows program instructions to be optimized while improving the utilization of the bottom-layer operation module, improving the running performance of the program and shortening the program development cycle.

FIG. 2b is a flowchart of an alternative instruction optimization method of a loop program according to the second embodiment of the present invention. Specifically, an operation instruction corresponding to the bottom-layer operation module is obtained. Taking the case where the operation instruction function in the operation instruction is a vector matrix multiplication instruction function as an example, a for-loop pattern equivalent to the vector matrix multiplication instruction function (i.e., the loop pattern matched with the bottom-layer operation module) is generated according to the operation instruction. A for-loop program written in a high-level programming language (i.e., the input loop program) is then matched against this for-loop pattern, which completes the vector matrix multiplication pattern recognition. A loop program whose vector matrix multiplication pattern is successfully recognized is taken as the target loop program. If the pattern recognition fails, the loop body logic contained in the input loop program is transformed according to the preset logic transformation principle to obtain a candidate loop program, and the candidate loop program is matched against the loop pattern matched with the bottom-layer operation module; if the candidate loop program satisfies the loop pattern, it is taken as the target loop program, and if it still does not satisfy the loop pattern, unoptimized code is output. Further, the loop step corresponding to the inner loop variable in the loop condition of the target loop program is multiplied by the first data format corresponding to the bottom-layer operation module to generate a first loop condition, and the loop step corresponding to the outer loop variable in the loop condition of the target loop program is multiplied by the second data format corresponding to the bottom-layer operation module to generate a second loop condition. Position transformation is performed on the first loop condition and the second loop condition to obtain a post-processing loop program, which completes the transformation of the for-loop structure. Finally, the register name corresponding to each array in the post-processing loop program is obtained according to the storage rule corresponding to the bottom-layer operation module, core code of the for-loop structure is generated from the post-processing loop program and the register corresponding to each array so as to establish a corresponding executable file, and the executable file is transmitted to the bottom-layer operation module, which executes the code optimized with the vector matrix multiplication instruction.
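The generation of the first and second loop conditions in this flow can be sketched as below. The sketch assumes that each data format reduces to an integer block size and that loop steps have already been standardized to 1; `LoopCondition`, `block` and `transform` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LoopCondition:
    """Hypothetical model of `for (var = start; var < stop; var += step)`."""
    var: str
    start: int
    stop: int
    step: int = 1

    def render(self):
        return (f"for ({self.var} = {self.start}; "
                f"{self.var} < {self.stop}; {self.var} += {self.step})")


def block(cond, data_format):
    """Loop blocking: multiply the unit loop step by the data-format size."""
    if cond.step != 1:
        raise ValueError("standardize the loop step to 1 before blocking")
    return replace(cond, step=cond.step * data_format)


def transform(outer, inner, first_format, second_format, interchange):
    """Return the blocked loop conditions, outermost first."""
    first = block(inner, first_format)     # first loop condition (from the inner loop)
    second = block(outer, second_format)   # second loop condition (from the outer loop)
    return [first, second] if interchange else [second, first]
```

For example, an outer loop over `i` in `[0, 64)` and an inner loop over `j` in `[0, 32)` with block sizes 8 and 16 yield `j += 8` and `i += 16`; whether the `j` loop ends up outermost depends on the interchange decision.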

Example three

Fig. 3 is a schematic structural diagram of an instruction optimization apparatus of a loop program according to a third embodiment of the present invention. As shown in Fig. 3, the apparatus includes: a target loop program acquisition module 310, a post-processing loop program generation module 320 and an executable file generation module 330;

the target loop program acquisition module 310 is configured to match an input loop program with the loop pattern matched with the bottom-layer operation module to obtain a target loop program;

the post-processing loop program generation module 320 is configured to perform loop blocking and loop interchange on the loop conditions in the target loop program according to the data format corresponding to the bottom-layer operation module, to obtain a post-processing loop program;

the executable file generation module 330 is configured to establish a corresponding executable file from the post-processing loop program according to the storage rule corresponding to the bottom-layer operation module, and to transmit the executable file to the bottom-layer operation module for execution.

According to the technical scheme of this embodiment of the invention, a target loop program is obtained by matching the input loop program against the loop pattern matched with the bottom-layer operation module; loop blocking and loop interchange are then performed on the loop conditions in the target loop program according to the data format corresponding to the bottom-layer operation module, to obtain a post-processing loop program; finally, a corresponding executable file is established from the post-processing loop program according to the storage rule corresponding to the bottom-layer operation module and is transmitted to the bottom-layer operation module for execution. This solves the problems that the program instruction optimization process takes a long time and that code reuse and portability are limited, and allows program instructions to be optimized while improving the utilization of the bottom-layer operation module, improving the running performance of the program and shortening the program development cycle.

Optionally, the instruction optimization apparatus of the loop program may further include a loop pattern generation module, configured to, before the input loop program is matched with the loop pattern matched with the bottom-layer operation module: acquire an operation instruction corresponding to the bottom-layer operation module, where the operation instruction includes an operation instruction function, a data format and a number of loop nesting layers corresponding to the bottom-layer operation module; establish a loop structure judgment condition according to the number of loop nesting layers and a standard loop structure format; establish an operation logic judgment condition according to the operation instruction function; establish a loop count judgment condition according to the data format; and combine the loop structure judgment condition, the operation logic judgment condition and the loop count judgment condition to generate the loop pattern matched with the bottom-layer operation module.
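How the three judgment conditions might be combined into one loop pattern can be sketched as follows. Everything here is an assumption made for illustration: an input loop program is summarized by a hypothetical `LoopNest` record, the operation logic is compared as a normalized statement string, and the loop count condition is taken to mean that each loop count is a whole multiple of the matching data-format size.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LoopNest:
    """Hypothetical summary of an input loop program."""
    depth: int                     # number of nested for loops
    body: str                      # normalized loop-body statement
    trip_counts: Tuple[int, ...]   # iterations of each loop, outermost first


def make_loop_pattern(nesting_layers, instruction_body, data_format):
    """Build a predicate that combines the three judgment conditions."""

    def structure_ok(nest):
        # Loop structure judgment condition: nesting depth must match.
        return nest.depth == nesting_layers and len(nest.trip_counts) == nest.depth

    def logic_ok(nest):
        # Operation logic judgment condition: body must equal the instruction's logic.
        return nest.body == instruction_body

    def count_ok(nest):
        # Loop count judgment condition (assumed): counts are multiples of the block sizes.
        return len(nest.trip_counts) == len(data_format) and all(
            count % size == 0 for count, size in zip(nest.trip_counts, data_format))

    return lambda nest: structure_ok(nest) and logic_ok(nest) and count_ok(nest)
```

A pattern built with `make_loop_pattern(2, "A[i] += B[j] * C[j][i]", (4, 8))` accepts a two-level nest with loop counts `(16, 32)` and rejects one with counts `(16, 30)` or with three nesting levels.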

Optionally, the target loop program acquisition module 310 may be specifically configured to: match an input loop program with the loop pattern matched with the bottom-layer operation module, and determine whether the input loop program satisfies the loop pattern; if yes, take the input loop program as the target loop program; if not, transform the loop body logic contained in the input loop program according to the preset logic transformation principle to obtain a candidate loop program, match the candidate loop program with the loop pattern matched with the bottom-layer operation module, and if the candidate loop program satisfies the loop pattern, take the candidate loop program as the target loop program.
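The match-then-transform fallback can be sketched as below. The program representation (a dictionary with a `pre` list and a `body` string) and the transformation `add_zero_accumulator` are hypothetical; the latter is merely one conceivable example of adding an extra operation so that `A = B * C` fits an accumulate-style pattern.

```python
PATTERN_BODY = "A = A + B * C"  # assumed loop-body logic of the instruction


def matches(program):
    """Stand-in for matching against the loop pattern (body comparison only)."""
    return program["body"] == PATTERN_BODY


def add_zero_accumulator(program):
    """Hypothetical rule: rewrite `A = B * C` as `A = 0` followed by `A = A + B * C`."""
    if program["body"] != "A = B * C":
        return program
    return {"pre": program["pre"] + ["A = 0"], "body": PATTERN_BODY}


def select_target(program, matches, transform):
    """Return the target loop program, or None when unoptimized code should be output."""
    if matches(program):
        return program                 # input already satisfies the loop pattern
    candidate = transform(program)     # candidate loop program
    return candidate if matches(candidate) else None
```

The extra statement recorded in `pre` corresponds to the newly added operation for which code must also be generated later, so that the transformed program remains equivalent to the input.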

Optionally, the post-processing loop program generation module 320 may be specifically configured to: multiply the loop step corresponding to the inner loop variable in the loop condition by the first data format corresponding to the bottom-layer operation module to generate a first loop condition; multiply the loop step corresponding to the outer loop variable in the loop condition by the second data format corresponding to the bottom-layer operation module to generate a second loop condition; and perform position transformation on the first loop condition and the second loop condition according to the existence condition of the inner loop variable and the outer loop variable.

Optionally, the instruction optimization apparatus of the loop program may further include: a first loop step processing module, configured to, before the loop step corresponding to the inner loop variable in the loop condition is multiplied by the first data format corresponding to the bottom-layer operation module to generate the first loop condition, determine whether the loop step corresponding to the inner loop variable in the loop condition is a standard step value, and if it is a non-standard step value, standardize the loop step corresponding to the inner loop variable;

the instruction optimization apparatus of the loop program may further include: a second loop step processing module, configured to, before the loop step corresponding to the outer loop variable in the loop condition is multiplied by the second data format corresponding to the bottom-layer operation module to generate the second loop condition, determine whether the loop step corresponding to the outer loop variable in the loop condition is a standard step value, and if it is a non-standard step value, standardize the loop step corresponding to the outer loop variable.

Optionally, the executable file generation module 330 may be specifically configured to: acquire the register name corresponding to each array in the post-processing loop program according to the storage rule corresponding to the bottom-layer operation module, the registers corresponding to the arrays being independent of each other; and establish a corresponding executable file according to the post-processing loop program and the register corresponding to each array.

The instruction optimization device for the loop program provided by the embodiment of the invention can execute the instruction optimization method for the loop program provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.

Example four

Fig. 4 is a schematic structural diagram of an instruction optimization system of a loop program according to a fourth embodiment of the present invention. As shown in fig. 4, the system includes: a compiler 410 and a bottom-level operation module 420;

the compiler 410 is configured to match an input loop program with the loop pattern matched with the bottom-layer operation module to obtain a target loop program; perform loop blocking and loop interchange on the loop conditions in the target loop program according to the data format corresponding to the bottom-layer operation module, to obtain a post-processing loop program; and establish a corresponding executable file from the post-processing loop program according to the storage rule corresponding to the bottom-layer operation module, and transmit the executable file to the bottom-layer operation module for execution;

the bottom layer operation module 420 is configured to receive and execute the executable file transmitted by the compiler.

Example five

FIG. 5 illustrates a schematic diagram of an electronic device 510 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 5, the electronic device 510 includes a compiler 5110, an underlying operational module 5120, at least one processor 520, and a memory communicatively coupled to the at least one processor 520, such as a Read Only Memory (ROM) 530, a Random Access Memory (RAM) 540, and the like, wherein the memory stores computer programs executable by the at least one processor, and the processor 520 may perform various suitable actions and processes according to the computer programs stored in the Read Only Memory (ROM) 530 or the computer programs loaded from the storage unit 590 into the Random Access Memory (RAM) 540. In the RAM540, various programs and data required for the operation of the electronic device 510 can also be stored. The processor 520, the ROM 530, the compiler 5110, the underlying operational module 5120, and the RAM540 are connected to each other by a bus 550. An input/output (I/O) interface 560 is also connected to bus 550.

A number of components in the electronic device 510 are connected to the I/O interface 560, including: an input unit 570 such as a keyboard, a mouse, and the like; an output unit 580 such as various types of displays, speakers, and the like; a storage unit 590 such as a magnetic disk, an optical disk, or the like; and a communication unit 5100 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 5100 allows the electronic device 510 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.

Processor 520 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of processor 520 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 520 performs the various methods and processes described above, such as an instruction optimization method of a loop program.

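The method comprises the following steps:

matching the input loop program with the loop pattern matched with the bottom-layer operation module to obtain a target loop program;

performing loop blocking and loop interchange on the loop conditions in the target loop program according to the data format corresponding to the bottom-layer operation module, to obtain a post-processing loop program;

establishing a corresponding executable file from the post-processing loop program according to the storage rule corresponding to the bottom-layer operation module, and transmitting the executable file to the bottom-layer operation module for execution.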

In some embodiments, the instruction optimization method of the loop program may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as the storage unit 590. In some embodiments, part or all of the computer program can be loaded and/or installed onto the electronic device 510 via the ROM 530 and/or the communication unit 5100. When the computer program is loaded into RAM540 and executed by processor 520, one or more steps of the instruction optimization method of the loop program described above may be performed. Alternatively, in other embodiments, the processor 520 may be configured by any other suitable means (e.g., by means of firmware) to execute an instruction optimization method of a loop program.

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.

It should be understood that the flows shown above may be used in various forms, with steps reordered, added or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solution of the present invention can be achieved.

The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for instruction optimization of a loop program, comprising:

wherein the loop pattern comprises: a loop structure judgment condition matched with the number of loop nesting layers corresponding to the bottom-layer operation module, an operation logic judgment condition corresponding to the operation instruction function of the bottom-layer operation module, and a loop count judgment condition corresponding to the data format of the bottom-layer operation module;

establishing a corresponding executable file according to a storage rule corresponding to the bottom-layer operation module and a post-processing loop program, and transmitting the executable file to the bottom-layer operation module for execution;

the matching of the input loop program with the loop pattern matched with the bottom-layer operation module to obtain the target loop program comprises:

matching an input loop program with the loop pattern matched with the bottom-layer operation module, and determining whether the input loop program satisfies the loop pattern matched with the bottom-layer operation module;

if yes, taking the input loop program as the target loop program;

if not, transforming the loop body logic contained in the input loop program according to a preset logic transformation principle to obtain a candidate loop program, matching the candidate loop program with the loop pattern matched with the bottom-layer operation module, and if the candidate loop program satisfies the loop pattern matched with the bottom-layer operation module, taking the candidate loop program as the target loop program;

wherein the logic transformation principle transforms the loop body logic contained in the input loop program by adding an appropriate amount of additional operations or operands.

2. The method of claim 1, further comprising, before matching the input loop program with the loop pattern matched with the bottom-layer operation module:

acquiring an operation instruction corresponding to the bottom-layer operation module; the operation instruction comprises an operation instruction function, a data format and a number of loop nesting layers corresponding to the bottom-layer operation module;

establishing a loop structure judgment condition according to the number of loop nesting layers and a standard loop structure format;

establishing an operation logic judgment condition according to the operation instruction function;

establishing a loop count judgment condition according to the data format;

and combining the loop structure judgment condition, the operation logic judgment condition and the loop count judgment condition to generate the loop pattern matched with the bottom-layer operation module.

3. The method of claim 1, wherein the performing loop blocking and loop interchange on the loop conditions in the target loop program according to the data format corresponding to the bottom-layer operation module comprises:

multiplying the loop step corresponding to the inner loop variable in the loop condition by the first data format corresponding to the bottom-layer operation module to generate a first loop condition;

multiplying the loop step corresponding to the outer loop variable in the loop condition by the second data format corresponding to the bottom-layer operation module to generate a second loop condition;

and performing position transformation on the first loop condition and the second loop condition according to the existence condition of the inner loop variable and the outer loop variable.

4. The method of claim 3, wherein before multiplying the loop step corresponding to the inner loop variable in the loop condition by the first data format corresponding to the bottom-layer operation module to generate the first loop condition, the method further comprises:

determining whether the loop step corresponding to the inner loop variable in the loop condition is a standard step value, and if the loop step corresponding to the inner loop variable is a non-standard step value, standardizing the loop step corresponding to the inner loop variable;

before multiplying the loop step corresponding to the outer loop variable in the loop condition by the second data format corresponding to the bottom-layer operation module to generate the second loop condition, the method further comprises:

determining whether the loop step corresponding to the outer loop variable in the loop condition is a standard step value, and if the loop step corresponding to the outer loop variable is a non-standard step value, standardizing the loop step corresponding to the outer loop variable.

5. The method of claim 1, wherein the establishing a corresponding executable file according to the storage rule corresponding to the bottom-layer operation module and the post-processing loop program comprises:

acquiring the register name corresponding to each array in the post-processing loop program according to the storage rule corresponding to the bottom-layer operation module; the registers corresponding to the arrays are independent of each other;

and establishing a corresponding executable file according to the post-processing loop program and the register corresponding to each array.

6. An instruction optimization apparatus for a loop program, comprising:

the executable file generation module is configured to establish a corresponding executable file according to the storage rule corresponding to the bottom-layer operation module and the post-processing loop program, and to transmit the executable file to the bottom-layer operation module for execution;

the target loop program acquisition module is specifically configured to:

if yes, take the input loop program as the target loop program;

wherein the logic transformation principle transforms the loop body logic contained in the input loop program by adding an appropriate amount of additional operations or operands.

7. An instruction optimization system for a loop program, comprising:

if yes, taking the input loop program as the target loop program;

wherein the logic transformation principle transforms the loop body logic contained in the input loop program by adding an appropriate amount of additional operations or operands;

8. An electronic device, characterized in that the electronic device comprises:

a memory communicatively coupled to the at least one processor; wherein

the memory stores a computer program executable by the at least one processor, to enable the at least one processor to perform the instruction optimization method of a loop program as claimed in any one of claims 1 to 5.

9. A computer-readable storage medium, characterized in that it stores computer instructions which, when executed by a processor, cause the processor to implement the instruction optimization method of a loop program according to any one of claims 1 to 5.