CN116450138A - Code optimization generation method and system oriented to SIMD and VLIW architecture - Google Patents


Info

Publication number
CN116450138A
Authority
CN
China
Prior art keywords
simd
program
instruction
cyclic
cost
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310223369.8A
Other languages
Chinese (zh)
Inventor
陈照云
赵宵磊
时洋
文梅
扈啸
王耀华
张春元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202310223369.8A priority Critical patent/CN116450138A/en
Publication of CN116450138A publication Critical patent/CN116450138A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/443 Optimisation
    • G06F8/445 Exploiting fine grain parallelism, i.e. parallelism at instruction level
    • G06F8/4452 Software pipelining
    • G06F8/447 Target code generation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a code optimization generation method and system for SIMD and VLIW architectures. The method comprises the following steps: S1, blocking an input original loop program to obtain a vector DSL program; S2, vectorizing the vector DSL program to obtain an inline assembly program; and S3, performing instruction-level code optimization on the inline assembly program to obtain assembly code. The invention can automatically generate kernel code that runs efficiently on the SIMD+VLIW architecture for kernel programs of various scales, and can fully and efficiently exploit the characteristics of the SIMD+VLIW architecture for optimization, thereby lightening the developers' burden, greatly improving the execution efficiency of the kernel code, and striking a balance between programmability and performance.

Description

Code optimization generation method and system oriented to SIMD and VLIW architecture
Technical Field
The invention relates to the technical field of automatic generation of DSP program code, and in particular to a code optimization generation method and system for SIMD and VLIW architectures.
Background
A DSP (digital signal processor) typically combines VLIW (very long instruction word) and SIMD (single instruction multiple data) architectures, bringing computational performance, real-time processing and power efficiency to various embedded applications. A SIMD+VLIW architecture can further improve overall performance through the cooperation of scalar and vector units.
The ultimate performance of a DSP depends on highly optimized kernels and libraries, which are generated in one of two ways: 1. written by hand in assembly code by a developer; 2. generated by a compiler from C code. The manual approach has poor portability and high complexity. Meanwhile, a general-purpose compiler, lacking guidance from the specific architecture and its intrinsic instructions, struggles to balance programmability and efficiency and cannot release the potential of the SIMD+VLIW architecture. Therefore, there is a need for a high-performance automatic code optimization and generation method for the SIMD+VLIW architecture to free programmers from the burden of kernel optimization.
Disclosure of Invention
The technical problem to be solved by the invention: aiming at the problems in the prior art, the invention provides a code optimization generation method and system for SIMD and VLIW architectures, which can automatically generate kernel code that runs efficiently on these architectures for kernel programs of various scales, fully and efficiently exploit the characteristics of the SIMD+VLIW architecture for optimization, lighten the developers' burden, greatly improve the execution efficiency of the kernel code, and strike a balance between programmability and performance.
In order to solve the technical problems, the invention adopts the following technical scheme:
A code optimization generation method oriented to SIMD and VLIW architectures, comprising:
S1, blocking an input original loop program to obtain a vector DSL program;
S2, vectorizing the vector DSL program to obtain an inline assembly program;
and S3, performing instruction-level code optimization on the inline assembly program to obtain assembly code.
Optionally, step S1 includes:
S1.1, initializing the block size, the loop dimensions, the data reuse ratio of each loop dimension, the cost function and the total loop cost of the input original loop program;
S1.2, judging whether the current value of the total cost has not converged; if so, jumping to S1.3; otherwise, performing loop-nest splitting on the original loop program according to the current block size, converting it into a vector DSL program using a conversion tool, and jumping to S2;
S1.3, updating the current data reuse ratio according to the loop dimensions, and determining the current block size according to the current data reuse ratio and the memory constraint;
S1.4, splitting the original loop program into sub-nested loops of size equal to the block size and an outer loop of n1/n2 iterations used to invoke the sub-nested loops, where n1 is the size of the original loop program and n2 is the block size;
S1.5, vectorizing the loop blocks obtained after splitting;
S1.6, calculating the corresponding single-block cost for each vectorized loop block using the cost function;
S1.7, multiplying the single-block cost by the number of loop blocks to obtain the current total cost, and jumping back to S1.2.
Optionally, in step S1.2, the non-convergence condition is that the current value of the total cost is greater than or equal to the value of the total cost in the previous iteration.
Optionally, the data reuse ratio of each loop dimension in step S1.1 is initialized to 1; in step S1.3, updating the current data reuse ratio according to the loop dimensions includes taking, for each loop dimension, the vector width obtained when vectorizing the loop blocks split in the previous iteration as that dimension's data reuse ratio, with the blocking parameters set inversely proportional to these ratios.
Optionally, the memory constraint in step S1.3 includes the amount of data that can be read simultaneously; determining the current block size according to the current data reuse ratio and the memory constraint includes requiring that the amount of data corresponding to a loop block of the chosen block size be less than or equal to the amount of data that can be read simultaneously under the memory constraint.
Optionally, step S2 includes:
S2.1, searching for instruction sequences in various expression forms that are functionally equivalent to the loop blocks in the vector DSL program;
S2.2, calculating, via a preset cost evaluation function, the execution cost of each candidate instruction sequence, selecting the sequence with the minimum cost as the optimal instruction sequence for the loop block, and obtaining the inline assembly program from the optimal instruction sequences.
Optionally, calculating the execution cost of all instructions in the candidate instruction sequences in step S2.2 refers to calculating the real execution cycles of each instruction in the sequence on the DSP together with the time cost after overlapping the idle beats between instructions.
Optionally, step S3 includes:
S3.1, performing VLIW instruction-level scheduling on the instruction sequence of the inline assembly program to obtain a rearranged instruction sequence;
S3.2, performing software pipelining on the rearranged instruction sequence;
S3.3, generating assembly code from the software-pipelined instruction sequence according to the ISA information of the DSP chip.
In addition, the invention also provides a code optimization generation system oriented to SIMD and VLIW architectures, comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the code optimization generation method oriented to SIMD and VLIW architectures.
Furthermore, the invention provides a computer-readable storage medium storing a computer program to be executed by a microprocessor programmed or configured to perform the code optimization generation method oriented to SIMD and VLIW architectures.
Compared with the prior art, the invention has the following advantages. The method comprises the following steps: S1, blocking an input original loop program to obtain a vector DSL program; S2, vectorizing the vector DSL program to obtain an inline assembly program; and S3, performing instruction-level code optimization on the inline assembly program to obtain assembly code. The invention can automatically generate kernel code that runs efficiently on the SIMD+VLIW architecture for kernel programs of various scales, and can fully and efficiently exploit the characteristics of the SIMD+VLIW architecture for optimization, thereby lightening the developers' burden, greatly improving the execution efficiency of the kernel code, and striking a balance between programmability and performance.
Drawings
FIG. 1 is a schematic diagram of the basic flow of the method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a modular implementation of the method of the embodiment of the present invention.
FIG. 3 is a schematic diagram of the basic flow of step S1 in the embodiment of the present invention.
FIG. 4 is a schematic diagram of the basic flow of step S2 in the embodiment of the present invention.
FIG. 5 is a diagram of a vectorization example of step S2 in the embodiment of the present invention.
FIG. 6 is a schematic diagram of the basic flow of step S3 in the embodiment of the present invention.
Detailed Description
The invention aims to reduce the burden of manually writing highly optimized kernel function code (core computation code running on a DSP chip, such as convolution, matrix multiplication, filtering and other operator code) for SIMD+VLIW back ends, while fully exploiting the characteristics of the architecture and automatically generating efficient kernel code for the back end. The programmer only needs to write the naive implementation logic of the kernel function (a C++ implementation); the original kernel code is then optimized by a series of optimization modules which, guided by the characteristics of the back-end architecture, generate assembly instructions optimized through vectorization and instruction scheduling that can run directly on the back end. In addition, the invention adopts an agile design and can provide rapid porting support for the back ends of various vendors.
As shown in FIG. 1, the code optimization generation method oriented to SIMD and VLIW architectures of this embodiment includes:
S1, blocking an input original loop program to obtain a vector DSL (domain-specific language) program;
S2, vectorizing the vector DSL program to obtain an inline assembly program;
and S3, performing instruction-level code optimization on the inline assembly program to obtain assembly code.
In this embodiment, the input original loop program (see FIG. 2; abbreviated as the input program) is written in C++, and may be written in other languages as needed.
The size of the input program has a significant impact on the search space and efficiency of subsequent vectorization. Data partitioning and layout determine the utilization of on-chip memory and computing resources, including the scalar and vector processing units. Especially for nested loops, the optimal block size is critical for a DSP with a SIMD architecture to perform at its best. This embodiment therefore provides an adaptive loop-blocking method with feedback that supports medium- and large-scale input programs and finds the optimal block size based on the DSP architecture.
Referring to FIG. 2, this embodiment uses a feedback-equipped adaptive loop-blocking module to execute step S1: the module selects a block size according to the initial data reuse ratio and splits the nested loop based on the loop parameter sizes. The scalar-vector co-optimization module (the module executing step S2) provides the cost value of the optimized vector DSL program, and the data reuse changes accordingly. Meanwhile, the total cost value equals the product of the number of blocks and the per-block cost, where the cost is calculated on the optimized vector DSL program. The method adjusts the block size according to the changed data reuse ratio and returns to the blocking stage. The auto-tuning process iterates until the total cost value converges, i.e., the feedback value of the final vectorized state no longer changes the data reuse and the execution cost of the vectorized split code fragments is minimal.
As shown in FIG. 3, step S1 in this embodiment includes:
S1.1, initializing the block size, the loop dimensions, the data reuse ratio of each loop dimension, the cost function and the total loop cost of the input original loop program;
S1.2, judging whether the current value of the total cost has not converged; if so, jumping to S1.3; otherwise, performing loop-nest splitting on the original loop program according to the current block size, converting it into a vector DSL program using a conversion tool (for example, this embodiment adopts Racket, a declarative intrinsic-function programming-language conversion tool), and jumping to S2;
S1.3, updating the current data reuse ratio according to the loop dimensions, and determining the current block size according to the current data reuse ratio and the memory constraint;
S1.4, splitting the original loop program into sub-nested loops of size equal to the block size and an outer loop of n1/n2 iterations used to invoke the sub-nested loops, where n1 is the size of the original loop program and n2 is the block size; for example, for a loop of size 1024 with block size 64, the loop is split into an outer loop with trip count 16 = 1024/64 and sub-nested loops of size 64;
S1.5, vectorizing the loop blocks obtained after splitting (see step S2);
S1.6, calculating the corresponding single-block cost for each vectorized loop block using the cost function;
S1.7, multiplying the single-block cost by the number of loop blocks to obtain the current total cost, and jumping back to S1.2.
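The feedback loop of steps S1.2–S1.7 can be sketched as follows. This is a minimal illustrative model, not the patent's implementation: the cost function, the candidate block sizes and all names (`adaptive_blocking`, `split_loop`) are stand-in assumptions.

```python
def split_loop(n1, n2):
    """S1.4: split a loop of size n1 into n1 // n2 outer iterations
    over sub-nested loops of block size n2."""
    assert n1 % n2 == 0, "block size must divide the loop size"
    return n1 // n2, n2

def adaptive_blocking(n1, cost_fn, candidate_blocks):
    """Try block sizes until the total cost stops decreasing
    (the non-convergence test of S1.2)."""
    prev_total, best = float("inf"), None
    for n2 in candidate_blocks:
        outer, block = split_loop(n1, n2)
        per_block = cost_fn(block)   # S1.6: cost of one vectorized block
        total = per_block * outer    # S1.7: total = per-block cost x #blocks
        if total >= prev_total:      # S1.2: cost has converged -> stop
            break
        prev_total, best = total, (outer, block)
    return best, prev_total

# Toy cost model: fixed per-block overhead of 100 plus work linear in size.
outer_block, total = adaptive_blocking(
    1024, cost_fn=lambda b: 100 + b, candidate_blocks=[256, 128, 64, 32])
```

Under this toy cost model the search keeps (outer = 4, block = 256) and stops as soon as a smaller block raises the total cost, mirroring the "greater than or equal to the previous value" convergence test.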
In this embodiment, the non-convergence condition in step S1.2 is that the current value of the total cost is greater than or equal to the value of the total cost in the previous iteration.
In this embodiment, the data reuse ratio of each loop dimension in step S1.1 is initialized to 1; in step S1.3, updating the current data reuse ratio according to the loop dimensions includes taking, for each loop dimension, the vector width obtained when vectorizing the loop blocks split in the previous iteration as that dimension's data reuse ratio. For example, when the loop dimensions of the original loop program are i, k and j, the data reuse ratio of each loop dimension in step S1.1 is 1, i.e., initialized to i:j:k = 1:1:1; when dimension i is vectorized with vector width 4, the overall ratio is updated to i:j:k = 4:1:1, and the loop-blocking parameters should be set inversely proportional to this ratio so as to fully use the data of the adjusted dimension.
In this embodiment, the memory constraint in step S1.3 includes the amount of data that can be read simultaneously; determining the current block size according to the current data reuse ratio and the memory constraint requires that the amount of data corresponding to a loop block of the chosen block size be less than or equal to the amount of data that can be read simultaneously under the memory constraint.
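One plausible reading of step S1.3 (block sizes inversely proportional to the reuse ratios, grown until the memory constraint would be violated) can be sketched as below. This is an illustrative interpretation only; the growth policy, the element-counted `mem_limit` and the function name are assumptions not taken from the patent.

```python
def block_sizes(reuse, mem_limit):
    """Start from integer block sizes inversely proportional to the
    per-dimension reuse ratios, then double all dimensions while the
    block footprint (in elements) still fits within mem_limit."""
    m = max(reuse.values())
    sizes = {d: max(1, m // r) for d, r in reuse.items()}
    while True:
        grown = {d: s * 2 for d, s in sizes.items()}
        footprint = 1
        for v in grown.values():
            footprint *= v
        if footprint > mem_limit:   # memory constraint of S1.3
            return sizes
        sizes = grown

# Reuse ratio i:j:k = 4:1:1 (dimension i vectorized with width 4),
# with an illustrative on-chip budget of 4096 elements.
sizes = block_sizes({"i": 4, "j": 1, "k": 1}, mem_limit=4096)
```

With these inputs the block grows to i=4, j=16, k=16 (footprint 1024 elements); one more doubling would exceed the 4096-element budget.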
Current mainstream compilers require several days and massive storage to generate high-performance executable code for medium- and large-scale inputs. This time and storage overhead reduces their usability and limits their range of application in practical scenarios. Manually optimizing medium- and large-scale inputs consumes a great deal of labor and time; meanwhile, a fixed loop-blocking scheme cannot fully exploit the characteristics brought by different architecture parameters, wasting hardware resources, so the generated kernel code cannot execute efficiently and is not portable enough. Therefore, starting from simplifying the programmer's work and effectively utilizing architecture resources, this embodiment designs an adaptive loop-blocking optimization scheme that uses designed feedback values to guide updates of the blocking scheme. When the kernel-function scale or the architecture parameters change, the loop-blocking parameters can be updated dynamically and adaptively, providing a blocking scheme adapted to both kernel and hardware, greatly improving search efficiency and hardware utilization, and freeing programmers from complex kernel optimization work.
To take advantage of the VLIW+SIMD nature of the DSP, vectorization should be employed to maximize computing power and hardware utilization. The kernel compiler of a DSP typically employs vectorization techniques, including loop-dependence analysis, but focuses only on vector units. Releasing the potential of scalar units to work in concert with vector units and thereby achieve better performance is a primary goal when designing the vectorization module of a code generation system. In fact, some vendor-specific DSP instructions, such as broadcast, reduce, scatter, gather, all-gather and all-reduce, can increase the efficiency of data movement between scalar and vector units. To sufficiently improve hardware utilization, it is important to support these specific instructions and their rewrite rules in code generation. Equality saturation can optimize and rewrite the vector DSL through user-defined equality rules to achieve minimum-cost data movement. The module adds rewrite rules for these specific instructions to exploit scalar-vector cooperation and reduce data movement. For example, a broadcast instruction requires only one data load, reducing redundant memory accesses and thus overhead. In addition, the module provides an interface for the user to adjust the cost function of the basic operators according to the instruction cycles on the target DSP. The cost function supports accurate feedback for the equality saturation to find the best code rewriting and optimization. The cost value is also sent back to the feedback-equipped adaptive loop-blocking module to guide the adaptive loop-blocking process. As shown in FIG. 4, step S2 in this embodiment includes:
S2.1, searching for instruction sequences in various expression forms that are functionally equivalent to the loop blocks in the vector DSL program;
S2.2, calculating, via a preset cost evaluation function, the execution cost of each candidate instruction sequence, selecting the sequence with the minimum cost as the optimal instruction sequence for the loop block, and obtaining the inline assembly program from the optimal instruction sequences.
Searching in step S2.1 for instruction sequences in various expression forms that are functionally equivalent to the loop blocks in the vector DSL program can be expressed as an equality-saturation search, because the loop block and the functionally equivalent instruction sequences stand in an equality-saturation relation. The example shown in FIG. 5 illustrates the process of generating, from an original loop-blocked program, a vectorized DSL program optimized by the scalar-vector co-optimization module. First, the first stage shows a matrix multiplication of a given scale, which a Racket program converts into the vector DSL code fragment of the second stage, where Concat denotes the concatenation of vector elements, Vecmac is the vector multiply-accumulate operation, and Vec is the vectorization operation that packs 4 unit elements into one vector. Via the equality-saturation search process, the code fragment of the second stage may be replaced by the functionally identical code fragment of the third stage, where Trans is the matrix transpose operation, Vecadd is the vector add operation, and Vecmul is the vector multiply operation. Since Vec(a, a, a) performs the same function as Broadcast(a), the two instructions can be interchanged, generating the fourth-stage code fragment. As the Broadcast instruction can be implemented on the DSP chip using scalar units, through this process the original program completes the scalar-vector cooperative optimization.
In step S2.2 of this embodiment, calculating the execution cost of all instructions in the candidate instruction sequences refers to calculating the real execution cycles of each instruction in the sequence on the DSP and the time cost after overlapping the idle beats between instructions.
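The cost-driven selection among equivalent sequences in step S2.2 can be sketched as a table lookup plus a minimum, echoing the four-stage example of FIG. 5. The cycle counts below are purely illustrative stand-ins, not real DSP latencies, and the candidate lists abstract away operands.

```python
# Illustrative per-instruction cycle costs (assumed, not vendor data).
COST = {"Vec": 4, "Concat": 2, "Vecmac": 2, "Vecadd": 1, "Vecmul": 1,
        "Trans": 3, "Broadcast": 1}

def seq_cost(seq):
    """Sum the cycle cost of every instruction in one candidate sequence."""
    return sum(COST[op] for op in seq)

def pick_best(candidates):
    """S2.2: keep the functionally equivalent sequence with minimum cost."""
    return min(candidates, key=seq_cost)

# Candidate rewrites of the same loop block (cf. the FIG. 5 stages):
candidates = [
    ["Vec", "Vec", "Vecmac"],           # stage-2 form
    ["Trans", "Vecmul", "Vecadd"],      # stage-3 form
    ["Broadcast", "Vecmul", "Vecadd"],  # stage-4 form, uses the scalar unit
]
best = pick_best(candidates)
```

Under these assumed costs the Broadcast-based rewrite wins, which is the same conclusion the embodiment draws: Vec(a, a, a) is replaced by Broadcast(a) because the scalar unit makes the data movement cheaper.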
The kernel compiler of a DSP typically employs vectorization techniques, including loop-dependence analysis, but focuses only on vector units. The back end of a SIMD+VLIW architecture mostly adopts a scalar+vector organization and provides some intrinsic instructions for data movement, which can greatly simplify the data movement of a kernel program, accelerate program execution, and improve hardware utilization. Mainstream vectorization techniques focus only on the use of vector units and ignore the possible performance gains from the scalar components of the hardware. Starting from releasing the potential of scalar and vector units working in concert, this embodiment provides an equivalent-replacement search over instruction descriptions, which can supply functionally equivalent replacements of instructions or instruction fragments for different data-movement or other special instructions; the user only needs to add the needed instructions and their functional descriptions. The user can also revise the execution-cost function according to the different costs of executing various instructions on different vendors' hardware, so as to generate more suitable kernel code.
Since the VLIW architecture relies heavily on static instruction-level optimization, this module provides a low-level vector DSL code component to generate and reschedule the intrinsic-function-based assembly instructions adopted by the vendor. These intrinsic assembly instructions do not yet account for instruction delay slots, parallel packing or register allocation, and are therefore suitable for fine-grained code optimization and generation. The component includes fine-grained instruction scheduling, software pipelining and adaptive code generation; the overall process is shown in FIG. 6. Referring to FIG. 6, step S3 includes:
S3.1, performing VLIW instruction-level scheduling on the instruction sequence of the inline assembly program to obtain a rearranged instruction sequence;
S3.2, performing software pipelining on the rearranged instruction sequence;
S3.3, generating assembly code from the software-pipelined instruction sequence according to the ISA information of the DSP chip.
In step S3.1, VLIW instruction-level scheduling may use any desired scheduling method. For example, as an optional implementation, this embodiment uses a list-scheduling policy improved with dynamic programming to obtain a near-optimal VLIW instruction-level schedule, exploring more instruction-level parallelism for the VLIW architecture; the rearranged instruction sequence obtained is the assembly instruction sequence after VLIW instruction arrangement. This is an existing method; see: C. Deng, Z. Chen, Y. Shi, X. Kong and M. Wen, "Exploring ILP for VLIW Architecture by Quantified Modeling and Dynamic Programming-Based Instruction Scheduling," 2022 27th Asia and South Pacific Design Automation Conference (ASP-DAC), 2022, pp. 256-261, doi: 10.1109/ASP-DAC52403.2022.9712500.
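The flavor of VLIW bundle packing can be shown with a plain greedy list scheduler over a dependence DAG. This deliberately simplified sketch (it ignores latencies and functional-unit types, and is not the cited dynamic-programming variant) uses made-up instruction names:

```python
def list_schedule(deps, width):
    """Greedy list scheduling: each cycle packs up to `width` ready
    instructions into one VLIW bundle. `deps` maps each instruction to
    the set of instructions it depends on; returns a list of bundles."""
    done, bundles = set(), []
    remaining = dict(deps)
    while remaining:
        # An instruction is ready once all its predecessors have issued.
        ready = [i for i, p in remaining.items() if p <= done]
        bundle = sorted(ready)[:width]   # deterministic slot filling
        bundles.append(bundle)
        done |= set(bundle)
        for i in bundle:
            del remaining[i]
    return bundles

# Toy DAG: two loads feed a multiply; an independent load and the
# multiply feed a final add.
deps = {"ld_a": set(), "ld_b": set(), "ld_c": set(),
        "mul": {"ld_a", "ld_b"}, "add": {"mul", "ld_c"}}
bundles = list_schedule(deps, width=2)
```

On a 2-wide machine this yields three bundles; the independent load `ld_c` is hoisted alongside `mul`, which is exactly the kind of slot filling that exposes instruction-level parallelism on VLIW.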
In step S3.2, the input to software pipelining is an assembly-like loop code sequence based on the vector DSL (the rearranged instruction sequence), and the output is the instruction sequence after software pipelining. Automatic software pipelining is a scheduling strategy oriented to kernel core loops that exploits the DSP's unoccupied functional units. Considering that the optimal initiation interval (i.e., the length of the steady stage) is critical to the static scheduling of loop blocks in the pipeline, this embodiment adopts an iterative-search-based approach to find the optimal instruction layout and initiation interval: first, considering forward and backward dependences, the steady stage is searched by ascending enumeration, and then the loop block is backfilled according to the optimal result and all pipeline stages are generated. This is an existing method; see: J. A. Pienaar, S. Chakradhar and A. Raghunathan, "Automatic generation of software pipelines for heterogeneous parallel systems," SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Salt Lake City, UT, USA, 2012, pp. 1-12, doi: 10.1109/SC.2012.22. Every code fragment is optimized and validated by only one of the software-pipelining module and the instruction-scheduling module: the software-pipelining module applies only to loop code, while non-loop code is optimized by the instruction-scheduling module.
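The ascending-enumeration search for the steady stage can be illustrated with the standard initiation-interval (II) bounds from modulo scheduling: start at the resource-constrained lower bound and count up until every dependence recurrence also fits. This is a generic sketch of that idea, not the patent's algorithm; the numbers are invented.

```python
from math import ceil

def min_ii(n_ops, width, recurrences):
    """Ascending enumeration of the initiation interval.
    n_ops: instructions in the loop body; width: VLIW issue width;
    recurrences: list of (total latency, iteration distance) for each
    dependence cycle in the loop."""
    ii = ceil(n_ops / width)              # resource-constrained lower bound
    # A recurrence with latency L and distance d requires L <= II * d.
    while any(lat > ii * dist for lat, dist in recurrences):
        ii += 1                           # enumerate upward (steady-stage search)
    return ii

# 8 ops on a 4-wide VLIW with one recurrence: 5-cycle latency, distance 1.
ii = min_ii(8, 4, recurrences=[(5, 1)])
```

Here the resource bound alone would allow II = 2, but the 5-cycle recurrence forces II = 5; the loop block would then be backfilled into prologue, steady stage and epilogue around that interval.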
Step S3.3 performs rapid adaptive code generation. Its input is the assembly instruction sequence optimized by the two modules above, together with an instruction set architecture (ISA) information description file provided for the back-end architecture; its output is assembly code annotated with back-end information, which is turned directly into the final binary by the tool chain provided by the DSP vendor. Generally, code generation is tied to each vendor's instruction set architecture and internal back-end instructions. To break this barrier and realize adaptive code generation for various back ends, this step provides an agile design for VLIW back ends via an external ISA description file covering instruction formats, beats, functional units, registers and so on. Based on the external ISA description file, the module can obtain all necessary ISA information and convert the corresponding instructions, then compile them for execution on a back-end processor of VLIW architecture. The VLIW architecture in a DSP requires fine-grained instruction-level optimization and adaptive code generation for the various DSPs; without this process the execution efficiency of the code is greatly compromised and the characteristics of the VLIW architecture cannot be fully exploited. Meanwhile, general code generation depends on a complete tool chain provided by the vendor, imposing a heavy workload on developers and lacking portability. This embodiment provides an instruction scheduling method based on dynamic programming that can effectively generate an optimized instruction arrangement for the back end and automatically search for an optimized parallel arrangement; at the same time, to support other back ends, this embodiment provides additional instruction-set description files. By changing the instruction formats in the description file, the programmer can complete rapid porting and development of kernel code oriented to different back ends.
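An external ISA description driving code emission can be sketched as a small declarative table the generator consults. The table shape, field names and the `vmul`/`vadd` mnemonics below are illustrative assumptions, not any real vendor's ISA file format.

```python
# Hypothetical ISA description: per-instruction format string, latency in
# beats, and the functional unit, as the external description file might
# declare them for S3.3.
ISA = {
    "vmul": {"fmt": "vmul {dst}, {a}, {b}", "beats": 2, "unit": "VMAC"},
    "vadd": {"fmt": "vadd {dst}, {a}, {b}", "beats": 1, "unit": "VALU"},
}

def emit(op, **operands):
    """Look up the instruction in the ISA table and render one line of
    back-end assembly; porting to another back end only requires
    swapping the table."""
    return ISA[op]["fmt"].format(**operands)

line = emit("vmul", dst="vr2", a="vr0", b="vr1")
```

Retargeting then amounts to replacing the `ISA` table with another vendor's description file, matching the porting claim above.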
In summary, for medium- and large-scale kernel program inputs, automatic vectorizing code generation methods consume too much search time and storage space because the search space is too large, which reduces usability and limits the range of application in practical scenarios; naive vectorization methods do not consider the scalar-vector co-optimization space, cannot fully mine the performance gains this characteristic brings, and cannot fully exert the potential of the architecture, yet scalar-vector cooperation is critical to kernel optimization; at the same time, fine-grained instruction-level optimization is essential for the VLIW architecture, and rational instruction scheduling and optimization play an important role in exploiting its potential. Therefore, the invention provides a high-performance automatic code optimization and generation method and system for the SIMD+VLIW architecture, which jointly optimize kernel code generation through steps S1 to S3 and can generate efficient kernel code for the SIMD+VLIW architecture without manual intervention, greatly reducing the developers' burden while making good use of the architecture's characteristics, exerting its advantages, and shortening the execution time of the kernel program. The core technical elements of the invention are adaptive loop-code blocking, search-based scalar-vector cooperative vectorization, and fine-grained instruction-level optimization.
The invention aims to reduce the burden of manually writing highly optimized kernel function codes (core computation codes such as convolution, matrix multiplication, filtering and other operator codes in programs running on a DSP chip) for a SIMD+VLIW architecture back end, while fully utilizing the characteristics of the architecture and automatically generating efficient kernel codes for that back end. The programmer only needs to write the naive implementation logic of the kernel function (a C++ implementation); the original kernel code is then optimized through a series of optimization modules, and, with reference to the characteristics of the back-end architecture, assembly instructions are generated that can run directly on the back end and have been optimized through vectorization and instruction scheduling. In addition, the invention provides an agile development design that can offer rapid porting support for the back ends of various vendors. The following technical effects can be achieved by adopting the method of this embodiment. First: the method is applicable to platforms adopting the SIMD+VLIW architecture, so that a programmer only needs to write a naive implementation of a kernel function on such a platform to generate efficient kernel codes executable on the back end. Second: the adaptive loop blocking with feedback provided by the method is suitable for medium- and large-scale kernels, avoids manual intervention, can adaptively adjust the block size, can effectively reduce the space and time cost of the vectorization search, and is also applicable to other scenarios requiring loop optimization.
Third: the scalar-vector cooperative vectorization provided by the method of this embodiment can give full play to the roles of the scalar and vector components in the SIMD+VLIW architecture, improving both hardware utilization and the execution efficiency of the generated kernel function. Fourth: the fine-grained instruction-level code optimization provided by the method can generate tightly packed kernel function codes that fully utilize instruction-level parallelism and exert the characteristics and potential of the VLIW architecture. The method also provides agile back-end development for different vendors, breaking the barriers between back ends and realizing adaptive code generation for various back ends.
In addition, this embodiment also provides a code optimization generation system oriented to SIMD and VLIW architectures, comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the code optimization generation method oriented to SIMD and VLIW architectures. Furthermore, this embodiment also provides a computer-readable storage medium in which a computer program is stored, the computer program being intended to be programmed or configured by a microprocessor to perform the code optimization generation method oriented to SIMD and VLIW architectures.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. 
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (10)

1. A code optimization generation method oriented to SIMD and VLIW architectures, characterized by comprising the following steps:
S1, dividing an input original loop program into blocks to obtain a vector DSL program;
S2, vectorizing the vector DSL program to obtain inline assembly code;
and S3, performing instruction-level code optimization on the inline assembly code to obtain assembly codes.
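As a non-normative illustration, the three-stage flow of steps S1 to S3 (blocking, vectorization, instruction-level optimization) can be sketched as a simple pipeline; the stage functions below are hypothetical stubs standing in for the actual transformation passes, not part of the claimed method:

```cpp
#include <cassert>
#include <string>

// Illustrative stubs only; real stages would transform program representations.
std::string blockLoops(const std::string& loop) { return "dsl(" + loop + ")"; }        // S1: adaptive loop blocking -> vector DSL
std::string vectorize(const std::string& dsl)   { return "asm_inline(" + dsl + ")"; }  // S2: scalar-vector vectorization -> inline assembly
std::string schedule(const std::string& inl)    { return "asm(" + inl + ")"; }         // S3: VLIW scheduling + software pipelining

// The method chains the three stages on the naive kernel implementation.
std::string generateKernel(const std::string& loop) {
    return schedule(vectorize(blockLoops(loop)));  // S1 -> S2 -> S3
}
```

The point of the sketch is only the fixed stage order: each stage consumes the previous stage's representation, so the pipeline is a straight composition.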
2. The code optimization generation method for SIMD and VLIW architecture according to claim 1, characterized in that step S1 comprises:
S1.1, initializing the block size, the loop dimensions, the data reuse ratio of each loop dimension, the cost function, and the total cost of the loop for an input original loop program;
S1.2, judging whether the current value of the total cost has not converged; if so, jumping to S1.3; otherwise, performing loop-nest segmentation on the original loop program according to the current block size, converting the original loop program into a vector DSL program by using a conversion tool, and jumping to S2;
S1.3, updating the current data reuse ratio according to each loop dimension, and determining the current block size according to the current data reuse ratio and a memory constraint condition;
S1.4, dividing the original loop program into sub-nested loops whose size equals the block size and n1/n2 outer loops used to invoke the sub-nested loops, where n1 is the size of the original loop program and n2 is the block size;
S1.5, vectorizing the loop blocks obtained after segmentation;
S1.6, calculating the corresponding single-block cost for each vectorized loop block by applying the cost function;
S1.7, multiplying the single-block cost by the number of loop blocks to obtain the current total cost, and jumping to S1.2.
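The feedback loop of steps S1.1 to S1.7 can be sketched as follows; the cost model, the vector-width cap, and all names are illustrative assumptions for the sketch rather than the patented implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Minimal sketch of the feedback-driven blocking search (S1.1 - S1.7).
struct BlockingResult { long blockSize; double totalCost; };

BlockingResult searchBlockSize(long n1 /* original loop size */,
                               long memLimit /* data readable at once */) {
    double reuse = 1.0;             // S1.1: initial data reuse ratio
    double prevCost = INFINITY;     // total cost of the previous iteration
    long blockSize = memLimit;
    for (int iter = 0; iter < 64; ++iter) {
        // S1.3: candidate block size limited by the memory constraint, scaled by reuse
        long candidate = std::max(1L, (long)(memLimit * reuse));
        // S1.4: number of loop blocks after tiling
        long numBlocks = (n1 + candidate - 1) / candidate;
        // S1.5/S1.6: assume vectorization gives a per-block cost (toy model)
        double vectorWidth = std::min<double>(candidate, 16.0);
        double perBlockCost = candidate / vectorWidth + 8.0;  // + fixed overhead
        // S1.7: total cost = single-block cost * number of blocks
        double totalCost = perBlockCost * numBlocks;
        if (totalCost >= prevCost) break;   // S1.2: cost no longer decreasing -> converged
        prevCost = totalCost;
        blockSize = candidate;
        reuse = 1.0 / vectorWidth;          // S1.3 update for the next iteration
    }
    return {blockSize, prevCost};
}
```

The search terminates as soon as an iteration fails to reduce the total cost, matching the convergence test of step S1.2.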
3. The code optimization generation method for SIMD and VLIW architectures according to claim 2, characterized in that, when judging in step S1.2 whether the current value of the total cost has not converged, the non-convergence condition is that the current value of the total cost is greater than or equal to the value of the total cost in the previous iteration.
4. The code optimization generation method for SIMD and VLIW architectures according to claim 3, wherein the data reuse ratio of each loop dimension in step S1.1 is initialized to 1; and in step S1.3, updating the current data reuse ratio according to each loop dimension comprises taking the reciprocal of the vector width obtained for each loop dimension when vectorizing the loop blocks in the previous iteration as that loop dimension's data reuse ratio.
5. The code optimization generation method for SIMD and VLIW architectures according to claim 3, wherein the memory constraint condition in step S1.3 includes the amount of data that can be read simultaneously, and determining the current block size according to the current data reuse ratio and the memory constraint condition comprises ensuring that the amount of data corresponding to a loop block of that block size satisfies the simultaneously readable data amount in the memory constraint condition.
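A minimal sketch of the block-size rule of claims 4 and 5, under the assumptions that the reuse ratio is the reciprocal of the previous iteration's vector width and that the block is halved until its reuse-scaled data volume fits the simultaneously readable amount; the function and its halving policy are illustrative only:

```cpp
#include <cassert>

// Claims 4-5 block-size rule, sketched for one loop dimension.
long blockSizeFor(long vectorWidth /* from previous iteration */,
                  long dimExtent, long memReadable) {
    double reuse = 1.0 / (double)vectorWidth;  // claim 4: reuse = 1 / vector width
    long size = dimExtent;
    // claim 5: shrink until the block's (reuse-scaled) data volume
    // satisfies the simultaneously readable memory constraint
    while (size > 1 && (double)size * reuse > (double)memReadable)
        size /= 2;
    return size;
}
```

With a wider vector (higher reuse, smaller effective footprint) the dimension can keep a larger block for the same memory budget, which is the intuition behind scaling by the reuse ratio.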
6. The SIMD and VLIW architecture oriented code optimization generation method of claim 1, characterized in that step S2 comprises:
S2.1, searching for instruction sequences of various expressions that are functionally equivalent to the loop blocks in the vector DSL program;
S2.2, calculating, through a preset cost evaluation function, the execution cost values of all instructions in the instruction sequences of the various expressions, selecting the instruction sequence of the expression with the minimum cost value as the optimal instruction sequence for the loop block, and obtaining the inline assembly code from the optimal instruction sequence for the loop block.
7. The code optimization generation method for SIMD and VLIW architectures according to claim 6, wherein calculating the cost values of all instructions in the instruction sequences of the various expressions in step S2.2 means calculating, for each instruction in an instruction sequence, its real execution cycles on the DSP, with the time cost computed by overlapping the empty beats (pipeline slots) between instructions.
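The selection rule of claims 6 and 7 can be illustrated as follows; the Instr layout, cycle counts, and overlap model are invented for the sketch and do not reflect a real DSP ISA:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One instruction in a candidate sequence: its real execution cycles and
// the beats hidden by overlap with the preceding instruction (toy model).
struct Instr { std::string op; int cycles; int overlap; };

// Claim 7 style cost: sum of execution cycles minus cycles hidden by
// beat overlap between adjacent instructions.
int sequenceCost(const std::vector<Instr>& seq) {
    int cost = 0;
    for (size_t i = 0; i < seq.size(); ++i)
        cost += seq[i].cycles - (i > 0 ? seq[i].overlap : 0);
    return cost;
}

// Claim 6: among functionally equivalent sequences, keep the cheapest.
std::vector<Instr> pickBest(const std::vector<std::vector<Instr>>& candidates) {
    return *std::min_element(candidates.begin(), candidates.end(),
        [](const std::vector<Instr>& a, const std::vector<Instr>& b) {
            return sequenceCost(a) < sequenceCost(b);
        });
}
```

For example, a two-instruction vector sequence with one beat of overlap can beat a shorter scalar sequence whose instructions cannot overlap, which is exactly the kind of trade-off the cost evaluation function arbitrates.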
8. The SIMD and VLIW architecture oriented code optimization generation method of claim 1, characterized in that step S3 comprises:
S3.1, performing VLIW instruction-level scheduling on the instruction sequence of the inline assembly code to obtain a rearranged instruction sequence;
S3.2, performing software pipelining on the rearranged instruction sequence;
S3.3, generating assembly codes from the software-pipelined instruction sequence according to the ISA information of the DSP chip.
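Step S3.1's VLIW packing can be illustrated with a greedy bundler; the slot model and hazard check below are simplifying assumptions (real scheduling would also honor instruction latencies and perform the software pipelining of step S3.2):

```cpp
#include <cassert>
#include <string>
#include <vector>

// One instruction: the functional unit it needs, the register it defines,
// and the registers it reads. Layout is assumed for illustration.
struct Instr { int unit; std::string def; std::vector<std::string> use; };

// Greedy sketch of VLIW packing: walk the (already dependence-ordered)
// sequence and put an instruction into the current long-instruction bundle
// when its functional-unit slot is free and it reads no register defined
// earlier in the same bundle; otherwise open a new bundle.
std::vector<std::vector<Instr>> packBundles(const std::vector<Instr>& seq,
                                            int slotsPerUnit) {
    std::vector<std::vector<Instr>> bundles;
    for (const Instr& ins : seq) {
        bool placed = false;
        if (!bundles.empty()) {
            std::vector<Instr>& last = bundles.back();
            int used = 0;
            bool hazard = false;
            for (const Instr& p : last) {
                if (p.unit == ins.unit) ++used;
                for (const std::string& u : ins.use)
                    if (u == p.def) hazard = true;  // read-after-write within one bundle
            }
            if (used < slotsPerUnit && !hazard) { last.push_back(ins); placed = true; }
        }
        if (!placed) bundles.push_back({ins});
    }
    return bundles;
}
```

Independent instructions on different units collapse into one bundle, while a dependent instruction is pushed to the next bundle, which is how tightly packed execute packets exploit instruction-level parallelism.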
9. A code optimization generation system oriented to SIMD and VLIW architectures, comprising a microprocessor and a memory connected to each other, characterized in that the microprocessor is programmed or configured to perform the code optimization generation method oriented to SIMD and VLIW architectures according to any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored therein, characterized in that the computer program is to be programmed or configured by a microprocessor to perform the code optimization generation method oriented to SIMD and VLIW architectures according to any one of claims 1 to 8.
CN202310223369.8A 2023-03-09 2023-03-09 Code optimization generation method and system oriented to SIMD and VLIW architecture Pending CN116450138A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310223369.8A CN116450138A (en) 2023-03-09 2023-03-09 Code optimization generation method and system oriented to SIMD and VLIW architecture

Publications (1)

Publication Number Publication Date
CN116450138A true CN116450138A (en) 2023-07-18

Family

ID=87124618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310223369.8A Pending CN116450138A (en) 2023-03-09 2023-03-09 Code optimization generation method and system oriented to SIMD and VLIW architecture

Country Status (1)

Country Link
CN (1) CN116450138A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination