CN109933327B - OpenCL compiler design method and system based on code fusion compiling framework - Google Patents

Publication number
CN109933327B
Authority
CN
China
Prior art keywords: kernel, code, abstract syntax, syntax tree, host
Legal status
Active
Application number
CN201910106880.3A
Other languages
Chinese (zh)
Other versions
CN109933327A (en
Inventor
刘颖
黄磊
伍明川
崔慧敏
冯晓兵
Current Assignee
Zhongke Jiahe Beijing Technology Co ltd
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN201910106880.3A
Publication of CN109933327A
Application granted
Publication of CN109933327B

Abstract

The invention relates to an OpenCL compiler design method and system based on a code fusion compilation framework, comprising: providing a shared-memory-based host-kernel code fusion compilation framework that fuses the code of the two ends at the AST level, which is the compiler's intermediate representation; using a WII-CFG graph to describe the instruction execution behavior among threads after the kernel code is instantiated into many threads, that is, to analyze the platform-sensitive program execution behavior within a work-group; using joint host-kernel data flow analysis to discover data flow relations that cross the host and kernel ends, as well as data flow relations among threads, and thereby analyze the data dependences between the code of the two ends; and, based on the foregoing analyses, applying targeted code optimizations and generating assembly code to complete the compilation process. The method can target different acceleration devices, analyze the host code and the kernel code together, and fully explore optimization opportunities among threads, so that OpenCL programs achieve good performance portability.

Description

OpenCL compiler design method and system based on code fusion compiling framework
Technical Field
The invention relates to the technical field of development and optimization of compilers, in particular to a compilation framework design and compiler implementation method and system for OpenCL language and heterogeneous platforms.
Background
In recent years, heterogeneous architectures have become mainstream: the top three systems on the global TOP500 supercomputer list are heterogeneous platforms, more than 100 systems on the list are heterogeneous, and the "processor core + acceleration device" architecture is now widespread in servers, personal computers and terminal devices. A heterogeneous computing system usually consists of a CPU and one or more accelerators interconnected on-chip or on a motherboard, where the CPU handles complex control and scheduling tasks and the accelerators handle massively parallel or domain-specific computing tasks. Among heterogeneous parallel programming models, CUDA, introduced by NVIDIA, and OpenCL, published by the Khronos Group, are the two mainstream models today. OpenCL is a cross-platform parallel programming model applicable to many kinds of acceleration devices and therefore has a wider range of application than CUDA.
Heterogeneity of the target code is a major challenge for compilation tools under a heterogeneous parallel computing framework. Heterogeneous program code is divided into host code and device code, which run on the CPU and on the acceleration device respectively. The former is responsible for data initialization, data exchange and control of the acceleration device; the latter is responsible for parallel execution of the core computation. The two parts therefore have different compilation targets and different optimization goals. Existing heterogeneous parallel programs are compiled separately: the code that runs on each device is compiled and optimized independently. In this separate compilation mode the compilation tools for different devices are independent of one another and can each generate fully optimized code for their own device. Most commercially successful compilation systems, such as NVIDIA's CUDA compiler nvcc and AMD's OpenCL compilation and runtime framework, are designed around separate compilation.
However, separate compilation ignores the associations between the heterogeneous codes. For example, in an OpenCL program the host code and the kernel code are compiled completely independently and share no compile-time information. In practice, however, the host code interacts with the acceleration device by calling the OpenCL API to control execution of the kernel code. When the kernel code is compiled, the compiler cannot obtain relevant information from the host code, such as the input parameters, the array layout and the work-group configuration, so the optimization opportunities for the kernel code are limited and the quality of the generated code suffers. For compiling and optimizing code under heterogeneous parallel computing frameworks, "separation" versus "fusion" is a question compilers must always weigh. On the one hand, because the final code runs on a heterogeneous platform, the code for the different ends must be compiled separately and supported by additional complex mechanisms (linking, runtime and so on); on the other hand, the codes at the different ends are correlated, and this information is needed for deep optimization. From the standpoint of deep code optimization, fusion compilation is necessary.
To achieve deep optimization of heterogeneous code and improve the performance portability of OpenCL, the invention proposes an OpenCL compiler design method based on host-kernel code fusion compilation, which delivers optimized program code by source-to-source transformation. It fuses and compiles the host code and the kernel code to enable whole-program analysis and optimization, explores optimization opportunities within and among threads, and, addressing the poor performance portability of OpenCL programs, gives programs good cross-platform performance portability. Unlike previous work, the method proposes a host-kernel code fusion compilation framework and a way to construct it, and builds two compilation infrastructures on that framework to guide targeted optimization of the kernel code: the WII-CFG (Work-Item Interleaving CFG) graph, which models the execution order of work-items and the relevant platform features, and joint host-kernel data flow analysis.
The compiler design of the method comprises four main parts: (1) a shared-memory-based host-kernel code fusion compilation framework, which fuses the code of the two ends at the AST level, the compiler's intermediate representation; (2) the WII-CFG (Work-Item Interleaving CFG) graph, which describes the instruction execution behavior among threads after the kernel code is instantiated into many threads, that is, it analyzes the platform-sensitive program execution behavior within a work-group; (3) joint host-kernel data flow analysis, which discovers data flow relations that cross the host and kernel ends, as well as data flow relations among threads, in order to analyze the data dependences between the host code and the kernel code; (4) targeted code optimization based on the foregoing analyses, followed by generation of assembly code to complete the compilation process.
Disclosure of Invention
Poor performance portability is a major concern for OpenCL programs. The invention therefore proposes a compiler design method based on a host-kernel code fusion compilation framework that includes two compilation infrastructures, the WII-CFG graph and joint host-kernel data flow analysis, and aims to give OpenCL programs a foundation for deep optimization and good performance portability. To analyze optimization opportunities among threads (work-items), the method performs its analysis and optimization on the threads within a work-group.
Specifically, the invention discloses an OpenCL compiler design method based on a code fusion compiling framework, which comprises the following steps:
step 1, obtaining an OpenCL source program, compiling the host code in the source program into a host abstract syntax tree, obtaining the kernel code file referenced by a kernel start function in the host abstract syntax tree, compiling the kernel code file to obtain a kernel abstract syntax tree, storing the kernel abstract syntax tree in a shared memory, and retrieving and reconstructing all kernel abstract syntax trees from the shared memory to obtain a fused abstract syntax tree that fuses the host abstract syntax tree and the kernel abstract syntax tree;
step 2, obtaining the respective control flow graphs of the host abstract syntax tree and the kernel abstract syntax tree based on the fused abstract syntax tree, adding edges that connect the instruction at the function call site and the instruction at the function return site to obtain an inline control flow graph, obtaining the execution order of the instructions in the work-items of the kernel on the corresponding target platform according to a WII function that reflects the target platform characteristics, and marking the execution order in the inline control flow graph to obtain a WII-CFG graph;
step 3, obtaining a correspondence between host-side variables and kernel variables as a first analysis result by analyzing the parameters passed to the kernel function and the parameters of the OpenCL API calls that transfer data between the host and the device, and performing data flow analysis on the WII-CFG graph to obtain a second analysis result;
step 4, optimizing the kernel code in the fused abstract syntax tree according to the first analysis result and the second analysis result to obtain an optimized abstract syntax tree;
step 5, translating the optimized abstract syntax tree with the compiler and outputting the optimized host code and kernel code as the compilation result.
In the OpenCL compiler design method based on the code fusion compilation framework, step 2 includes: obtaining the WII function of the target platform according to the thread execution mode of the target platform of the kernel code, wherein the WII function is used to calculate the execution order of the instructions of the work-items of the kernel on the target platform.
In the OpenCL compiler design method based on the code fusion compilation framework, step 3 includes:
analyzing the correspondence between the actual arguments passed to the kernel function and its formal parameters, together with the parameters of the OpenCL API calls that transfer data between the host and the device, to obtain the correspondence between host-side variables and kernel variables as the first analysis result, and performing data flow analysis on the WII-CFG graph to obtain the second analysis result, which comprises the definition-use chains and live ranges ("active periods") of the variables of the host code and the kernel code.
In the OpenCL compiler design method based on the code fusion compilation framework, the optimization in step 4 specifically includes:
a thread merging step of identifying redundant operations among threads according to the definition-use chains in the second analysis result, and merging the threads that execute the redundant operations into one coarse-grained thread so as to reduce code redundancy among threads;
a data layout step of selecting either an intra-thread contiguous layout or an inter-thread contiguous layout according to the first analysis result, the definition-use chains in the second analysis result and the thread organization and execution mode of the target platform, and implementing the corresponding code transformation;
a vectorization step of vectorizing inter-thread and intra-thread code according to the live ranges and the definition-use chains in the second analysis result.
The OpenCL compiler design method based on the code fusion compilation framework further includes: step 6, calling a native compiler to compile the compilation result and then running it according to the OpenCL compilation process.
The invention also discloses an OpenCL compiler design system based on the code fusion compilation framework, comprising the following modules:
module 1, which obtains an OpenCL source program, compiles the host code in the source program into a host abstract syntax tree, obtains the kernel code file referenced by a kernel start function in the host abstract syntax tree, compiles the kernel code file to obtain a kernel abstract syntax tree, stores the kernel abstract syntax tree in a shared memory, and retrieves and reconstructs all kernel abstract syntax trees from the shared memory to obtain a fused abstract syntax tree that fuses the host abstract syntax tree and the kernel abstract syntax tree;
module 2, which obtains the respective control flow graphs of the host abstract syntax tree and the kernel abstract syntax tree based on the fused abstract syntax tree, adds edges that connect the instruction at the function call site and the instruction at the function return site to obtain an inline control flow graph, obtains the execution order of the instructions in the work-items of the kernel on the corresponding target platform according to a WII function that reflects the target platform characteristics, and marks the execution order in the inline control flow graph to obtain a WII-CFG graph;
module 3, which obtains a correspondence between host-side variables and kernel variables as a first analysis result by analyzing the parameters passed to the kernel function and the parameters of the OpenCL API calls that transfer data between the host and the device, and performs data flow analysis on the WII-CFG graph to obtain a second analysis result;
module 4, which optimizes the kernel code in the fused abstract syntax tree according to the first analysis result and the second analysis result to obtain an optimized abstract syntax tree;
module 5, which translates the optimized abstract syntax tree with the compiler and outputs the optimized host code and kernel code as the compilation result.
In the OpenCL compiler design system based on the code fusion compilation framework, module 2 includes: obtaining the WII function of the target platform according to the thread execution mode of the target platform of the kernel code, wherein the WII function is used to calculate the execution order of the instructions of the work-items of the kernel on the target platform.
In the OpenCL compiler design system based on the code fusion compilation framework, module 3 includes:
analyzing the correspondence between the actual arguments passed to the kernel function and its formal parameters, together with the parameters of the OpenCL API calls that transfer data between the host and the device, to obtain the correspondence between host-side variables and kernel variables as the first analysis result, and performing data flow analysis on the WII-CFG graph to obtain the second analysis result, which comprises the definition-use chains and live ranges ("active periods") of the variables of the host code and the kernel code.
In the OpenCL compiler design system based on the code fusion compilation framework, the optimization in module 4 specifically includes:
a thread merging module, which identifies redundant operations among threads according to the definition-use chains in the second analysis result and merges the threads that execute the redundant operations into one coarse-grained thread so as to reduce code redundancy among threads;
a data layout module, which selects either an intra-thread contiguous layout or an inter-thread contiguous layout according to the first analysis result, the definition-use chains and the thread organization and execution mode of the target platform, and implements the corresponding code transformation;
a vectorization module, which vectorizes inter-thread and intra-thread code according to the live ranges and the definition-use chains in the second analysis result.
The OpenCL compiler design system based on the code fusion compilation framework further includes: module 6, which calls a native compiler to compile the compilation result and then runs it according to the OpenCL compilation process.
The technical effects of the invention comprise:
The OpenCL compiler design method provided by the invention covers an improved compilation framework, extended analysis techniques and targeted optimizations. It can target different acceleration devices, analyze the host code and the kernel code together, and fully explore optimization opportunities among threads, so that OpenCL programs achieve good performance portability.
Drawings
FIG. 1 is a WII function chart for each platform;
FIG. 2 is a WII-CFG diagram;
FIG. 3 is a diagram of the variable correspondence between the host side and the Kernel code;
FIG. 4 is a flow chart of a compilation process.
Detailed Description
In order to solve the above technical problem, an embodiment of the present invention includes:
A. Host-kernel code fusion: First, the compiler turns the host code into an intermediate representation, the abstract syntax tree (AST) HostAST. Second, the AST is traversed; when a kernel start function (such as clCreateProgramWithSource) is encountered, the kernel code file name is obtained and a child process is spawned that calls the compiler on the kernel code file to obtain KernelAST, stores KernelAST in shared memory, and then exits. Third, the ASTs of all kernel codes are retrieved from the shared memory buffer and reconstructed, which achieves the fusion of HostAST and KernelAST.
B. Control flow analysis based on the WII-CFG graph: The aim is to construct, for a specific target platform of the kernel code, a WII-CFG graph of the fused code as a basis for the subsequent data flow analysis and code optimization. First, an inline control flow graph (CFG) is constructed from the fused AST: it contains the CFGs of HostAST (the host abstract syntax tree) and KernelAST (the kernel abstract syntax tree), each built in the same way as a conventional CFG, plus a call edge and a return edge that connect them. Second, the WII (Work-Item Interleaving) function of the platform is obtained from the thread execution mode on the target platform of the kernel code, that is, whether the threads in a work-group execute serially one by one or several threads execute in parallel. The WII function calculates the execution order of a given instruction of a given work-item of the kernel on the corresponding target platform. Third, the CFG is refined based on the WII function so that the execution order of the kernel instructions is marked on it, yielding the WII-CFG graph.
C. Joint data flow analysis: First, the data correlation (data correspondence) between the host code and the kernel code is analyzed. By analyzing the correspondence between the actual arguments passed to the kernel function and its formal parameters, and the parameters of the data transfers that involve these actual arguments (that is, the related OpenCL API calls such as clEnqueueWriteBuffer and clEnqueueReadBuffer), the correspondence between host-side variables and kernel variables is obtained as the first analysis result (treated as an alias relation in the invention). Second, conventional data flow analysis is applied to the WII-CFG graph to perform joint host-device data flow analysis, which includes alias relations among the variables of the host code and the kernel code and among the variables of different threads, definition-use chains, live-range ("active period") analysis and so on.
D. Code optimization: The analysis results are used to optimize the code and improve the performance of the kernel code. First, thread merging optimization merges several threads into one coarse-grained thread to reduce code redundancy among threads. The definition-use chains of variables across different threads, obtained from the data flow analysis, identify the redundant operations among threads, which are exactly what thread merging targets. Second, data layout optimization selects one of two data layouts, intra-thread contiguous or inter-thread contiguous, according to the thread organization and execution mode of the target platform, and implements the corresponding code transformation. The alias relations and definition-use chains between the host code and the kernel code, obtained from the data flow analysis, guide a legal data layout transformation. Third, aggressive vectorization optimization vectorizes inter-thread and intra-thread code. Its code transformation changes the definition and use statements of the relevant variables, and therefore relies on the precise definition-use chains and live-range results of the data flow analysis.
E. Code generation and subsequent compilation: The host code and the kernel code are separated from the optimized fused AST and, after translation by the compiler, output as the optimized host code and kernel code (that is, the optimized OpenCL program source code). This code can then be compiled into a binary by a native compiler and run through the conventional OpenCL compilation process.
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The whole flow chart of the invention as shown in fig. 4 comprises:
Step 1, generating a fused AST. An OpenCL source program is input, and fusion compilation produces an AST in which the host and kernel code are fused. First, the compiler turns the host code into an intermediate representation, the abstract syntax tree (AST) HostAST. Second, the AST is traversed; when a kernel start function (such as clCreateProgramWithSource) is encountered, the kernel code file name is obtained and a shared memory region is set up for the process to communicate with its child process. The child process then calls the compiler on the kernel code file to obtain KernelAST, stores KernelAST in the shared memory, and exits. Third, the ASTs of all kernel codes are retrieved from the shared memory region. At this point the process can access both HostAST and KernelAST, so the two ASTs are fused. Shared memory is one form of inter-process communication: it allows two or more processes to map the same memory region into their address spaces, so that information written to the region by one process can be read by the other processes with an ordinary memory read.
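The flow of step 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the AST is a toy JSON structure, a file-backed `mmap` region stands in for the shared memory, and the kernel "compile" step is faked. The names `CHILD_SOURCE`, `fuse_host_and_kernel`, the `kernel_file` field and the file name `kernel.cl` are hypothetical.

```python
import json
import mmap
import os
import subprocess
import sys
import tempfile

# Child process: "compiles" the kernel file into a toy AST (an assumption;
# a real compiler front end would go here) and writes it to the shared region.
CHILD_SOURCE = r"""
import json, mmap, sys
region_path, kernel_file = sys.argv[1], sys.argv[2]
kernel_ast = {"kind": "KernelAST", "file": kernel_file,
              "body": [{"kind": "Assign", "target": "idx",
                        "value": "n[tid + j*A]"}]}
payload = json.dumps(kernel_ast).encode()
with open(region_path, "r+b") as f, mmap.mmap(f.fileno(), 0) as region:
    region[0:4] = len(payload).to_bytes(4, "little")   # length header
    region[4:4 + len(payload)] = payload               # serialized AST
"""


def fuse_host_and_kernel(host_ast, region_size=4096):
    """Attach a KernelAST to every kernel start call found in host_ast."""
    fd, region_path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * region_size)   # reserve the shared region
        os.close(fd)
        with open(region_path, "r+b") as f, \
                mmap.mmap(f.fileno(), 0) as region:
            for node in host_ast["body"]:
                if node.get("callee") != "clCreateProgramWithSource":
                    continue
                # The child compiles the kernel, stores its AST, then exits.
                subprocess.run(
                    [sys.executable, "-c", CHILD_SOURCE,
                     region_path, node["kernel_file"]],
                    check=True)
                size = int.from_bytes(region[0:4], "little")
                node["kernel_ast"] = json.loads(region[4:4 + size].decode())
    finally:
        os.remove(region_path)
    return host_ast


host_ast = {"kind": "HostAST", "body": [
    {"kind": "Call", "callee": "clCreateBuffer"},
    {"kind": "Call", "callee": "clCreateProgramWithSource",
     "kernel_file": "kernel.cl"},
]}
fused_ast = fuse_host_and_kernel(host_ast)
```

After the call, the node for the kernel start function carries the reconstructed kernel AST, so a single process can walk both trees.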
Step 2, control flow analysis based on the WII-CFG graph. The aim is to construct, for a specific target platform of the kernel code, a WII-CFG graph of the fused code as a basis for the subsequent data flow analysis and code optimization. The steps are as follows:
2.1) Construct an inline CFG from the fused AST. The inline control flow graph (inlined CFG) contains the respective CFGs of HostAST and KernelAST, each built in the same way as a conventional CFG. Because the kernel code is started (called) by the host code, a caller-callee relation exists, so a call edge and a return edge are added to connect the Host-CFG and the Kernel-CFG.
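A minimal sketch of step 2.1 follows, assuming each CFG is an adjacency dictionary. The node labels, the function name `build_inline_cfg`, and the use of `clEnqueueNDRangeKernel` as the call site are assumptions for illustration; the sketch keeps the original host edges and adds the two new ones.

```python
def build_inline_cfg(host_cfg, kernel_cfg, call_site, kernel_entry, kernel_exit):
    """Merge both CFGs and add a call edge and a return edge between them."""
    inline = {node: list(succs) for node, succs in host_cfg.items()}
    for node, succs in kernel_cfg.items():
        inline[node] = list(succs)
    # Call edge: from the host call site to the first kernel instruction.
    inline[call_site].append(kernel_entry)
    # Return edge: from the last kernel instruction back to the host
    # instruction(s) that follow the call site.
    for resume in host_cfg[call_site]:
        inline[kernel_exit].append(resume)
    return inline


host_cfg = {
    "h0: write buffers": ["h1: clEnqueueNDRangeKernel"],
    "h1: clEnqueueNDRangeKernel": ["h2: read results"],
    "h2: read results": [],
}
kernel_cfg = {
    "k0: inst0": ["k1: inst1"],
    "k1: inst1": [],
}
inline_cfg = build_inline_cfg(
    host_cfg, kernel_cfg,
    call_site="h1: clEnqueueNDRangeKernel",
    kernel_entry="k0: inst0",
    kernel_exit="k1: inst1",
)
```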
2.2) Obtain the WII (Work-Item Interleaving) function from the characteristics of the target platform; it calculates the execution order of a given instruction of a given work-item of the kernel on that platform. The execution order of the work-items within one work-group is platform dependent. The two most common orders are serialized execution and data-parallel execution. In the former, the work-items execute one after another (Work-Item 1 starts only after Work-Item 0 has finished, and so on); typical examples are AMD CPUs, Tilera TILE-Gx series many-core chips and the domestic Sunway (Shenwei) many-core chip (SW26010). In the latter, the same instruction of several adjacent work-items executes in parallel (inst 0 of Work-Item 0 through Work-Item k executes in parallel, then inst 1 of the same work-items, then inst 2, and so on); examples are NVIDIA GPU chips in SIMT mode, Intel CPUs on which the instructions of adjacent threads are automatically converted into vector instructions at run time, and Xeon Phi chips. The functions are shown in Fig. 1, where tid denotes the thread id. According to the OpenCL specification a thread id has at most three dimensions, tid(0), tid(1) and tid(2), and tid is the global thread id computed from them.
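The two execution orders can be modeled with small functions. The formulas of Fig. 1 are not reproduced in the text, so the ones below are assumptions consistent with the description: they map a pair (tid, inst) to a step index, and work-items that share a step execute together. The function names and the dimension-0-fastest linearization are likewise illustrative choices.

```python
def wii_serial(tid, inst, num_insts):
    """Serialized execution: work-item tid finishes before tid + 1 starts."""
    return tid * num_insts + inst


def wii_data_parallel(tid, inst, num_insts, width):
    """Data-parallel execution: `width` adjacent work-items run each
    instruction in the same step (assumed model, groups run in order)."""
    return (tid // width) * num_insts + inst


def global_tid(tid0, tid1, tid2, size0, size1):
    """Linearize a three-dimensional id, with dimension 0 varying fastest."""
    return tid0 + size0 * (tid1 + size1 * tid2)
```

With three instructions per work-item, `wii_serial(1, 0, 3)` places the first instruction of work-item 1 after all three instructions of work-item 0, while `wii_data_parallel(0, 1, 3, 2)` and `wii_data_parallel(1, 1, 3, 2)` give the same step because the two work-items run inst 1 together.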
2.3) Refine the CFG based on the WII function. The kernel CFG is expanded by simple instantiation: each static kernel instruction is instantiated into a per-thread instruction tagged with its thread id tid, and the execution order of the thread instructions is marked according to the WII function, yielding the WII-CFG graph. Fig. 2 illustrates this: (a) is the inline CFG; (b) is the WII-CFG graph for a target platform with serialized kernel execution; (c) is the WII-CFG graph for a target platform with data-parallel kernel execution (parallelism 2). Refining the inline CFG with the instruction execution order given by the WII function thus yields WII-CFG graphs for both serialized and data-parallel execution platforms.
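The instantiation in step 2.3 can be sketched for a straight-line kernel as follows. This is an illustrative model only: it assumes a kernel without branches, uses an assumed WII formula in which `width` adjacent work-items share each step, and the function names are hypothetical.

```python
from collections import defaultdict


def build_wii_schedule(num_items, num_insts, width):
    """Group instantiated (tid, inst) nodes by execution step.

    width == 1 models serialized execution; width > 1 models data-parallel
    execution in which `width` adjacent work-items share each step.
    """
    steps = defaultdict(list)
    for tid in range(num_items):
        for inst in range(num_insts):
            step = (tid // width) * num_insts + inst   # assumed WII function
            steps[step].append((tid, inst))
    return [steps[s] for s in sorted(steps)]


def wii_edges(schedule):
    """Connect every node of one step to every node of the next step."""
    return [(a, b)
            for cur, nxt in zip(schedule, schedule[1:])
            for a in cur for b in nxt]


serial = build_wii_schedule(num_items=2, num_insts=2, width=1)
parallel = build_wii_schedule(num_items=2, num_insts=2, width=2)
```

For two work-items and two instructions, `serial` lists one node per step, and `parallel` (parallelism 2) places the same instruction of both work-items in one step.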
Step 3, joint data flow analysis. First the correspondence between host-side variables and kernel variables is obtained; then conventional data flow analysis is performed on the WII-CFG graph. Specifically:
3.1) Obtain the correspondence (here also called the alias relation) between host-side variables (including array variables and array pointers) and variables in the kernel. The analysis covers the correspondence between the actual arguments passed to the kernel function and its formal parameters, and the parameters of the data-transfer-related OpenCL API calls, such as clEnqueueWriteBuffer(), clEnqueueReadBuffer(), clEnqueueMapBuffer() and clSetKernelArg(). It focuses on the actual arguments passed into the kernel code and yields the host-side variables that correspond to them.
For example, analyzing the source code shown in Fig. 3(a):
(1) analyzing the actual arguments passed to the kernel function and its formal parameters gives the following correspondences:
d_f<->ker(0th)(=f);d_p<->ker(1th)(=p);
d_n<->ker(2th)(=n);nN<->ker(3th)(=N);
nA<->ker(4th)(=A);
(2) analyzing the parameters of the data-transfer-related API calls gives the following correspondences:
d_n<->h_n;d_p<->h_p;h_f<->d_f;
This yields the variable correspondence between the host code and the kernel code, as shown in Fig. 3(b).
Here a correspondence such as d_f<->ker(0th)(=f) means that the actual argument d_f is equivalent to the formal parameter ker(0th)(=f); that is, the symbol <-> means "equivalent".
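One way to combine the two groups of correspondences into alias classes is a union-find structure, sketched below. The pairs are the ones listed in the example; the function name `build_alias_classes` is hypothetical, and the sketch assumes that host-side and kernel-side names do not collide.

```python
def build_alias_classes(pairs):
    """Merge '<->' pairs into equivalence classes with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    classes = {}
    for name in list(parent):
        classes.setdefault(find(name), set()).add(name)
    return list(classes.values())


# (1) actual argument <-> formal parameter of the kernel function
arg_pairs = [("d_f", "f"), ("d_p", "p"), ("d_n", "n"), ("nN", "N"), ("nA", "A")]
# (2) correspondences from the data-transfer API calls
transfer_pairs = [("d_n", "h_n"), ("d_p", "h_p"), ("h_f", "d_f")]

alias_classes = build_alias_classes(arg_pairs + transfer_pairs)
```

The class that contains the kernel variable `n` also contains `d_n` and `h_n`, which is the relation the later data flow analysis relies on.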
3.2) Perform joint host-device data flow analysis on the WII-CFG graph using conventional data flow analysis. This includes alias relations among the variables of the host and device code and among the variables of different threads, definition-use chains, live-range ("active period") analysis and so on, which support later optimizations such as data layout.
Still taking Fig. 3 as an example: from the result of 3.1), n (in the kernel code) corresponds to h_n and d_n, and data flow analysis then shows that the actual definition point of n is the assignment to h_n in the host code. Such data flow results support the subsequent optimization analysis and code transformation.
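The Fig. 3 reasoning can be illustrated with a small definition-use pass over statements in execution order. This is a simplified sketch: it assumes straight-line code, treats the data transfers as alias information rather than as definitions, and uses hypothetical statement labels and a hypothetical `def_use_chains` helper.

```python
def def_use_chains(statements, alias_of):
    """Link each use to the latest definition of the same alias class.

    statements: list of (label, defined_vars, used_vars) in execution order.
    alias_of:   maps a variable name to its alias-class representative.
    """
    last_def = {}
    chains = []
    for label, defs, uses in statements:
        for var in uses:
            key = alias_of.get(var, var)
            if key in last_def:
                chains.append((last_def[key], label, var))
        for var in defs:
            last_def[alias_of.get(var, var)] = label
    return chains


# Alias classes from step 3.1: n, d_n and h_n name the same data.
alias_of = {"h_n": "n", "d_n": "n", "n": "n"}

statements = [
    ("host: h_n[i + j*nA] = neighborIter[i][j]", ["h_n"], []),
    ("kernel: idx = n[tid + j*A]", ["idx"], ["n"]),
]

chains = def_use_chains(statements, alias_of)
```

The single chain produced links the kernel's use of `n` to the host assignment to `h_n`, which is the cross-end definition point described above.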
Step 4, code optimization. The analysis results are used to optimize the code and improve the execution performance of the kernel code. Three targeted optimizations that improve performance portability are applied:
4.1) Thread merging optimization. The definition-use chains of variables across different threads, obtained from the data flow analysis above, identify redundant operations among threads. Where code is locally redundant across threads, performance can be improved by selectively merging cf adjacent threads along one dimension and removing the redundant computation, memory accesses or synchronization, provided that parallelism-related performance is not harmed. (Assuming dimension j is chosen and the work-group contains local(0) × local(1) × local(2) work-items, 1 ≤ cf ≤ local(j).) The host code and the kernel code are modified accordingly.
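The effect of merging cf adjacent threads can be simulated in plain Python. The kernel body below (each work-item scales its element by a thread-invariant sum) is a made-up example, and the sketch assumes that cf divides the number of work-items; on the host side the launch would shrink to `num_items // cf` coarse work-items.

```python
def run_fine_grained(data, num_items):
    """One work-item per element; each recomputes the shared value."""
    out, shared_evals = [], 0
    for tid in range(num_items):
        scale = sum(data)          # thread-invariant, so redundant per thread
        shared_evals += 1
        out.append(data[tid] * scale)
    return out, shared_evals


def run_coarsened(data, num_items, cf):
    """Merge cf adjacent work-items; the shared value is computed once
    per coarse-grained thread instead of once per original work-item."""
    assert num_items % cf == 0, "sketch assumes cf divides the item count"
    out, shared_evals = [], 0
    for coarse_tid in range(num_items // cf):
        scale = sum(data)
        shared_evals += 1
        for k in range(cf):
            tid = coarse_tid * cf + k
            out.append(data[tid] * scale)
    return out, shared_evals


data = [1, 2, 3, 4]
fine_out, fine_evals = run_fine_grained(data, 4)
coarse_out, coarse_evals = run_coarsened(data, 4, cf=2)
```

Both versions produce the same output, while the coarsened version evaluates the shared value half as often.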
4.2) Data layout optimization. As described in 2.2), acceleration devices fall roughly into two types by execution style: serialized execution and data-parallel execution. Accordingly, one of two data layouts is selected based on the device characteristics: intra-thread contiguous (suited to serialized execution) or inter-thread contiguous (suited to data-parallel execution). The definitions and uses of the relevant arrays or variables in the host-side code and the Kernel code are then modified accordingly, using information obtained from the data flow analysis, including alias relations and definition-use chains between variables of the host-side code and the Kernel code.
Still taking the code of fig. 3 as an example: when targeting an acceleration device with data-parallel execution, the Kernel code should use the inter-thread contiguous data layout. The use of n (in the Kernel code) in the source code, the statement idx = n[tid + j * nA], is already contiguous across threads, so the original data layout need not change. When targeting an acceleration device with serialized execution, the Kernel code should use the intra-thread contiguous data layout, so n (in the Kernel code) requires data layout optimization: the use statement of n is changed to idx = n[tid * N + j], and, for program correctness, the actual definition statement of n is changed correspondingly from h_n[i + j * nA] = neighborIter[i][j] to h_n[i * N + j] = neighborIter[i][j].
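The two layouts differ only in how a (thread, element) pair is mapped to a flat array index. The following sketch shows both index functions and a conversion between them. The names are assumptions for illustration: `num_threads` plays the role of the stride in the inter-thread layout, and `per_thread` plays the role of N, the number of elements each thread owns.

```python
def index_inter_thread(tid, j, num_threads):
    """Inter-thread contiguous: adjacent threads touch adjacent elements."""
    return tid + j * num_threads


def index_intra_thread(tid, j, per_thread):
    """Intra-thread contiguous: one thread's elements are adjacent."""
    return tid * per_thread + j


def to_intra_thread_layout(buf, num_threads, per_thread):
    """Re-lay out a buffer from inter-thread to intra-thread contiguous order."""
    out = [None] * len(buf)
    for tid in range(num_threads):
        for j in range(per_thread):
            src = index_inter_thread(tid, j, num_threads)
            dst = index_intra_thread(tid, j, per_thread)
            out[dst] = buf[src]
    return out
```

Because both the definition on the host side and the use in the kernel must agree on one index function, changing the layout requires changing both statements together, which is why the cross-boundary definition-use information is needed.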
4.3) Aggressive vectorization optimization. In actual execution, the Kernel code is instantiated into many threads that execute concurrently, so from a vectorization standpoint, vectorization opportunities exist both across threads and within a thread. According to the SIMD instruction width of the specific hardware, the Kernel code is automatically vectorized first across threads and then within threads. The code transformation changes the definition and use statements of the relevant variables, and it depends on the precise definition-use chains and live-range (active period) analysis results from the data flow analysis.
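The "across threads first, then within threads" ordering can be sketched as a lane-planning step followed by a simulated vector operation. Everything here is an assumption for illustration: the patent does not give a lane-assignment rule, so `plan_lanes` is a hypothetical heuristic, and `vector_add` only simulates SIMD execution with plain Python lists.

```python
def plan_lanes(simd_width, num_work_items, inner_iterations):
    """Split SIMD lanes between work-items and per-work-item loop iterations.

    Hypothetical heuristic: fill lanes across work-items first, then use any
    remaining width for iterations inside a single work-item.
    Assumes num_work_items >= 1.
    """
    inter = min(simd_width, num_work_items)
    intra = min(simd_width // inter, inner_iterations)
    return inter, intra


def vector_add(a, b, simd_width):
    """Simulate inter-thread vectorization of out[tid] = a[tid] + b[tid]."""
    out = []
    for base in range(0, len(a), simd_width):
        # One simulated vector instruction covers up to simd_width work-items.
        lanes_a = a[base:base + simd_width]
        lanes_b = b[base:base + simd_width]
        out.extend(x + y for x, y in zip(lanes_a, lanes_b))
    return out
```

With a width of 8 and many Work-items, all lanes go to different Work-items; with only 2 Work-items, the remaining width is spent on 4 iterations inside each one.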
Step 5, code generation and subsequent compilation. The host code and the kernel code are separated from the optimized fusion AST, and after translation by the compiler the optimized host code and kernel code (that is, the optimized OpenCL program source code) are output. This code can then be compiled by calling a native compiler and run, following the conventional OpenCL compilation flow.
The following are system embodiments corresponding to the above method embodiments; they can be implemented in cooperation with the above embodiments. The technical details mentioned in the above embodiments remain valid in this embodiment and are not repeated here. Conversely, the technical details mentioned in this embodiment can also be applied to the above embodiments.
The invention also discloses an OpenCL compiler design system based on the code fusion compilation framework, comprising:
module 1, which obtains an OpenCL source program, compiles the host-side code in the source program into a host abstract syntax tree, obtains the kernel code file of the kernel launch function in that abstract syntax tree, compiles the kernel code file to obtain a kernel abstract syntax tree, stores the kernel abstract syntax tree in shared memory, and retrieves and reconstructs all kernel abstract syntax trees from the shared memory to obtain a fusion abstract syntax tree that fuses the host abstract syntax tree and the kernel abstract syntax tree;
module 2, which obtains the respective control flow graphs of the host abstract syntax tree and the kernel abstract syntax tree based on the fusion abstract syntax tree, adds control flow edges connecting the instructions at the function call side and the function return side to obtain an inlined control flow graph, obtains the execution order of instructions in the work-items of the kernel on the corresponding target platform according to a WII function reflecting the target platform characteristics, and depicts that execution order in the inlined control flow graph to obtain a WII-CFG graph;
module 3, which obtains the correspondence between host-side variables and kernel variables as a first analysis result by analyzing the function argument passing of the kernel code and the parameters of the OpenCL API function calls that transfer data between the host side and the device side, and performs data flow analysis on the WII-CFG graph to obtain a second analysis result;
module 4, which optimizes the kernel code in the fusion abstract syntax tree according to the first analysis result and the second analysis result to obtain an optimized abstract syntax tree;
and module 5, which inputs the optimized abstract syntax tree into a compiler and, after translation, outputs the optimized host code and kernel code as the compilation result.
In the OpenCL compiler design system based on the code fusion compilation framework, module 2 includes: obtaining the WII function of the target platform according to the thread execution mode of the target platform of the kernel code, the WII function being used to calculate the execution order of the instructions of the work-items in the kernel on the target platform.
In the OpenCL compiler design system based on the code fusion compilation framework, module 3 specifically includes:
analyzing the correspondence between the actual arguments passed in and the formal parameters of the kernel function, together with the parameters of the OpenCL API function calls that transfer data between the host side and the device side, to obtain the correspondence between host-side variables and kernel variables as the first analysis result; and performing data flow analysis on the WII-CFG graph to obtain the second analysis result, which includes the definition-use chains and active periods of the different variables of the host-side code and the kernel code.
In the OpenCL compiler design system based on the code fusion compilation framework, the optimization in module 4 specifically includes:
a thread merging module, which identifies redundant operations among threads according to the definition-use chains in the second analysis result and merges the threads executing the redundant operations into one coarse-grained thread, so as to reduce code redundancy among threads; a data layout module, which selects either the intra-thread contiguous or the inter-thread contiguous layout according to the first analysis result, the definition-use chains in the second analysis result, and the thread organization execution mode of the target platform, and performs the code transformation; and a vectorization module, which vectorizes inter-thread and intra-thread code according to the active periods and definition-use chains in the second analysis result.
The OpenCL compiler design system based on the code fusion compilation framework further includes: module 6, which calls a native compiler to compile the compilation result according to the OpenCL compilation flow and then runs it.
The technical effects of the invention comprise:
1. Host-kernel code fusion compilation framework. In an OpenCL program, the definition and use of arrays or variables often extend beyond the scope of the kernel code, and the host code also specifies the organization parameters of the Work-Groups (i.e., how many Work-items each contains). Deep analysis and optimization of OpenCL programs therefore requires a fusion compilation framework that covers both the host-side code and the kernel code.
Technical effect: in the analysis stage of the compiler, the intermediate representation of the host-side code and that of the Kernel code are available at the same time, so both can be analyzed together.
2. Fused control flow graph WII-CFG. The hardware architecture and the runtime thread organization and execution mode vary across acceleration devices, so the order in which instructions from different threads (i.e., Work-items) execute differs from device to device. For the target acceleration device, a corresponding WII function is obtained and used to calculate the execution order of instructions in the threads; the WII-CFG graph then represents the host code CFG and the Kernel code CFG together with the instruction execution order of the different threads.
Technical effect: the WII-CFG serves as infrastructure for analyzing the inter-thread execution behavior of code on different acceleration devices. By extending the traditional CFG, it simultaneously represents the host-side code CFG, the Kernel code CFG, and the instruction execution order of the different thread instances of the Kernel, which captures the runtime characteristics of the acceleration device and helps uncover inter-thread optimization opportunities.
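To make the role of a WII function more concrete, the sketch below computes an execution order of (work-item, instruction) pairs for the two device styles distinguished earlier, serialized and data-parallel, and turns that order into edges that could be overlaid on a control flow graph. The patent does not give the WII function's definition at this point, so the function names and the two ordering rules are assumptions for illustration only.

```python
def wii_serial(num_items, num_instrs):
    """Serialized device: each work-item runs all of its instructions in turn."""
    return [(t, i) for t in range(num_items) for i in range(num_instrs)]


def wii_lockstep(num_items, num_instrs):
    """Data-parallel device: all work-items issue instruction i together."""
    return [(t, i) for i in range(num_instrs) for t in range(num_items)]


def order_edges(schedule):
    """Turn a schedule into successor edges between consecutive entries."""
    return list(zip(schedule, schedule[1:]))


# Two work-items, two instructions each, on a lockstep device.
lockstep_edges = order_edges(wii_lockstep(2, 2))
```

In the lockstep case, instruction 0 of work-item 1 directly follows instruction 0 of work-item 0, so an analysis walking these edges sees accesses from adjacent work-items next to each other; in the serial case it sees one work-item's accesses grouped together.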
3. Host-kernel code joint data flow analysis. This extends traditional data flow analysis in two respects: 1) the argument passing and the parameters of the OpenCL data transfer API calls are analyzed to obtain the correspondence between host code variables and device-side code variables; 2) joint host-side/device-side data flow analysis is performed on the WII-CFG, including alias relations among variables in the code of the two sides and among variables of different threads, definition-use chains, active period analysis, and so on. This facilitates inter-thread optimization.
Technical effect: data flow analysis can cross the boundary of the host code or of the Kernel code, variable definition-use analysis can be performed on multi-threaded code, and optimizations involving data and computation across threads become easier to carry out.
Although the present invention has been described with reference to the above embodiments, the embodiments are merely illustrative and not restrictive, and various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention.

Claims (10)

1. An OpenCL compiler design method based on a code fusion compilation framework, characterized by comprising the following steps:
step 1, obtaining an OpenCL source program, compiling the host-side code in the source program into a host abstract syntax tree, obtaining the kernel code file of the kernel launch function in that abstract syntax tree, compiling the kernel code file to obtain a kernel abstract syntax tree, storing the kernel abstract syntax tree in shared memory, and retrieving and reconstructing all kernel abstract syntax trees from the shared memory to obtain a fusion abstract syntax tree that fuses the host abstract syntax tree and the kernel abstract syntax tree;
step 2, obtaining the respective control flow graphs of the host abstract syntax tree and the kernel abstract syntax tree based on the fusion abstract syntax tree, adding control flow edges connecting the instructions at the function call side and the function return side to obtain an inlined control flow graph, obtaining the execution order of instructions in the work-items of the kernel on the corresponding target platform according to a WII function reflecting the target platform characteristics, and depicting that execution order in the inlined control flow graph to obtain a WII-CFG graph;
step 3, obtaining the correspondence between host-side variables and kernel variables as a first analysis result by analyzing the function argument passing of the kernel code and the parameters of the OpenCL API function calls that transfer data between the host side and the device side, and performing data flow analysis on the WII-CFG graph to obtain a second analysis result;
step 4, optimizing the kernel code in the fusion abstract syntax tree according to the first analysis result and the second analysis result to obtain an optimized abstract syntax tree;
and step 5, translating the optimized abstract syntax tree with a compiler and outputting the optimized host code and kernel code as the compilation result.
2. The OpenCL compiler design method based on a code fusion compilation framework of claim 1, wherein step 2 comprises: obtaining the WII function of the target platform according to the thread execution mode of the target platform of the kernel code, the WII function being used to calculate the execution order of the instructions of the work-items in the kernel on the target platform.
3. The OpenCL compiler design method according to claim 1 or 2, wherein step 3 specifically includes:
analyzing the correspondence between the actual arguments passed in and the formal parameters of the kernel function, together with the parameters of the OpenCL API function calls that transfer data between the host side and the device side, to obtain the correspondence between host-side variables and kernel variables as the first analysis result; and performing data flow analysis on the WII-CFG graph to obtain the second analysis result, which includes the definition-use chains and active periods of the different variables of the host-side code and the kernel code.
4. The OpenCL compiler design method based on the code fusion compilation framework as claimed in claim 3, wherein the optimizing in step 4 specifically includes:
a thread merging step of identifying redundant operations among threads according to the definition-use chains in the second analysis result, and merging the threads executing the redundant operations into one coarse-grained thread so as to reduce code redundancy among threads;
a data layout step of selecting either the intra-thread contiguous or the inter-thread contiguous layout according to the first analysis result, the definition-use chains in the second analysis result, and the thread organization execution mode of the target platform, and performing the code transformation;
and a vectorization step of vectorizing inter-thread and intra-thread code according to the active periods and definition-use chains in the second analysis result.
5. The OpenCL compiler design method based on a code fusion compilation framework of claim 1, further comprising: step 6, calling a native compiler to compile the compilation result according to the OpenCL compilation flow and then running it.
6. An OpenCL compiler design system based on a code fusion compilation framework, comprising:
module 1, which obtains an OpenCL source program, compiles the host-side code in the source program into a host abstract syntax tree, obtains the kernel code file of the kernel launch function in that abstract syntax tree, compiles the kernel code file to obtain a kernel abstract syntax tree, stores the kernel abstract syntax tree in shared memory, and retrieves and reconstructs all kernel abstract syntax trees from the shared memory to obtain a fusion abstract syntax tree that fuses the host abstract syntax tree and the kernel abstract syntax tree;
module 2, which obtains the respective control flow graphs of the host abstract syntax tree and the kernel abstract syntax tree based on the fusion abstract syntax tree, adds control flow edges connecting the instructions at the function call side and the function return side to obtain an inlined control flow graph, obtains the execution order of instructions in the work-items of the kernel on the corresponding target platform according to a WII function reflecting the target platform characteristics, and depicts that execution order in the inlined control flow graph to obtain a WII-CFG graph;
module 3, which obtains the correspondence between host-side variables and kernel variables as a first analysis result by analyzing the function argument passing of the kernel code and the parameters of the OpenCL API function calls that transfer data between the host side and the device side, and performs data flow analysis on the WII-CFG graph to obtain a second analysis result;
module 4, which optimizes the kernel code in the fusion abstract syntax tree according to the first analysis result and the second analysis result to obtain an optimized abstract syntax tree;
and module 5, which inputs the optimized abstract syntax tree into a compiler and, after translation, outputs the optimized host code and kernel code as the compilation result.
7. The OpenCL compiler design system based on a code fusion compilation framework of claim 6, wherein module 2 comprises: obtaining the WII function of the target platform according to the thread execution mode of the target platform of the kernel code, the WII function being used to calculate the execution order of the instructions of the work-items in the kernel on the target platform.
8. The OpenCL compiler design system according to claim 6 or 7, wherein module 3 specifically includes:
analyzing the correspondence between the actual arguments passed in and the formal parameters of the kernel function, together with the parameters of the OpenCL API function calls that transfer data between the host side and the device side, to obtain the correspondence between host-side variables and kernel variables as the first analysis result; and performing data flow analysis on the WII-CFG graph to obtain the second analysis result, which includes the definition-use chains and active periods of the different variables of the host-side code and the kernel code.
9. The OpenCL compiler design system based on a code fusion compilation framework of claim 8, wherein the optimization in module 4 specifically includes:
a thread merging module, which identifies redundant operations among threads according to the definition-use chains in the second analysis result and merges the threads executing the redundant operations into one coarse-grained thread, so as to reduce code redundancy among threads;
a data layout module, which selects either the intra-thread contiguous or the inter-thread contiguous layout according to the first analysis result, the definition-use chains in the second analysis result, and the thread organization execution mode of the target platform, and performs the code transformation;
and a vectorization module, which vectorizes inter-thread and intra-thread code according to the active periods and definition-use chains in the second analysis result.
10. The OpenCL compiler design system based on a code fusion compilation framework of claim 6, further comprising: module 6, which calls a native compiler to compile the compilation result according to the OpenCL compilation flow and then runs it.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910106880.3A CN109933327B (en) 2019-02-02 2019-02-02 OpenCL compiler design method and system based on code fusion compiling framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910106880.3A CN109933327B (en) 2019-02-02 2019-02-02 OpenCL compiler design method and system based on code fusion compiling framework

Publications (2)

Publication Number Publication Date
CN109933327A CN109933327A (en) 2019-06-25
CN109933327B true CN109933327B (en) 2021-01-08

Family

ID=66985577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910106880.3A Active CN109933327B (en) 2019-02-02 2019-02-02 OpenCL compiler design method and system based on code fusion compiling framework

Country Status (1)

Country Link
CN (1) CN109933327B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527304B (en) * 2019-09-19 2022-10-04 无锡江南计算技术研究所 Self-adaptive node fusion compiling optimization method based on heterogeneous platform
CN112527262B (en) * 2019-09-19 2022-10-04 无锡江南计算技术研究所 Automatic vector optimization method for non-uniform width of deep learning framework compiler
CN112579088A (en) * 2019-09-27 2021-03-30 无锡江南计算技术研究所 Heterogeneous hybrid programming-oriented one-stop program compiling method
CN111966397A (en) * 2020-07-22 2020-11-20 哈尔滨工业大学 Automatic transplanting and optimizing method for heterogeneous parallel programs
CN112083956B (en) * 2020-09-15 2022-12-09 哈尔滨工业大学 Heterogeneous platform-oriented automatic management system for complex pointer data structure
CN116185426B (en) * 2023-04-17 2023-09-19 北京大学 Compiling optimization method and system based on code fusion and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360306A (en) * 2011-10-19 2012-02-22 上海交通大学 Method for extracting and optimizing information of cyclic data flow charts in high-level language codes
CN103677952A (en) * 2013-12-18 2014-03-26 华为技术有限公司 Coder decoder generating device and method
CN104036141A (en) * 2014-06-16 2014-09-10 上海大学 Open computing language (OpenCL)-based red-black tree acceleration algorithm
CN106843993A (en) * 2016-12-26 2017-06-13 中国科学院计算技术研究所 A kind of method and system of resolving inversely GPU instructions
CN106940815A (en) * 2017-02-13 2017-07-11 西安交通大学 A kind of programmable convolutional neural networks Crypto Coprocessor IP Core
CN109032572A (en) * 2017-06-08 2018-12-18 阿里巴巴集团控股有限公司 A method of the JAVA program technic based on bytecode is inline

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2438545A2 (en) * 2009-06-02 2012-04-11 Vector Fabrics B.V. Improvements in embedded system development
CN104820613B (en) * 2015-05-27 2018-03-27 北京思朗科技有限责任公司 A kind of Compilation Method of heterogeneous polynuclear program

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
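OpenCL as a unified programming model for heterogeneous CPU/GPU clusters (Conference Paper); Kim, J et al.; ACM SIGPLAN Notices; 20121231; Vol. 47 (No. 8); main text pp. 299-300 *
pocl: A Performance-Portable OpenCL Implementation; Jaaskelainen et al.; International Journal of Parallel Programming; 20151231; Vol. 43 (No. 5); main text pp. 752-785 *
Research and progress on heterogeneous parallel programming models (异构并行编程模型研究与进展); 刘颖 et al.; Journal of Software (软件学报); 20141231; Vol. 25 (No. 7); main text pp. 1459-1475 *
Multi-platform data layout optimization based on relaxed reuse distance under heterogeneous architectures (异构架构下基于放松重用距离的多平台数据布局优化); 刘颖 et al.; Journal of Software (软件学报); 20161231; Vol. 27 (No. 8); main text pp. 2168-2184 *
MapReduce programming environment on heterogeneous clusters (异构集群下的MapReduce编程环境); 吴承勇 et al.; Science and Technology Innovation Herald (科技创新导报); 20161231; Vol. 13 (No. 9); main text p. 170 *
OpenCL compilation system for the domestic heterogeneous many-core processor of Sunway TaihuLight (面向神威·太湖之光的国产异构众核处理器OpenCL编译系统); 伍明川 et al.; Chinese Journal of Computers (计算机学报); 20181231; Vol. 41 (No. 10); main text pp. 2236-2250 *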

Also Published As

Publication number Publication date
CN109933327A (en) 2019-06-25

Similar Documents

Publication Publication Date Title
CN109933327B (en) OpenCL compiler design method and system based on code fusion compiling framework
Martinez et al. CU2CL: A CUDA-to-OpenCL translator for multi-and many-core architectures
Moldovan et al. AG: Imperative-style Coding with Graph-based Performance
Ziogas et al. Productivity, portability, performance: Data-centric Python
US20130326204A1 (en) Configuration-Preserving Preprocessor and Configuration-Preserving Parser
Viñas et al. Exploiting heterogeneous parallelism with the Heterogeneous Programming Library
Weber et al. MATOG: array layout auto-tuning for CUDA
Vinas et al. Improving OpenCL programmability with the heterogeneous programming library
Mendonça et al. Automatic insertion of copy annotation in data-parallel programs
Cedersjö et al. Tÿcho: A framework for compiling stream programs
US8762974B1 (en) Context-sensitive compiler directives
Lueh et al. C-for-metal: High performance simd programming on intel gpus
Acosta et al. Towards a Unified Heterogeneous Development Model in Android TM
Saà-Garriga et al. OMP2MPI: Automatic MPI code generation from OpenMP programs
Gardner et al. Characterizing the challenges and evaluating the efficacy of a CUDA-to-OpenCL translator
Tiotto et al. Experiences building an mlir-based sycl compiler
CN116861359A (en) Operator fusion method and system for deep learning reasoning task compiler
Nguyen et al. Retargetable optimizing compilers for quantum accelerators via a multilevel intermediate representation
Lin et al. Enable OpenCL compiler with Open64 infrastructures
Acosta et al. Performance analysis of paralldroid generated programs
Bispo et al. Challenges and Opportunities in C/C++ Source-To-Source Compilation
Acosta et al. Paralldroid: Performance analysis of gpu executions
Benoit et al. Using an intermediate representation to map workloads on heterogeneous parallel systems
Athrij Vectorizing Memory Access on HammerBlade Architecture
Agostini et al. AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231226

Address after: Room 1305, 13th Floor, No.1 Zhongguancun Street, Haidian District, Beijing, 100086

Patentee after: Zhongke Jiahe (Beijing) Technology Co.,Ltd.

Address before: 100080 No. 6 South Road, Zhongguancun Academy of Sciences, Beijing, Haidian District

Patentee before: Institute of Computing Technology, Chinese Academy of Sciences