CN109933327A - OpenCL compiler method and system based on code fusion compiler framework - Google Patents

Info

Publication number
CN109933327A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910106880.3A
Other languages
Chinese (zh)
Other versions
CN109933327B (en)
Inventor
刘颖
黄磊
伍明川
崔慧敏
冯晓兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Jiahe Beijing Technology Co ltd
Original Assignee
Institute of Computing Technology of CAS
Application filed by Institute of Computing Technology of CAS
Priority to CN201910106880.3A
Publication of CN109933327A
Application granted
Publication of CN109933327B
Legal status: Active

Abstract

The present invention relates to an OpenCL compilation method and system based on a code-fusion compilation framework, comprising: a host-kernel code fusion compilation framework based on shared memory, which fuses the code of the two ends at the AST level of the compiler's intermediate representation; a WII-CFG graph, which characterizes the instruction execution behavior among threads after the Kernel code is instantiated into many threads, that is, the platform-sensitive execution behavior of the program within a work-group; a joint host-kernel data-flow analysis, which uncovers data-flow relations that cross the host side and the kernel side as well as data-flow relations between threads, so as to analyze the data dependences between the code of the two ends; and targeted code optimizations based on these analyses, followed by generation of assembly code, which ends the compilation process. The invention analyzes host-side code and Kernel code together and fully exploits cross-thread optimization opportunities, so that OpenCL programs obtain good performance portability across different acceleration devices.

Description

OpenCL compiler method and system based on code fusion compiler framework
Technical field
The present invention relates to the field of compiler research and optimization technology, and in particular to a compilation framework design and a compilation method and system for the OpenCL language and heterogeneous platforms.
Background art
Heterogeneous architectures have become mainstream in recent years. This is reflected both in the global TOP500 supercomputer list, where the top three systems are all heterogeneous platforms and more than 100 systems are heterogeneous, and in the "processor core + acceleration device" architecture that is widespread in servers, PCs, and terminal devices. A heterogeneous computing system usually consists of a CPU and one or more acceleration devices interconnected on a chip or a mainboard; the CPU handles complex control and scheduling, while the acceleration devices handle massively parallel computation or domain-specific computing tasks. As for heterogeneous parallel programming models, CUDA, released by NVIDIA, and OpenCL, published by the Khronos Group, are the two mainstream models today. The latter is a cross-platform parallel programming model applicable to many kinds of acceleration devices and therefore has a wider range of application than the former.
Under a heterogeneous parallel computing framework, the heterogeneity of the target code is a major challenge for compilation tool design. The code of a heterogeneous program is divided into a host (Host) part and a device (Device) part, which run on the CPU and on the acceleration device respectively. Functionally, the former is responsible for data initialization, data exchange, and control of the acceleration device, and the latter is responsible for executing the core computation in parallel. The two parts therefore target different platforms, and their optimization goals also differ. Current heterogeneous parallel programs are compiled separately: code that runs on different devices is compiled and optimized independently. Under separate compilation, the compilation tools for different devices are independent of one another and can each generate well-optimized code for their device. Most successful commercial compilation systems, such as the NVIDIA CUDA compilation system nvcc (the NVIDIA CUDA compiler driver) and the AMD OpenCL compilation/runtime framework, are designed around this separate compilation approach.
However, separate compilation ignores the relationships between the heterogeneous code parts. In an OpenCL program, for example, the host code and the kernel code are compiled completely independently and share no compile-time information. In fact, the host code controls the execution of the kernel code by calling the OpenCL API and thereby interacts with the acceleration device. Because the kernel code is compiled in complete isolation, the compiler cannot know relevant information from the host code, such as the arguments passed in, the layout of arrays, and work-group (workgroup) information. This limits the optimization opportunities for the kernel code and hurts the quality of the generated code. For code compilation and optimization under a heterogeneous parallel computing framework, "separation" versus "fusion" is a question the compiler must always consider. On the one hand, the final code runs on a heterogeneous platform, so the code of the different ends must be compiled separately and supplemented with additional, complex mechanisms (including a linking mechanism, a runtime mechanism, and so on). On the other hand, the code of the different ends is interrelated, and this information is needed to optimize the code in depth. From the standpoint of deep code optimization, fusion compilation is highly necessary.
To optimize heterogeneous code in depth and to improve OpenCL performance portability, this method provides an OpenCL compilation method that compiles host and kernel code together and emits the optimized program code by source-to-source transformation. It aims to compile the host code and the kernel code jointly so as to enable whole-program analysis and optimization and to exploit optimization opportunities within and between threads. At the same time, it addresses the poor performance portability of OpenCL programs by giving programs good cross-platform performance portability. Unlike previous work, it proposes a host-kernel code fusion compilation framework and its construction method, and on this basis proposes two pieces of compiler infrastructure: the platform-dependent WII-CFG graph, which models the execution order of work-items, and the joint host-kernel data-flow analysis. Both guide targeted optimization of the kernel code. The compiler design involved in this method has four main parts: (1) a host-kernel code fusion compilation framework based on shared memory, which fuses the code of the two ends at the AST level of the compiler's intermediate representation; (2) the WII-CFG graph (Work-Item Interleaving CFG), which characterizes the instruction execution behavior among threads after the Kernel code is instantiated into many threads, that is, the platform-sensitive execution behavior of the program within a work-group; (3) the joint host-kernel data-flow analysis, which uncovers data-flow relations that cross the host side and the kernel side as well as data-flow relations between threads, so as to analyze the data dependences between the host-side code and the Kernel code; (4) targeted code optimizations based on these analyses, followed by generation of assembly code, which ends the compilation process.
Summary of the invention
For OpenCL programs, poor performance portability is a widely recognized problem. We therefore propose a compilation method based on a host-kernel code fusion compilation framework, which includes two pieces of compiler infrastructure, the WII-CFG graph and the joint host-kernel data-flow analysis, and aims to give OpenCL programs a basis for deep optimization and good performance portability. To analyze optimization opportunities between threads (Work-Items), the method performs its analysis and optimization on the threads within one work-group (Work-Group).
Specifically, the invention discloses an OpenCL compilation method based on a code-fusion compilation framework, comprising:
Step 1: obtain an OpenCL source program; compile the host-side code in the source program into a host abstract syntax tree; obtain the kernel code file named by the kernel launch function in that abstract syntax tree; compile the kernel code file to obtain a kernel abstract syntax tree and store it in shared memory; fetch and reconstruct all kernel abstract syntax trees from the shared memory; and obtain a fused abstract syntax tree that merges the host abstract syntax tree and the kernel abstract syntax tree;
Step 2: based on the fused abstract syntax tree, obtain the control flow graphs of the host abstract syntax tree and of the kernel abstract syntax tree; add function-call edges and function-return edges to connect the two control flow graphs, obtaining an inlined control flow graph; according to the WII function that reflects the characteristics of the target platform, obtain the execution order on that target platform of the instructions in the kernel's work-items; and record this execution order on the inlined control flow graph, obtaining the WII-CFG graph;
Step 3: by analyzing the argument passing of the kernel function and the parameters of the OpenCL API function calls that transfer data between the host side and the device side, obtain the correspondence between host-side variables and kernel variables as a first analysis result; and perform data-flow analysis on the WII-CFG graph, obtaining a second analysis result;
Step 4: according to the first analysis result and the second analysis result, optimize the kernel code in the fused abstract syntax tree, obtaining an optimized abstract syntax tree;
Step 5: translate the optimized abstract syntax tree with the compiler and output the optimized host code and kernel code as the compilation result.
In the OpenCL compilation method based on a code-fusion compilation framework, step 2 includes: according to the thread execution mode of the kernel code on the target platform, obtaining the WII function of the target platform, which is used to compute the execution order on the target platform of the instructions in the kernel's work-items.
In the OpenCL compilation method based on a code-fusion compilation framework, step 3 specifically includes:
analyzing the correspondence between the actual arguments passed to the kernel function and its formal parameters, together with the parameters of the OpenCL API function calls that transfer data between the host side and the device side, to obtain the correspondence between host-side variables and kernel variables as the first analysis result; and performing data-flow analysis on the WII-CFG graph to obtain the second analysis result, which includes the definition-use chains and live ranges of the variables of the host-side code and the kernel code.
In the OpenCL compilation method based on a code-fusion compilation framework, the optimization in step 4 specifically includes:
a thread merging step: according to the definition-use chains in the second analysis result, identify redundant operations across threads, and merge the threads that execute the redundant operations into one coarse-grained thread, so as to reduce cross-thread code redundancy;
a data layout step: according to the first analysis result, the definition-use chains in the second analysis result, and the thread organization and execution mode of the target platform, choose the better of two layouts, contiguous within a thread or contiguous across threads, and apply the corresponding code transformation;
a vectorization step: according to the live ranges and definition-use chains in the second analysis result, vectorize the code across threads and within threads.
The OpenCL compilation method based on a code-fusion compilation framework further includes: step 6, compiling the compilation result with the native compiler according to the usual OpenCL compilation flow and then running it.
The invention also discloses an OpenCL compilation system based on a code-fusion compilation framework, comprising:
Module 1: obtains an OpenCL source program; compiles the host-side code in the source program into a host abstract syntax tree; obtains the kernel code file named by the kernel launch function in that abstract syntax tree; compiles the kernel code file to obtain a kernel abstract syntax tree and stores it in shared memory; fetches and reconstructs all kernel abstract syntax trees from the shared memory; and obtains a fused abstract syntax tree that merges the host abstract syntax tree and the kernel abstract syntax tree;
Module 2: based on the fused abstract syntax tree, obtains the control flow graphs of the host abstract syntax tree and of the kernel abstract syntax tree; adds function-call edges and function-return edges to connect the two control flow graphs, obtaining an inlined control flow graph; according to the WII function that reflects the characteristics of the target platform, obtains the execution order on that target platform of the instructions in the kernel's work-items; and records this execution order on the inlined control flow graph, obtaining the WII-CFG graph;
Module 3: by analyzing the argument passing of the kernel function and the parameters of the OpenCL API function calls that transfer data between the host side and the device side, obtains the correspondence between host-side variables and kernel variables as a first analysis result; and performs data-flow analysis on the WII-CFG graph, obtaining a second analysis result;
Module 4: according to the first analysis result and the second analysis result, optimizes the kernel code in the fused abstract syntax tree, obtaining an optimized abstract syntax tree;
Module 5: translates the optimized abstract syntax tree with the compiler and outputs the optimized host code and kernel code as the compilation result.
In the OpenCL compilation system based on a code-fusion compilation framework, module 2 includes: according to the thread execution mode of the kernel code on the target platform, obtaining the WII function of the target platform, which is used to compute the execution order on the target platform of the instructions in the kernel's work-items.
In the OpenCL compilation system based on a code-fusion compilation framework, module 3 specifically includes:
analyzing the correspondence between the actual arguments passed to the kernel function and its formal parameters, together with the parameters of the OpenCL API function calls that transfer data between the host side and the device side, to obtain the correspondence between host-side variables and kernel variables as the first analysis result; and performing data-flow analysis on the WII-CFG graph to obtain the second analysis result, which includes the definition-use chains and live ranges of the variables of the host-side code and the kernel code.
In the OpenCL compilation system based on a code-fusion compilation framework, the optimization in module 4 specifically includes:
a thread merging module: according to the definition-use chains in the second analysis result, identifies redundant operations across threads, and merges the threads that execute the redundant operations into one coarse-grained thread, so as to reduce cross-thread code redundancy;
a data layout module: according to the first analysis result, the definition-use chains, and the thread organization and execution mode of the target platform, chooses the better of two layouts, contiguous within a thread or contiguous across threads, and applies the corresponding code transformation;
a vectorization module: according to the live ranges and definition-use chains in the second analysis result, vectorizes the code across threads and within threads.
The OpenCL compilation system based on a code-fusion compilation framework further includes: module 6, which compiles the compilation result with the native compiler according to the usual OpenCL compilation flow and then runs it.
The technical effects of the invention include:
The OpenCL compilation method of the invention covers an improved compilation framework, extended analysis techniques, and targeted optimizations. For different acceleration devices, it analyzes host-side code and Kernel code together and fully exploits cross-thread optimization opportunities, so that OpenCL programs obtain good performance portability.
Brief description of the drawings
Fig. 1 shows the WII functions of each platform;
Fig. 2 shows the WII-CFG graph;
Fig. 3 shows the variable correspondence between host-side code and Kernel code;
Fig. 4 is a flow chart of the compilation process.
Detailed description of the embodiments
To solve the above technical problems, embodiments of the invention include:
A. Host-kernel code fusion: First, the compiler turns the host-side code into its intermediate representation, the abstract syntax tree (AST), here called HostAST. Then the AST is traversed; when a kernel launch function (such as the clCreateProgramWithSource function) is encountered, the kernel code file name is determined, and a subprocess is started that calls the compiler to compile the kernel code file, obtains the KernelAST, stores it in shared memory, and terminates. Finally, the ASTs of all kernel code are fetched from the shared memory buffer and reconstructed, so that HostAST and KernelAST are fused.
B. Control-flow analysis based on the WII-CFG graph: the goal is to build, for a specific target platform (that of the Kernel code), the WII-CFG graph of the fused code, which provides the basis for the subsequent data-flow analysis and code optimization. First, an inlined control flow graph (CFG) is built from the fused AST. It contains the CFGs of HostAST (the host abstract syntax tree) and KernelAST (the kernel abstract syntax tree), each built in the traditional way, with call edges (function-call edges) and return edges (function-return edges) added to connect the two. Then, according to how threads of the Kernel code execute on the target platform, that is, whether the threads of a work-group (WorkGroup) execute one after another in serialized fashion or several threads execute in parallel, the WII (Work-Item Interleaving) function of the platform is obtained; it computes the position in the execution order on that platform of a given instruction in a given work-item (Work-Item) of the kernel. Finally, the CFG is refined according to the WII function so that it captures the execution order of the Kernel instructions, which yields the WII-CFG graph.
C. Joint data-flow analysis: First, the data dependences (that is, the data correspondences) between host-side code and Kernel code are analyzed. By analyzing the correspondence between the actual arguments passed to the Kernel function and its formal parameters, and by analyzing the parameters of the data transfers involving those arguments (the relevant OpenCL API function calls, such as clEnqueueWriteBuffer and clEnqueueReadBuffer), the correspondence between host-side variables and Kernel variables is obtained as the first analysis result (treated as an alias relation in this invention). Second, traditional data-flow analysis is applied to the WII-CFG graph to analyze host-side and device-side code jointly, including alias relations, definition-use chains, and live-range analysis for variables of the host-side code and the Kernel code and for variables of different threads.
D. Code optimization: the results of the preceding analyses are used to optimize the code and improve Kernel code performance. First, thread merging merges several threads into one coarse-grained thread to reduce cross-thread code redundancy. The definition-use chains of variables across threads, obtained by the data-flow analysis, make it possible to identify redundant operations across threads, which are exactly what thread merging targets. Second, data layout optimization chooses, according to the thread organization and execution mode of the target platform, the better of two data layouts, contiguous within a thread or contiguous across threads, and applies the corresponding code transformation. The alias relations and definition-use chains between host-side and Kernel variables, obtained by the data-flow analysis, guide a legal data-layout transformation. Third, aggressive vectorization vectorizes code both across threads and within threads. Its code transformation changes the statements that define and use the affected variables, and therefore also relies on the precise definition-use chains and live-range results of the data-flow analysis.
E. Code generation and subsequent compilation: the host code and the kernel code are separated out of the optimized fused AST, and our compiler translates them and outputs the optimized host code and kernel code (that is, the optimized OpenCL program source). This code can then be compiled by the native compiler in the usual OpenCL compilation flow to produce a binary, which is then run.
To make the above features and effects of the invention clearer and easier to understand, embodiments are described in detail below with reference to the accompanying drawings.
The overall flow of the invention, shown in Fig. 4, includes:
Step 1: generate the fused AST. An OpenCL source program is input, and fusion compilation produces an AST in which host and kernel code are fused. First, the compiler turns the host-side code into its intermediate representation, the abstract syntax tree AST (HostAST). Then the AST is traversed; when a kernel launch function (such as the clCreateProgramWithSource function) is encountered, the kernel code file name is determined and a shared memory region is allocated for communication between this process and its subprocess. Next, a subprocess is started that calls the compiler to compile the kernel code file, obtains the KernelAST, stores it in shared memory, and terminates. Finally, the ASTs of all kernel code are fetched from the shared memory region; from this point the process can access Host AST and Kernel AST at the same time, and the two ASTs are fused. Shared memory is one form of inter-process communication: it lets two or more processes share a given memory region, which each of them maps into its own address space. Information that one process writes to shared memory can be read by any other process using that shared memory through an ordinary memory read, which realizes communication between the processes.
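The hand-off of the kernel AST through shared memory can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the AST is a plain dictionary, `compile_kernel_to_ast` is a hypothetical stand-in for a real front end, and the child step is called as an ordinary function here, whereas the text describes it running in a separate subprocess that attaches to the region by name.

```python
import pickle
from multiprocessing import shared_memory


def compile_kernel_to_ast(source):
    # Hypothetical stand-in for a compiler front end: one node per non-empty line.
    stmts = [line.strip() for line in source.splitlines() if line.strip()]
    return {"kind": "KernelAST", "stmts": stmts}


def child_store_kernel_ast(shm_name, source):
    # Child side: attach to the region by name, write a length prefix and the
    # serialized AST, then detach.
    blob = pickle.dumps(compile_kernel_to_ast(source))
    shm = shared_memory.SharedMemory(name=shm_name)
    try:
        if len(blob) + 4 > shm.size:
            raise ValueError("shared memory region too small for the AST")
        shm.buf[:4] = len(blob).to_bytes(4, "little")
        shm.buf[4:4 + len(blob)] = blob
    finally:
        shm.close()


def fuse(host_ast, kernel_source, size=1 << 16):
    # Parent side: create the region, let the child fill it, then fetch and
    # reconstruct the kernel AST and attach it to the host AST.
    shm = shared_memory.SharedMemory(create=True, size=size)
    try:
        child_store_kernel_ast(shm.name, kernel_source)
        length = int.from_bytes(bytes(shm.buf[:4]), "little")
        kernel_ast = pickle.loads(bytes(shm.buf[4:4 + length]))
    finally:
        shm.close()
        shm.unlink()
    return {"kind": "FusedAST", "host": host_ast, "kernels": [kernel_ast]}


fused = fuse({"kind": "HostAST", "stmts": ["clCreateProgramWithSource(...)"]},
             "idx = n[tid + j*A];\n\nout[tid] = idx;\n")
```

Serializing the tree and rebuilding it on the parent side corresponds to the "fetch and reconstruct" wording above; a real compiler would need its own serialization format for AST nodes.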
Step 2: control-flow analysis based on the WII-CFG graph. The goal is to build, for a specific target platform (that of the Kernel code), the WII-CFG graph of the fused code, which provides the basis for the subsequent data-flow analysis and code optimization. It includes:
2.1) Build the inlined CFG from the fused AST. The inlined control flow graph (inlined CFG) contains the CFGs of Host AST and Kernel AST, each built in the traditional way. In addition, because the Kernel code is launched (called) by the host-side code, a caller-callee relationship exists between them, so call edges (function-call edges) and return edges (function-return edges) are added to connect Host-CFG and Kernel-CFG.
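As a rough illustration of 2.1), the sketch below joins a host CFG and a kernel CFG with one call edge and one return edge. The node names, the dictionary representation, and the choice to keep the host's own flow edges are assumptions made for the example; the patent does not specify a data structure.

```python
def build_inlined_cfg(host_cfg, kernel_cfg, launch_node, kernel_entry, kernel_exit):
    """Return labeled edges (src, dst, kind) of the inlined CFG.

    host_cfg and kernel_cfg map each node to its list of successors.
    """
    edges = []
    # Keep the traditional intra-procedural edges of both graphs.
    for cfg in (host_cfg, kernel_cfg):
        for src, successors in cfg.items():
            for dst in successors:
                edges.append((src, dst, "flow"))
    # Call edge: from the host node that launches the kernel to the kernel entry.
    edges.append((launch_node, kernel_entry, "call"))
    # Return edge: from the kernel exit to each host node that follows the launch.
    for dst in host_cfg.get(launch_node, []):
        edges.append((kernel_exit, dst, "return"))
    return edges


# Hypothetical host and kernel graphs for a single kernel launch.
host_cfg = {
    "h0:clSetKernelArg": ["h1:clEnqueueNDRangeKernel"],
    "h1:clEnqueueNDRangeKernel": ["h2:clEnqueueReadBuffer"],
    "h2:clEnqueueReadBuffer": [],
}
kernel_cfg = {"k0:entry": ["k1:exit"], "k1:exit": []}

inlined = build_inlined_cfg(host_cfg, kernel_cfg,
                            "h1:clEnqueueNDRangeKernel", "k0:entry", "k1:exit")
```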
2.2) Obtain the WII (Work-Item Interleaving) function according to the characteristics of the target platform; it computes the position in the execution order on that platform of a given instruction in a given Work-Item of the kernel. The execution order of the Work-Items in one Work-Group is platform dependent. The two most common modes are serialized execution and data-parallel execution. In the former, the Work-Items execute one after another (Work-Item 1 starts only after Work-Item 0 has finished); representative platforms are AMD (Advanced Micro Devices) CPUs, the TileGX series many-core chips from Tilera, and the domestic Shenwei many-core chip (SW26010). In the latter, the instructions of several adjacent Work-Items execute in parallel (Work-Item 0 through Work-Item i first execute their respective insn0 in parallel, then their respective insn1, then their respective insn2, and so on); examples are NVIDIA GPU chips in SIMT mode, and Intel CPUs and Xeon Phi chips, on which the instructions of adjacent threads are automatically converted into vector instructions at run time. The details are shown in Fig. 1, where tid denotes the thread number. According to the OpenCL Specification, a thread id has at most three dimensions, tid(0), tid(1), and tid(2); tid is the global thread id computed from tid(0), tid(1), and tid(2).
2.3) Refine the CFG according to the WII function: the Kernel CFG is expanded by simple instantiation, so that each static Kernel instruction is instantiated into per-thread instructions tagged with the thread tid, and the execution order of these thread instructions is marked according to the WII function, which yields the WII-CFG graph. In Fig. 2, (a) is the inlined CFG; (b) is the WII-CFG on a Kernel target platform with serialized execution; (c) is the WII-CFG on a Kernel target platform with data-parallel execution (degree of parallelism 2). Refining the inlined CFG with the instruction execution order given by the WII function thus produces one WII-CFG for serialized platforms and one for data-parallel platforms.
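Steps 2.2) and 2.3) can be illustrated with a small sketch. The two WII functions below are assumptions written for a straight-line kernel, since Fig. 1 is not reproduced here: each maps a (thread, instruction) pair to a time slot, and the helper groups the instantiated instructions by slot to give the order that the WII-CFG would encode.

```python
def wii_serial(tid, insn, n_insn):
    # Serialized execution: thread tid runs all of its instructions before
    # thread tid + 1 starts.
    return tid * n_insn + insn


def wii_data_parallel(tid, insn, n_insn, degree):
    # Data-parallel execution: `degree` adjacent threads execute the same
    # instruction in the same slot, then advance together.
    return (tid // degree) * n_insn + insn


def wii_cfg_order(n_threads, n_insn, wii):
    # Instantiate every static instruction per thread and group the instances
    # by the slot the WII function assigns; slots are returned in time order.
    slots = {}
    for tid in range(n_threads):
        for insn in range(n_insn):
            slots.setdefault(wii(tid, insn), []).append((tid, insn))
    return [slots[s] for s in sorted(slots)]


serial_order = wii_cfg_order(2, 2, lambda t, i: wii_serial(t, i, 2))
parallel_order = wii_cfg_order(4, 2, lambda t, i: wii_data_parallel(t, i, 2, 2))
```

Each entry of the returned list is one step; consecutive steps correspond to ordering edges of the WII-CFG. Kernels with branches or barriers would need a richer model than this slot numbering.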
Step 3: joint data-flow analysis. First, the correspondence between host-side variables and Kernel variables is obtained by analysis; then traditional data-flow analysis is carried out on the WII-CFG graph. Specifically:
3.1) Obtain the correspondence between host-side variables (including array variables or array pointers) and Kernel variables (also called the alias relation here). This is done by analyzing the correspondence between the actual arguments and formal parameters of the Kernel function, and the parameters of the data-transfer OpenCL API function calls, including clEnqueueWriteBuffer(), clEnqueueReadBuffer(), clEnqueueMapBuffer(), and clSetKernelArg(). The analysis focuses on the arguments passed into the Kernel code and thereby identifies the host-side variable that corresponds to each.
As an example, consider the source code shown in Fig. 3(a):
(1) Analyzing the actual arguments and formal parameters of the Kernel function gives the following correspondences:
d_f <-> ker(0th)(=f); d_p <-> ker(1th)(=p);
d_n <-> ker(2th)(=n); nN <-> ker(3th)(=N);
nA <-> ker(4th)(=A);
(2) Analyzing the parameters of the data-transfer API functions gives the following correspondences:
d_n <-> h_n; d_p <-> h_p; h_f <-> d_f;
This gives the variable correspondence between host-side code and Kernel code, shown in Fig. 3(b).
A correspondence such as d_f <-> ker(0th)(=f) means that the passed argument d_f is equivalent to the formal parameter ker(0th)(=f); that is, the symbol <-> means "is equivalent to".
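The two correspondences in this example can be composed mechanically. The sketch below is an illustration only: it assumes the compiler has already extracted, from the clSetKernelArg() calls, which host argument is bound to which kernel parameter index, and, from the buffer transfer calls, which device buffer is paired with which host array. The input lists are hypothetical encodings of those facts for the Fig. 3 variable names.

```python
def host_kernel_aliases(kernel_params, set_arg_calls, transfers):
    """Map each kernel parameter to the chain of host-side names it aliases.

    kernel_params: formal parameter names, in order.
    set_arg_calls: (parameter index, host argument) pairs.
    transfers:     (device buffer, host array) pairs.
    """
    device_to_host = dict(transfers)
    aliases = {}
    for index, host_arg in set_arg_calls:
        chain = [host_arg]
        # Follow the data transfer, if any, back to the host array.
        if host_arg in device_to_host:
            chain.append(device_to_host[host_arg])
        aliases[kernel_params[index]] = chain
    return aliases


kernel_params = ["f", "p", "n", "N", "A"]
set_arg_calls = [(0, "d_f"), (1, "d_p"), (2, "d_n"), (3, "nN"), (4, "nA")]
transfers = [("d_n", "h_n"), ("d_p", "h_p"), ("d_f", "h_f")]

aliases = host_kernel_aliases(kernel_params, set_arg_calls, transfers)
```

With these inputs the kernel parameter n is tied to d_n and then to h_n, while scalar arguments such as nN have no transfer and map only to themselves.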
3.2) Apply traditional data-flow analysis on the WII-CFG graph to analyze host-side and device-side code jointly, including alias relations, definition-use chains, and live-range analysis for variables of the host-side and device-side code and for variables of different threads. This supports subsequent optimizations such as data layout.
Continuing with Fig. 3: the result of 3.1) shows that n (in the Kernel code) corresponds to h_n and d_n, so the data-flow analysis determines that the actual definition point of n is the assignment to h_n in the host-side code. Such data-flow results benefit the subsequent optimization analysis and code transformation.
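The lookup described for n can be sketched as a walk along the alias chain until a host-side definition is found. The dictionaries below are hypothetical summaries of earlier analysis results, and a real definition-use analysis would operate on the WII-CFG rather than on statement strings.

```python
def resolve_definition(kernel_var, aliases, host_definitions):
    """Return (host variable, defining statement) for a kernel variable, or None."""
    for host_var in aliases.get(kernel_var, []):
        if host_var in host_definitions:
            return host_var, host_definitions[host_var]
    return None


# Alias chain from step 3.1) and the host statement that actually writes h_n.
aliases = {"n": ["d_n", "h_n"]}
host_definitions = {"h_n": "h_n[i + j*nA] = neighborIter[i][j]"}

definition = resolve_definition("n", aliases, host_definitions)
```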
Step 4: code optimization. The results of the preceding analyses are used to optimize the code and improve Kernel code performance. Three targeted optimizations that improve performance portability are added:
4.1) Thread merging. The definition-use chains of variables across threads, obtained by the data-flow analysis, identify redundant operations across threads. For such local cross-thread redundancy, cf adjacent threads along some dimension can be selectively merged. Suppose merging is done along dimension j and the Work-Group contains localsize(0) * localsize(1) * localsize(2) Work-Items; then 1 ≤ cf ≤ localsize(j). Without hurting the performance that depends on parallelism, merging removes redundant computation, memory accesses, or synchronization and improves code performance. Corresponding changes are made in both the host-side code and the Kernel code.
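The effect of merging cf adjacent threads can be shown with a toy model. In the sketch below, every fine-grained thread recomputes the same thread-invariant value, which is the kind of cross-thread redundancy the text refers to; the coarsened version computes it once per merged thread. The function names and the one-dimensional work-group are assumptions for illustration.

```python
def run_fine_grained(data, local_size, expensive):
    # One loop iteration models one work-item; `expensive` does not depend on
    # tid, so every thread repeats the same computation.
    calls, out = 0, []
    for tid in range(local_size):
        scale = expensive(len(data))
        calls += 1
        out.append(data[tid] * scale)
    return out, calls


def run_coarsened(data, local_size, cf, expensive):
    # Merge cf adjacent threads into one coarse-grained thread, with
    # 1 <= cf <= local_size; the invariant value is computed once per merge.
    if not (1 <= cf <= local_size) or local_size % cf != 0:
        raise ValueError("cf must divide local_size and satisfy 1 <= cf <= local_size")
    calls, out = 0, []
    for coarse_tid in range(local_size // cf):
        scale = expensive(len(data))
        calls += 1
        for k in range(cf):
            tid = coarse_tid * cf + k
            out.append(data[tid] * scale)
    return out, calls


data = list(range(8))
fine_out, fine_calls = run_fine_grained(data, 8, lambda n: n + 1)
coarse_out, coarse_calls = run_coarsened(data, 8, 4, lambda n: n + 1)
```

The results are identical while the redundant computation runs local_size / cf times instead of local_size times; the host side would launch correspondingly fewer work-items, which is why the text notes that the host code changes too.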
4.2) Data layout optimization. As described in 2.2), acceleration devices fall broadly into two classes, serialized execution and data-parallel execution. Accordingly, one of two data layouts is chosen to suit the device: contiguous within a thread (suited to serialized execution) or contiguous across threads (suited to data-parallel execution). The definitions and uses of the affected arrays or variables in the host-side code and the Kernel code are modified accordingly, using the information from the data-flow analysis, namely the alias relations and definition-use chains between host-side and Kernel variables.
Continuing with the code of Fig. 3: for an acceleration device with data-parallel execution, the Kernel code should use the cross-thread contiguous layout. The use of n (in the Kernel code) in the source, the statement idx = n[tid + j*A], is already contiguous across threads, so the original layout need not change. For an acceleration device with serialized execution, the Kernel code should use the within-thread contiguous layout, so the layout of n (in the Kernel code) should be optimized: its use statement is changed to idx = n[tid*N + j], and, to keep the program correct, its actual definition statement is changed as well (the host-side statement h_n[i + j*nA] = neighborIter[i][j] becomes h_n[i*nN + j] = neighborIter[i][j]).
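The two layouts and the paired host and kernel index changes can be checked with a short sketch. It models the host fill and the kernel read as plain Python functions using the index expressions quoted above; the sample sizes and the contents of neighbor_iter are made up for the example.

```python
def fill_cross_thread(neighbor_iter):
    # Host fill for the cross-thread contiguous layout: h_n[i + j*nA].
    nA, nN = len(neighbor_iter), len(neighbor_iter[0])
    h_n = [None] * (nA * nN)
    for i in range(nA):
        for j in range(nN):
            h_n[i + j * nA] = neighbor_iter[i][j]
    return h_n


def fill_in_thread(neighbor_iter):
    # Host fill for the within-thread contiguous layout: h_n[i*nN + j].
    nA, nN = len(neighbor_iter), len(neighbor_iter[0])
    h_n = [None] * (nA * nN)
    for i in range(nA):
        for j in range(nN):
            h_n[i * nN + j] = neighbor_iter[i][j]
    return h_n


def read_cross_thread(n, tid, j, A):
    return n[tid + j * A]      # kernel use: idx = n[tid + j*A]


def read_in_thread(n, tid, j, N):
    return n[tid * N + j]      # kernel use: idx = n[tid*N + j]


# 4 threads (nA = 4), 3 neighbors per thread (nN = 3); values are arbitrary.
neighbor_iter = [[10 * i + j for j in range(3)] for i in range(4)]
cross = fill_cross_thread(neighbor_iter)
within = fill_in_thread(neighbor_iter)
```

Both layouts return neighbor_iter[tid][j] as long as the host definition and the kernel use are changed together, which is the correctness condition the paragraph states. In the first layout adjacent threads touch adjacent addresses for a fixed j; in the second a single thread touches adjacent addresses as j advances.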
4.3) Aggressive vectorization optimization. When actually executed, the Kernel code is instantiated into numerous threads that run concurrently, so from the standpoint of vectorization there are opportunities both across threads and within a thread. According to the SIMD instruction width of the specific hardware, the Kernel code is first automatically vectorized across threads and then automatically vectorized within each thread. The code transformation changes the statements that define and use the relevant variables, and likewise relies on the precise def-use chains and live-range results obtained by data-flow analysis.
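Cross-thread vectorization can be pictured as packing adjacent work-items into the lanes of one vector operation. The Python model below is a rough sketch of that idea only; it does not model in-thread vectorization, and the kernel body, the function names, and the `simd_width` parameter are hypothetical.

```python
def scalar_kernel(a, b, tid):
    """Work done by one work-item in a hypothetical original kernel."""
    return a[tid] * b[tid] + 1.0


def run_scalar(a, b):
    """Reference execution: one scalar operation per work-item."""
    return [scalar_kernel(a, b, tid) for tid in range(len(a))]


def run_cross_thread_vectorized(a, b, simd_width):
    """Process simd_width adjacent work-items per step, one lane per work-item."""
    out = []
    steps = 0
    for base in range(0, len(a), simd_width):
        lanes_a = a[base:base + simd_width]  # models one vector load
        lanes_b = b[base:base + simd_width]
        # Models one vector multiply-add applied to all lanes at once.
        out.extend(x * y + 1.0 for x, y in zip(lanes_a, lanes_b))
        steps += 1
    return out, steps


a = [float(i) for i in range(1, 9)]
b = [0.5] * 8
scalar_out = run_scalar(a, b)
vector_out, steps = run_cross_thread_vectorized(a, b, simd_width=4)
```

With eight work-items and a width of four, the model needs two vector steps instead of eight scalar ones while producing the same values. A width that does not divide the work-item count leaves a shorter final step.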
Step 5, code generation and the subsequent compilation process. The host code and the kernel code are separated out of the optimized fused AST, and our compiler outputs the optimized host code and kernel code, that is, the source code of the optimized OpenCL program. This code can then be compiled by a native compiler and run, following the conventional OpenCL compilation flow.
The following is a system embodiment corresponding to the method embodiment above; the two embodiments can be implemented in cooperation with each other. The technical details mentioned in the embodiment above remain valid in this embodiment and, to reduce repetition, are not described again here. Correspondingly, the technical details mentioned in this embodiment also apply to the embodiment above.
The invention also discloses an OpenCL compiler system based on a code fusion compilation framework, comprising:
Module 1, which obtains an OpenCL source program, compiles the host-side code in the source program into a host abstract syntax tree, obtains the kernel code file of the kernel run function in that abstract syntax tree, compiles the kernel code file to obtain a kernel abstract syntax tree and stores it in shared memory, and fetches and reconstructs all kernel abstract syntax trees from the shared memory to obtain a fused abstract syntax tree that merges the host abstract syntax tree and the kernel abstract syntax trees;
Module 2, which obtains, based on the fused abstract syntax tree, the respective control-flow graphs of the host abstract syntax tree and the kernel abstract syntax tree, adds function-call edges and function-return edges to connect the two control-flow graphs into an inlined control-flow graph, obtains from the WII function of the target platform's characteristics the order in which the instructions in the kernel's work-items execute on the corresponding target platform, and depicts that execution order in the inlined control-flow graph to obtain the WII-CFG;
Module 3, which analyzes the function argument passing of the kernel code and the parameters of the OpenCL API function calls that transfer data between the host side and the device side, obtains the correspondence between host-side variables and kernel variables as a first analysis result, and performs data-flow analysis on the WII-CFG to obtain a second analysis result;
Module 4, which optimizes the kernel code in the fused abstract syntax tree according to the first analysis result and the second analysis result to obtain an optimized abstract syntax tree;
Module 5, which inputs the optimized abstract syntax tree to the compiler and, after translation, outputs the optimized host code and kernel code as the compilation result.
In the OpenCL compiler system based on the code fusion compilation framework, module 2 comprises: obtaining the WII function of the target platform according to how the threads of the kernel code execute on the target platform, the WII function being used to compute the order in which the instructions of the work-items in the kernel execute on the target platform.
In the OpenCL compiler system based on the code fusion compilation framework, module 3 specifically comprises:
analyzing the correspondence between the actual-argument variables passed to the kernel function and its formal-parameter variables, together with the parameters of the OpenCL API function calls that transfer data between the host side and the device side, to obtain the correspondence between host-side variables and kernel variables as the first analysis result; and performing data-flow analysis on the WII-CFG to obtain the second analysis result, which includes the def-use chains and live ranges among the different variables of the host-side code and the kernel code.
In the OpenCL compiler system based on the code fusion compilation framework, the optimization in module 4 specifically comprises:
a thread-merging module, which identifies redundant operations across threads according to the def-use chains in the second analysis result and merges the multiple threads that execute the redundant operations into one coarse-grained thread, so as to reduce cross-thread code redundancy;
a data layout module, which, according to the first analysis result, the def-use chains in the second analysis result, and the thread organization and execution mode of the target platform, selects the better of the in-thread-contiguous and cross-thread-contiguous layouts and applies the corresponding code transformation;
a vectorization module, which vectorizes code across threads and within threads according to the live ranges and def-use chains in the second analysis result.
The OpenCL compiler system based on the code fusion compilation framework further comprises: module 6, which calls a native compiler to compile the compilation result according to the OpenCL compilation flow and then runs it.
Technical effect of the invention includes:
1, host-kernel code merges compiler framework.For OpenCL program, the definition of array or variable and use it Between often beyond kernel code range, mainframe code further specify work item (Work-Group) organizational parameter (that is, How many a Work-Item included).Then, depth analysis and optimization OpenCL program, it is necessary to Intrusion Detection based on host end code and kernel The fusion compiler framework of code.
Technical effect: in the analysis phase of compiler, host side code intermediate representation and Kernel code can be obtained simultaneously Intermediate representation, and energy while deployment analysis.
2. Fused control-flow graph WII-CFG. Hardware architectures and run-time thread organization and execution modes differ between acceleration devices, so the order in which instructions from different threads (i.e. Work-Items) execute varies with the device. For the target acceleration device, we obtain the corresponding WII function, which computes the execution order of the instructions in the threads, and then use the WII-CFG to express the host-side code CFG and the Kernel code CFG together with the instruction execution order of the different threads.
Technical effect: the WII-CFG serves as the infrastructure for analyzing how cross-thread code behaves on different acceleration devices. By extending the traditional CFG, it represents the host-side code CFG, the Kernel code CFG, and the instruction execution order of the Kernel's different thread instances at once; it reflects the run-time characteristics of the acceleration device and helps uncover cross-thread optimization opportunities.
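The text does not give the concrete form of a WII function, so the following Python sketch is only one possible reading. It assumes two idealized platforms: a "lockstep" device on which all work-items execute the same instruction before moving on, and a "serialized" device on which each work-item runs to completion. The names `wii_order` and `wii_edges` and the mode strings are hypothetical.

```python
def wii_order(num_work_items, num_instructions, mode):
    """Return the order in which (work_item, instruction) pairs execute."""
    if mode == "lockstep":
        # Data-parallel device: every work-item executes instruction 0, then 1, ...
        # Lanes of the same step are listed in work-item order for simplicity.
        return [(w, i) for i in range(num_instructions)
                for w in range(num_work_items)]
    if mode == "serialized":
        # Serializing device: one work-item runs to completion before the next.
        return [(w, i) for w in range(num_work_items)
                for i in range(num_instructions)]
    raise ValueError("unknown execution mode: " + mode)


def wii_edges(order):
    """Edges between consecutively executed instruction instances."""
    return list(zip(order, order[1:]))


lockstep = wii_order(2, 3, "lockstep")
serialized = wii_order(2, 3, "serialized")
```

Adding edges such as those from `wii_edges` to an ordinary control-flow graph is one way to picture how the same kernel yields different cross-thread instruction orders on different devices.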
3. Joint host-kernel data-flow analysis. This extends traditional data-flow analysis techniques in two respects: 1) by analyzing OpenCL argument passing and the parameters of the data-transfer APIs, it obtains the correspondence between host code variables and device-side code variables; 2) based on the WII-CFG, it carries out joint data-flow analysis of host-side and device-side code, covering alias relationships, def-use chains, and live-range analysis among variables in the code at the two ends and among different threads. This facilitates cross-thread optimization.
Technical effect: data-flow analysis can extend beyond the scope of the host code or of the Kernel code alone, and variable def-use analysis can be performed for multi-threaded code, which facilitates optimizations related to inter-thread data and computation.
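The first part of the joint analysis, linking host variables to kernel variables, can be sketched as follows. This is an illustrative model, not the patented analysis: the API calls are represented as simplified tuples rather than the real C signatures of `clEnqueueWriteBuffer` and `clSetKernelArg`, and the buffer names `d_n` and `d_out`, the kernel name `k`, and the parameter name `out` are hypothetical. The names `h_n` and `n` come from the Fig. 3 example.

```python
def map_host_to_kernel(api_calls, kernel_params):
    """Link host variables to kernel parameters through device buffers."""
    buffer_to_host = {}   # device buffer -> host variable copied into it
    buffer_to_param = {}  # device buffer -> kernel formal parameter
    for call in api_calls:
        name = call[0]
        if name == "clEnqueueWriteBuffer":
            # Simplified record: (api name, device buffer, host pointer)
            _, device_buffer, host_var = call
            buffer_to_host[device_buffer] = host_var
        elif name == "clSetKernelArg":
            # Simplified record: (api name, kernel, argument index, device buffer)
            _, kernel, arg_index, device_buffer = call
            buffer_to_param[device_buffer] = kernel_params[kernel][arg_index]
    # A host variable corresponds to a kernel parameter when both are
    # attached to the same device buffer.
    return {
        buffer_to_host[buf]: buffer_to_param[buf]
        for buf in buffer_to_host
        if buf in buffer_to_param
    }


api_calls = [
    ("clEnqueueWriteBuffer", "d_n", "h_n"),
    ("clSetKernelArg", "k", 0, "d_n"),
    ("clSetKernelArg", "k", 1, "d_out"),
]
kernel_params = {"k": ["n", "out"]}
correspondence = map_host_to_kernel(api_calls, kernel_params)
```

Once such a correspondence is known, a definition of `h_n` on the host and a use of `n` in the kernel can be placed on the same def-use chain, which is what lets the data layout optimization rewrite both sides consistently.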
Although the invention is disclosed through the embodiments above, the specific examples serve only to explain the invention and not to limit it. Any person skilled in the art may make changes and refinements without departing from the spirit and scope of the invention, so the scope of protection of the invention is defined by the claims and their equivalents.

Claims (10)

1. An OpenCL compiler method based on a code fusion compilation framework, characterized by comprising:
Step 1, obtaining an OpenCL source program, compiling the host-side code in the source program into a host abstract syntax tree, obtaining the kernel code file of the kernel run function in that abstract syntax tree, compiling the kernel code file to obtain a kernel abstract syntax tree and storing it in shared memory, and fetching and reconstructing all kernel abstract syntax trees from the shared memory to obtain a fused abstract syntax tree that merges the host abstract syntax tree and the kernel abstract syntax trees;
Step 2, obtaining, based on the fused abstract syntax tree, the respective control-flow graphs of the host abstract syntax tree and the kernel abstract syntax tree, adding function-call edges and function-return edges to connect the two control-flow graphs into an inlined control-flow graph, obtaining from the WII function of the target platform's characteristics the order in which the instructions in the kernel's work-items execute on the corresponding target platform, and depicting that execution order in the inlined control-flow graph to obtain the WII-CFG;
Step 3, analyzing the function argument passing of the kernel code and the parameters of the OpenCL API function calls that transfer data between the host side and the device side, obtaining the correspondence between host-side variables and kernel variables as a first analysis result, and performing data-flow analysis on the WII-CFG to obtain a second analysis result;
Step 4, optimizing the kernel code in the fused abstract syntax tree according to the first analysis result and the second analysis result to obtain an optimized abstract syntax tree;
Step 5, translating the optimized abstract syntax tree with the compiler and outputting the optimized host code and kernel code as the compilation result.
2. The OpenCL compiler method based on the code fusion compilation framework according to claim 1, characterized in that step 2 comprises: obtaining the WII function of the target platform according to how the threads of the kernel code execute on the target platform, the WII function being used to compute the order in which the instructions of the work-items in the kernel execute on the target platform.
3. The OpenCL compiler method based on the code fusion compilation framework according to claim 1 or 2, characterized in that step 3 specifically comprises:
analyzing the correspondence between the actual-argument variables passed to the kernel function and its formal-parameter variables, together with the parameters of the OpenCL API function calls that transfer data between the host side and the device side, to obtain the correspondence between host-side variables and kernel variables as the first analysis result; and performing data-flow analysis on the WII-CFG to obtain the second analysis result, which includes the def-use chains and live ranges among the different variables of the host-side code and the kernel code.
4. The OpenCL compiler method based on the code fusion compilation framework according to claim 3, characterized in that the optimization in step 4 specifically comprises:
a thread-merging step of identifying redundant operations across threads according to the def-use chains in the second analysis result and merging the multiple threads that execute the redundant operations into one coarse-grained thread, so as to reduce cross-thread code redundancy;
a data layout step of selecting, according to the first analysis result, the def-use chains in the second analysis result, and the thread organization and execution mode of the target platform, the better of the in-thread-contiguous and cross-thread-contiguous layouts, and applying the corresponding code transformation;
a vectorization step of vectorizing code across threads and within threads according to the live ranges and def-use chains in the second analysis result.
5. The OpenCL compiler method based on the code fusion compilation framework according to claim 1, characterized by further comprising: step 6, calling a native compiler to compile the compilation result according to the OpenCL compilation flow and then running it.
6. An OpenCL compiler system based on a code fusion compilation framework, characterized by comprising:
Module 1, which obtains an OpenCL source program, compiles the host-side code in the source program into a host abstract syntax tree, obtains the kernel code file of the kernel run function in that abstract syntax tree, compiles the kernel code file to obtain a kernel abstract syntax tree and stores it in shared memory, and fetches and reconstructs all kernel abstract syntax trees from the shared memory to obtain a fused abstract syntax tree that merges the host abstract syntax tree and the kernel abstract syntax trees;
Module 2, which obtains, based on the fused abstract syntax tree, the respective control-flow graphs of the host abstract syntax tree and the kernel abstract syntax tree, adds function-call edges and function-return edges to connect the two control-flow graphs into an inlined control-flow graph, obtains from the WII function of the target platform's characteristics the order in which the instructions in the kernel's work-items execute on the corresponding target platform, and depicts that execution order in the inlined control-flow graph to obtain the WII-CFG;
Module 3, which analyzes the function argument passing of the kernel code and the parameters of the OpenCL API function calls that transfer data between the host side and the device side, obtains the correspondence between host-side variables and kernel variables as a first analysis result, and performs data-flow analysis on the WII-CFG to obtain a second analysis result;
Module 4, which optimizes the kernel code in the fused abstract syntax tree according to the first analysis result and the second analysis result to obtain an optimized abstract syntax tree;
Module 5, which inputs the optimized abstract syntax tree to the compiler and, after translation, outputs the optimized host code and kernel code as the compilation result.
7. The OpenCL compiler system based on the code fusion compilation framework according to claim 6, characterized in that module 2 comprises: obtaining the WII function of the target platform according to how the threads of the kernel code execute on the target platform, the WII function being used to compute the order in which the instructions of the work-items in the kernel execute on the target platform.
8. The OpenCL compiler system based on the code fusion compilation framework according to claim 6 or 7, characterized in that module 3 specifically comprises:
analyzing the correspondence between the actual-argument variables passed to the kernel function and its formal-parameter variables, together with the parameters of the OpenCL API function calls that transfer data between the host side and the device side, to obtain the correspondence between host-side variables and kernel variables as the first analysis result; and performing data-flow analysis on the WII-CFG to obtain the second analysis result, which includes the def-use chains and live ranges among the different variables of the host-side code and the kernel code.
9. The OpenCL compiler system based on the code fusion compilation framework according to claim 8, characterized in that the optimization in module 4 specifically comprises:
a thread-merging module, which identifies redundant operations across threads according to the def-use chains in the second analysis result and merges the multiple threads that execute the redundant operations into one coarse-grained thread, so as to reduce cross-thread code redundancy;
a data layout module, which, according to the first analysis result, the def-use chains, and the thread organization and execution mode of the target platform, selects the better of the in-thread-contiguous and cross-thread-contiguous layouts and applies the corresponding code transformation;
a vectorization module, which vectorizes code across threads and within threads according to the live ranges and def-use chains in the second analysis result.
10. The OpenCL compiler system based on the code fusion compilation framework according to claim 6, characterized by further comprising: module 6, which calls a native compiler to compile the compilation result according to the OpenCL compilation flow and then runs it.
CN201910106880.3A 2019-02-02 2019-02-02 OpenCL compiler design method and system based on code fusion compiling framework Active CN109933327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910106880.3A CN109933327B (en) 2019-02-02 2019-02-02 OpenCL compiler design method and system based on code fusion compiling framework

Publications (2)

Publication Number Publication Date
CN109933327A true CN109933327A (en) 2019-06-25
CN109933327B CN109933327B (en) 2021-01-08

Family

ID=66985577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910106880.3A Active CN109933327B (en) 2019-02-02 2019-02-02 OpenCL compiler design method and system based on code fusion compiling framework

Country Status (1)

Country Link
CN (1) CN109933327B (en)

Also Published As

Publication number Publication date
CN109933327B (en) 2021-01-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20231226
Address after: Room 1305, 13th Floor, No.1 Zhongguancun Street, Haidian District, Beijing, 100086
Patentee after: Zhongke Jiahe (Beijing) Technology Co.,Ltd.
Address before: 100080 No. 6 South Road, Zhongguancun Academy of Sciences, Beijing, Haidian District
Patentee before: Institute of Computing Technology, Chinese Academy of Sciences