CN111966397A - Automatic transplanting and optimizing method for heterogeneous parallel programs - Google Patents


Info

Publication number
CN111966397A
Authority
CN
China
Prior art keywords: parallel, function, state, variable, openmp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010710022.2A
Other languages
Chinese (zh)
Inventors
张伟哲 (Zhang Weizhe)
王法瑞 (Wang Farui)
何慧 (He Hui)
郭浩男 (Guo Haonan)
刘亚维 (Liu Yawei)
张玥 (Zhang Yue)
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202010710022.2A
Publication of CN111966397A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/70: Software maintenance or management
    • G06F 8/76: Adapting program code to run in a different environment; Porting
    • G06F 8/40: Transformation of program code
    • G06F 8/41: Compilation
    • G06F 8/51: Source to source

Abstract

An automatic transplanting and optimizing method for heterogeneous parallel programs, belonging to heterogeneous parallel program development technology. The invention aims to realize automatic transplantation of CPU parallel programs, reduce the workload of developers, and improve program performance, thereby solving the problems of parallel instruction conversion and of data transmission management and optimization. The technical points are as follows: constructing the framework of an automatic heterogeneous parallel program transplanting system, which automatically translates an OpenMP CPU parallel program into an OpenMP Offloading heterogeneous parallel program; formalizing consistency state conversion so that, on the premise of ensuring data consistency, transmission operations are optimized and redundant data transmission is reduced; designing a runtime library that provides automatic data transmission management and optimization and maintains the consistency state of each variable memory region; and designing a source-to-source translator that automatically converts the parallel instructions and automatically inserts the runtime APIs. The method can automatically identify CPU parallel instructions, convert them into accelerator parallel instructions, and improve program performance.

Description

Automatic transplanting and optimizing method for heterogeneous parallel programs
Technical Field
The invention relates to an automatic transplanting and optimizing method for a heterogeneous parallel program, and belongs to the heterogeneous parallel program development technology.
Background
With the great computing-power demands of applications such as artificial intelligence, image processing, multi-physical-field simulation, quantum simulation, and climate simulation, heterogeneous platforms based on various accelerators have replaced the Central Processing Unit (CPU) as the main source of computing power. In the field of high-performance computing, a GPU (Graphics Processing Unit) is mainly used as the accelerator, while on mobile platforms a GPU, DSP (Digital Signal Processor), or FPGA (Field Programmable Gate Array) is mainly used. The accelerator provides great computing power but also brings great challenges to application development and transplantation.
The standard for CPU parallel programming is the OpenMP (Open Multi-Processing) model, while heterogeneous parallel programming requires heterogeneous programming models such as CUDA (Compute Unified Device Architecture), OpenCL (Open Computing Language), OpenACC (Open Accelerators), and OpenMP Offloading (an OpenMP extension). Writing efficient heterogeneous parallel programs often requires developers to understand the characteristics of heterogeneous platforms and master the parallel programming models; even for capable developers, porting CPU parallel programs to heterogeneous platforms is a very time-consuming and error-prone task. Therefore, an automatic migration tool is needed to automatically migrate OpenMP CPU parallel programs to heterogeneous platforms.
The prior art with publication number CN104035781A (CN104035781B) provides a method for rapidly developing heterogeneous parallel programs, covering performance analysis of CPU serial programs and migration to heterogeneous parallel programs: first, performance and algorithm analysis is performed on the CPU serial program to locate its performance bottlenecks and parallelism; then OpenACC precompiled instructions are inserted on the basis of the original code to obtain heterogeneous parallel code executable in a heterogeneous parallel environment; finally the code is compiled and executed according to the specified software and hardware platform parameters, and the need for further optimization is determined from the program running results. This prior art can parallelize existing programs efficiently so that they make full use of the computing capacity of a heterogeneous system, and the method is highly practical and easy to popularize. However, the two major challenges of the transplantation process, namely parallel instruction conversion and data transmission management and optimization, remain unaddressed.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention aims to provide an automatic transplanting and optimizing method for heterogeneous parallel programs, which aims to realize automatic transplanting of CPU parallel programs, reduce the workload of developers and improve the performance of the programs, thereby solving the problems of parallel instruction conversion, data transmission management and optimization.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a heterogeneous parallel program automatic transplanting and optimizing method is realized by the following steps:
step 1, constructing a framework of an automatic heterogeneous parallel program transplanting system
The heterogeneous parallel program automatic transplanting system, called the OAO system for short, automatically translates an OpenMP CPU parallel program into an OpenMP Offloading heterogeneous parallel program and, in combination with a runtime system, automatically manages and optimizes data transmission between the CPU and the accelerator; the OAO system framework mainly comprises a source-to-source translator and a runtime library;
the runtime library contains three types of APIs: coherency state tracking, data transmission, coherency state conversion; a consistency state tracking API captures a variable memory area and initializes a consistency state; the data transmission API dynamically determines and executes transmission operation according to the current consistency state and the consistency constraint which needs to be met, and meanwhile, the consistency state is updated; the state conversion API updates the consistency state according to the read-write operation type;
the source-to-source translator translates the OpenMP CPU parallel code into an OpenMP off heterogeneous parallel code and inserts a proper runtime API; it consists of 3 modules: the system comprises a data transmission API (application programming interface) inserting module, a state conversion API inserting module and a parallel instruction translation module; the parallel instruction translation module translates the OpenMP CPU parallel instruction into an OpenMP offload heterogeneous parallel instruction to obtain an OpenMP offload kernel; the data transmission API inserting module and the state conversion API inserting module are respectively inserted into two corresponding runtime APIs;
the OAO system works as follows: the OpenMP CPU parallel code is translated from source to obtain an OpenMP off-streaming heterogeneous parallel code containing a runtime API, the OpenMP off-streaming code runs on a heterogeneous platform after being compiled, an OpenMP off-streaming kernel runs on an accelerator, and other programs run on a CPU; the runtime library manages data transmission between the CPU and the accelerator through the inserted API, ensures data consistency, dynamically optimizes transmission, and reduces redundant data transmission;
step 2, consistency state conversion formalization
For a heterogeneous platform, a variable has copies in both the CPU memory and the accelerator memory, and the consistency state describes the validity of the CPU copy and the accelerator copy of the variable; the simplest state conversion function is derived from the current consistency state and the consistency state constraint that needs to be met, and the simplest transmission operation type the function must execute is determined through the corresponding relation, so as to optimize transmission operations and reduce redundant data transmission on the premise of ensuring data consistency;
step 3, design of runtime library
The runtime library is used for providing automatic data transmission management and optimization functions, maintaining the consistency state of each variable memory region and giving three types of API functions and corresponding descriptions of the runtime library;
step 4 Source to Source translator design
Based on the static analysis capability of the Clang/LLVM compiling framework, the source-to-source translator translates OpenMP CPU parallel code into OpenMP Offloading heterogeneous parallel code, inserts runtime APIs (application program interfaces) in the appropriate manner, and optimizes data transmission on the premise of ensuring data consistency;
the source-to-source translator collects information such as a serial domain, a parallel domain, variable reference and the like through static analysis; then establishing a serial-parallel control flow diagram of the program and binding variable reference information with a corresponding parallel domain serial domain; then, inserting data transmission API insertion and state conversion API insertion by taking a serial domain and a parallel domain as granularity; and finally translating the OpenMP CPU parallel instruction into an OpenMP Offloading parallel instruction.
The invention has the following beneficial technical effects:
the invention provides an Automatic transplanting and optimizing method (OAO for short) from an OpenMP CPU parallel program to an OpenMP off-streaming heterogeneous parallel program, which can realize an Automatic transplanting system based on the OAO method, and can improve the program performance while reducing the workload of developers. The OAO method can automatically solve two major problems in the transplantation process: parallel instruction conversion, data transmission management and optimization. The method can automatically identify the CPU parallel instruction and convert the CPU parallel instruction into the accelerator parallel instruction. The method can dynamically optimize data transmission in the program execution process by using the runtime system, thereby reducing redundant data transmission and improving the program performance while ensuring data consistency.
For a large number of existing OpenMP CPU parallel programs, developers can directly use an OAO system to realize automatic heterogeneous transplantation, and the performance of the programs is improved by using the computing power of a heterogeneous platform. For new applications, developers can continue to use the familiar OpenMP model and then use the OAO system for automatic heterogeneous migration to achieve program performance improvements.
Drawings
FIG. 1 is a flow diagram of a heterogeneous parallel program automatic migration system (OAO system) framework;
FIG. 2 is a coherency state transition diagram;
FIG. 3 is a series-parallel control flow diagram;
FIG. 4 is a screenshot of an OpenMP CPU parallel program;
FIG. 5 is a screenshot of a manually written OpenMP Offloading heterogeneous parallel program;
FIG. 6 is a screenshot of an OpenMP Offloading heterogeneous parallel program resulting from source-to-source translation;
FIG. 7 is a histogram of speedup ratios of the other versions relative to the OMP version (K40 platform);
FIG. 8 is a histogram of speedup ratios of the other versions relative to the OMP version (2080Ti platform);
FIG. 9 is a histogram of the percentage of transmitted data volume saved by the OAO version relative to the other versions;
FIG. 10 is a histogram of the percentage of transmission time saved by the OAO version relative to the other versions (2080Ti platform).
English and acronyms in fig. 7-10 are well known terms in the art.
Detailed Description
The implementation of the invention is illustrated below with reference to the accompanying figures 1 to 10:
1. heterogeneous parallel program automatic transplanting system framework
The heterogeneous parallel program automatic transplanting system (OAO system for short) automatically translates OpenMP CPU parallel programs into OpenMP Offloading heterogeneous parallel programs and, in combination with the runtime system, automatically manages and optimizes data transmission between the CPU and the accelerator. The OAO system framework is shown in FIG. 1 and consists essentially of two parts (shaded in the figure): the source-to-source translator and the runtime library.
The runtime library provides automatic data transmission management and optimization, maintains the consistency state of each variable memory region, and comprises three types of Application Programming Interfaces (APIs): coherency state tracking, data transfer, and coherency state transition. The coherency state tracking API captures variable memory regions and initializes coherency states. The data transfer API dynamically determines and executes transfer operations according to the current coherency state and the coherency constraint that must be met, updating the coherency state at the same time. The state transition API updates the coherency state according to the type of read or write operation.
The source-to-source translator translates the OpenMP CPU parallel code into OpenMP Offloading heterogeneous parallel code and inserts the appropriate runtime APIs. It consists of three modules: the data transmission API insertion module, the state conversion API insertion module, and the parallel instruction translation module. The parallel instruction translation module translates OpenMP CPU parallel instructions into OpenMP Offloading heterogeneous parallel instructions to obtain the OpenMP Offloading kernels. The data transmission API insertion module and the state conversion API insertion module insert the two corresponding kinds of runtime APIs, respectively.
The OAO system works as follows: the OpenMP CPU parallel code is translated source-to-source to obtain OpenMP Offloading heterogeneous parallel code containing runtime APIs. The OpenMP Offloading code is compiled and then runs on the heterogeneous platform, with the OpenMP Offloading kernels running on the accelerator and the rest of the program running on the CPU. The runtime library manages data transmission between the CPU and the accelerator through the inserted APIs, ensuring data consistency while performing dynamic transmission optimization to reduce redundant data transmission.
2. Coherency state transition formalization
Data transmission management and optimization are the key points in the process of transplanting the OpenMP CPU program to a heterogeneous platform, and the part formalizes variable consistency state, consistency state conversion, consistency state constraint and the like and provides theoretical support for the design of a subsequent runtime library and a source-to-source translator.
For a heterogeneous platform, variables may have copies in both CPU memory and accelerator memory, and a coherency state is used to describe the validity of the CPU copy and accelerator copy of the variable, which is defined as follows:
definition 1. coherency State is a 3bit binary number. Where Bit0 indicates that an accelerator copy exists for the variable is (Bit0 ═ 1)/no (Bit0 ═ 0); bit1 indicates that the CPU copy is (Bit1 ═ 1)/no (Bit1 ═ 0) valid; bit2 indicates that the accelerator copy is (Bit2 ═ 1)/no (Bit2 ═ 0) valid.
TABLE 1 all possible coherency states
(Table 1 is given as an image in the original publication.)
All possible coherency states, per Definition 1, are given in Table 1. Data transfers between the CPU and the accelerator, as well as read and write operations on the CPU and the accelerator, change the coherency state; a coherency state transition function is used to represent these changes, defined as follows:
definition 2. the form of the coherency state transfer function TransFunc is shown in equation (1), where InVld and Vld are a pair of 3bit binary numbers, where the operator is a Boolean operator. If a bit in InVld is set to 0, the corresponding bit in inState can be converted to 0; if a bit in Vld is set to 1, the corresponding bit in inState can be converted to 1; setting different values for both can convert any inState to any outState. TransFunc is also abbreviated as form (2).
Figure BDA0002596206070000052
TransFunc={InVld,Vld} (2)
According to Definition 2, all possible coherency state transition functions and their corresponding data transfer or read-write operations are given, as shown in Table 2.
TABLE 2 all possible coherency state transition operations
(Table 2 is given as an image in the original publication.)
A graph of the transition relationships between the various coherency states is given according to definition 1 and definition 2, as shown in figure 2.
To ensure that the program is correct, read and write operations on the CPU and accelerator have certain requirements on the coherency state, which are expressed using coherency state constraints, which are defined as follows:
definition 3. coherency state constraint Constr consists of a pair of 3bit binary numbers, the form of which is shown in (3). The default value for Constr is 111,000, indicating no requirement for a coherency state. A bit of ConInVld may be set to 0 or a bit of ConInVld may be set to 1 to represent different constraint requirements on the coherency state, as shown in table 3.
Constr={ConInVld,ConVld} (3)
TABLE 3 ConVld and ConInVld meanings
(Table 3 is given as an image in the original publication.)
According to definition 3 and table 3, the coherency state constraints required for different read and write operations on the CPU and accelerator are given, as shown in table 4.
TABLE 4 all possible consistency constraints
(Table 4 is given as an image in the original publication.)
From the current coherency State and the coherency state constraint Constr = {ConInVld, ConVld} that must be satisfied, the simplest state transition function can be derived as follows:
the derivation formula of the simplest state transition function is shown in formula (4).
MinTrFunc(State)=State·InVld+Vld
Wherein
Figure BDA0002596206070000072
The simplest state transition function represents the simplest, non-redundant data transfer operation that must be performed so that, starting from the coherency State, the coherency state constraint Constr is satisfied. By deriving MinTrFunc, the simplest transfer operation type to be executed can be determined from the correspondence in Table 2 (the first six rows). Thus, on the premise of ensuring data consistency, transfer operations are optimized and redundant data transfers are reduced.
3. Runtime library design
The runtime library API functions, shown in Table 5, can be divided into three categories: coherency state tracking (the first six), data transfer (OAODataTrans), and coherency state transition (OAOStTrans); each is described separately below.
TABLE 5 runtime library API functions
(Table 5 is given as an image in the original publication.)
3.1 coherency State tracking API
The runtime library uses the variable memory region as the granularity for coherency state tracking and data transfer. In C/C++, a variable memory region is a contiguous memory region, which may originate from a local variable definition, a global variable definition, a malloc operation, a new operation, and so on. To record and track variable coherency states, the memory regions of variables and the memory environment are formalized as follows:
Definition 4. The variable memory region MemBlk is a quadruple, as shown in equation (5), where Begin is the memory start address, Length the memory region length, ElemSize the element size, and State the coherency state.
MemBlk={Begin,Length,ElemSize,State} (5)
Definition 5. the memory environment MemEnv is a set of all variable memory regions, as shown in equation (6).
MemEnv={MemBlk1,…,MemBlkn} (6)
The runtime library defines MemEnv as a global variable and maintains it throughout the execution of the OpenMP Offloading heterogeneous parallel program. When a variable is referenced, MemEnv can be searched using the corresponding pointer ptr; the MemBlk satisfying equation (7) is the referenced one.
Begin≤ptr≤Begin+Length-1 (7)
The coherency state tracking API is inserted into the source code in the appropriate manner during source-to-source translation. The OAOSaveArrayInfo function is inserted after a local variable declaration, or at the beginning of the main function for global variables, to record variable information. The OAOMalloc function replaces the malloc function, collecting memory allocation information while performing the allocation. OAONewInfo is inserted after a new operation to collect the allocation information. These three functions use the collected information to create a new corresponding MemBlk and initialize its State to HOST_ONLY.
OAODeleteArrayInfo is inserted at the end of a variable's scope, or at the end of the main function for global variables, and deletes the corresponding MemBlk. OAOFree replaces the free function, releasing the memory region and deleting the corresponding MemBlk. OAODeleteInfo is inserted after a delete operation; it deletes the corresponding MemBlk as the memory region is released.
In addition, the runtime is specially optimized for NVIDIA devices: when the memory to be allocated is larger than 128 KB, cudaMallocHost() is used instead of malloc() to allocate the memory.
3.2 data transfer API
As analyzed in Table 4, variable read and write operations require the variable to satisfy certain coherency state constraints; therefore, before a variable is accessed, the data transfer API, i.e. the OAODataTrans function, must be called to perform the simplest data transfer operation that satisfies the corresponding constraint. This section illustrates the data transfer API principle; its insertion method is described in Section 4. The OAODataTrans function uses Algorithm 1 below to determine the simplest state transition function required to satisfy the coherency state constraint Constr and performs the corresponding simplest data transfer operation while updating the coherency state. Lines 1-2 of Algorithm 1 find the corresponding variable memory region MemBlk through the variable pointer ptr; line 3 derives the simplest state transition function MinTrFunc via equation (4) from the coherency constraint Constr that must be met and the current coherency State; line 4 executes the data transfer operation corresponding to MinTrFunc per Table 2; line 5 updates the State using MinTrFunc according to equation (1).
(Algorithm 1 is given as an image in the original publication.)
3.3 coherency State transition API
As analyzed in Table 2 (the last three rows), read and write operations may change the coherency state of a variable; therefore, after a variable is accessed, the coherency state transition API, i.e. the OAOStTrans function, must be called to update the coherency state stored in the variable memory region MemBlk. This section illustrates the coherency state transition API principle; its insertion method is described in Section 4. The OAOStTrans function completes the coherency state transition using Algorithm 2 below. Lines 1-2 of Algorithm 2 find the corresponding variable memory region MemBlk through the variable pointer ptr; line 3 updates the coherency State contained in MemBlk using the state transition function StTrans according to equation (1).
(Algorithm 2 is given as an image in the original publication.)
4 Source to source translator design
The source-to-source translator is mainly based on the static analysis capability of the Clang/LLVM (C/C++ language compiler) framework; it translates OpenMP CPU parallel code into OpenMP Offloading heterogeneous parallel code, inserts the runtime APIs in the appropriate manner, and optimizes data transmission on the premise of ensuring data consistency.
The source-to-source translator collects information such as serial domains (Definition 6), parallel domains (Definition 7), and variable references through static analysis; it then establishes the serial-parallel control flow graph of the program (Definition 8) and binds the variable reference information to the corresponding parallel and serial domains (Definitions 9-10); next, it performs data transmission API insertion and state conversion API insertion with serial and parallel domains as the granularity; finally it translates the OpenMP CPU parallel instructions into OpenMP Offloading parallel instructions.
Definition 6. A serial domain SEQ is a section of code outside the scope of #pragma omp parallel that contains no internal branches and executes serially; a SEQ is also called a serial node in the serial-parallel control flow graph.
Definition 7. A parallel domain OMP is a section of code within the scope of a #pragma omp parallel that executes in parallel; it is also called a parallel node in the serial-parallel control flow graph.
Definition 8. The definition of the serial-parallel control flow graph SPGraph is shown in equation (8); SPGraph is a special control flow graph of a given function whose nodes are serial domains or parallel domains. An example of a serial-parallel control flow graph is shown in FIG. 3.
(Equation (8) is given as an image in the original publication.)
Definition 9. The definition of the variable reference list RefList is shown in equation (9); it is the reference list of one variable within a serial domain or parallel domain.
(Equation (9) is given as an image in the original publication.)
Definition 10. The variable reference information table NodeVarRef is the set of all variable reference lists in a given serial or parallel domain, as shown in equation (10). Each serial or parallel domain is bound to its corresponding NodeVarRef, as shown in FIG. 3.
NodeVarRef={RefList1,…,RefListl} (10)
Function calls in either the serial or parallel domain require special handling. A function call in the serial domain is separated out into its own serial domain. For a function argument passed by copy, its RefList = {R}. For a function argument passed as a pointer or reference, its RefList = {R} if there is no write operation in the called function, or RefList = {RW} if there is a write operation in the called function.
A function call in the parallel domain is treated as an access to its function arguments. For a function argument passed by copy, {R} is inserted into its RefList at the appropriate position. For a function argument passed as a pointer or reference, {R} is inserted into its RefList at the appropriate position if there is no write operation in the called function, or {RW} if there is a write operation in the called function.
Based on the program abstract representation, the design of the three main functions of the source-to-source translator is given.
4.1 data transfer API insertion
As mentioned above, most serial and parallel domains require the needed data transfer API to be inserted before them. However, for functions called within a parallel domain (OMP call functions for short), no runtime API can be inserted, because an OMP call function runs on the accelerator, where the runtime API cannot run. The data consistency of such functions is instead guaranteed by the runtime APIs before and after the function call.
The data transfer API insertion algorithm is designed as follows. The source-to-source compiler processes each non-OMP call function using Algorithm 3, which inserts the required data transfer APIs before its serial and parallel domains. Lines 01-02 of Algorithm 3 form a double loop, applying the following steps to each referenced variable in each serial and parallel domain of the control flow graph; line 03 obtains the pointer ptr of the referenced variable Var through static analysis. Lines 04-15 handle the different cases. Lines 04-05 insert an OAODataTrans(ptr, ConOMPR) statement before a parallel domain. Lines 06-11 handle serial domains separated out from function calls in two sub-cases: if the called function is an OMP call function, an OAODataTrans(ptr, ConSEQR) statement is inserted before the serial domain (lines 07-08); otherwise, for function argument variables passed by copy, an OAODataTrans(ptr, ConSEQR) statement is inserted before the serial domain. Line 15 handles the remaining cases, inserting an OAODataTrans(ptr, ConSEQR) statement before the serial domain.
(Algorithm 3 is given as an image in the original publication.)
4.2 State transition API insertion
After a variable is accessed by a serial or parallel domain, its coherency state may change, so the required state transition API must be inserted to update the coherency state saved by the runtime. The state transition API insertion algorithm is designed as follows. The source-to-source compiler processes each non-OMP call function using Algorithm 4 below, inserting the required state transition APIs after its serial and parallel domains. Lines 01-02 of Algorithm 4 form a double loop, applying the following steps to each referenced variable in each serial and parallel domain of the control flow graph; line 03 obtains the pointer ptr of the referenced variable Var through static analysis. Lines 04-16 handle the different cases. If Node is a parallel domain and the RefList corresponding to the variable Var contains W (a write operation), an OAOStTrans(ptr, TrOMPW) statement is inserted after the parallel domain (lines 04-06). If Node is a serial domain separated out from a function call, the called function is an OMP call function, and the RefList corresponding to Var contains W, an OAOStTrans(ptr, TrSEQW) statement is inserted after the serial domain (lines 08-11). Otherwise, if the RefList corresponding to Var contains W, an OAOStTrans(ptr, TrSEQW) statement is inserted after the serial domain.
[Algorithm 4 (state transition API insertion) appears as an image in the original document.]
4.3 parallel instruction translation
The migration framework targets the OpenMP work-sharing parallel mode; the correspondence of this mode's parallel instructions on the CPU and the accelerator is shown in Table 6. The source-to-source compiler translates each OpenMP CPU parallel instruction into an OpenMP Offloading parallel instruction using the following algorithm 5, thereby obtaining the OpenMP Offloading compute kernels. Line 01 of algorithm 5 is a loop applying the following steps to each parallel domain; line 02 obtains the OpenMP CPU parallel instruction through static analysis; lines 03-04 replace it with the corresponding OpenMP Offloading parallel instruction according to Table 6.
TABLE 6 parallel instruction correspondences
[Table 6 appears as an image in the original document.]
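Table 6 itself survives only as an image, but the work-sharing correspondence it describes can be sketched. The mapping below (`parallel for` to `target teams distribute parallel for`) is the canonical OpenMP work-sharing translation; the explicit `map` clauses are an assumption for illustration, since in the OAO design data movement is handled by the runtime API rather than by `map` clauses.

```cpp
#include <cstddef>
#include <cassert>

// Input form: OpenMP CPU work-sharing loop (what the translator consumes).
void saxpy_cpu(float a, const float* x, float* y, std::size_t n) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Plausible translated form: an OpenMP Offloading compute kernel. Without an
// accelerator (or without -fopenmp) the target region simply runs on the host,
// so the numerical result is identical.
void saxpy_offload(float a, const float* x, float* y, std::size_t n) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Both versions compute the same result; only the directive (and hence the execution site) changes, which is exactly why the translation can be done mechanically per Table 6.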
The technical effects of the present invention are explained as follows:
1. source to source translation results example
The OpenMP CPU parallel program shown in FIG. 4 is translated by the OAO automatic transplanting system proposed in this patent, yielding the OpenMP Offloading heterogeneous parallel program shown in FIG. 6. For comparison, an OpenMP Offloading heterogeneous parallel program was written manually from the OpenMP CPU parallel program of FIG. 4, as shown in FIG. 5. The following transfers in the manually written program of FIG. 5 are redundant: the "from" transfer of v1, v2, and v3 on line 05, the "to" transfer of v3 on line 13, and the "from" transfer of v4 on line 13. The automatically translated program of FIG. 6 avoids these redundant operations by using the OAO runtime library. This example shows that the OAO system can successfully translate an OpenMP CPU parallel program into an OpenMP Offloading heterogeneous parallel program.
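The figures themselves are not reproduced here, but the shape of the automatically inserted calls can be sketched in runnable form. The runtime functions and constants below (OAODataTrans, OAOStTrans, ConOMPR, TrOMPW, and friends) are hypothetical stubs that only record the call order; the real OAO runtime consults the coherency state and performs the minimal transfer instead.

```cpp
#include <string>
#include <vector>
#include <cassert>

// Hypothetical stand-ins for the OAO runtime library, so the insertion
// pattern is executable. These stubs merely log which API was called.
static std::vector<std::string> callLog;
enum Constraint { ConOMPR, ConSEQR };
enum Transition { TrOMPW, TrSEQW };
void OAODataTrans(void* /*ptr*/, Constraint c) {
    callLog.push_back(c == ConOMPR ? "DataTrans(ConOMPR)" : "DataTrans(ConSEQR)");
}
void OAOStTrans(void* /*ptr*/, Transition t) {
    callLog.push_back(t == TrOMPW ? "StTrans(TrOMPW)" : "StTrans(TrSEQW)");
}

// Shape of a translated function: the data transfer API is inserted in front
// of the parallel domain, and the state transition API after it (the variable
// v is written inside the kernel).
void scale(double* v, int n) {
    OAODataTrans(v, ConOMPR);          // satisfy coherency constraints first
    #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
    for (int i = 0; i < n; ++i)        // parallel domain (offloaded kernel)
        v[i] *= 2.0;
    OAOStTrans(v, TrOMPW);             // record that the accelerator wrote v
}
```

Because the runtime decides at call time whether a transfer is actually needed, repeated kernels over the same data do not pay for redundant copies, which is the redundancy the manual version of FIG. 5 fails to avoid.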
2. Method evaluation
2.1 Experimental methods
Polybench and Rodinia are benchmark suites commonly used in heterogeneous computing. DawnCC is the most advanced existing source-to-source translator that generates OpenMP Offloading programs. We evaluated the performance of the OAO system on Polybench and Rodinia, using DawnCC as a baseline.
To compare the ability of DawnCC and OAO to optimize interprocedural data transfer, we added a test program, FDTD-2D-FUNC. FDTD-2D-FUNC is based on FDTD-2D from Polybench: each compute kernel in FDTD-2D is wrapped in a subfunction, and the variables the kernel needs are passed through function parameters, thereby constructing interprocedural data transfer.
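The construction can be illustrated with a minimal sketch; the kernel below is an illustrative element-wise update, not the actual FDTD-2D stencil. The point is structural: once the kernel lives in a subfunction and its arrays arrive through parameters, any transfer-placement analysis must reason across the call boundary.

```cpp
#include <cstddef>
#include <cassert>

// FDTD-2D-FUNC style: the compute kernel is a subfunction and its arrays
// arrive through parameters, so data transfer analysis must cross the call.
void update_kernel(double* field, const double* coeff, std::size_t n) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)   // independent per-element update
        field[i] += coeff[i];
}

// The driver calls the kernel repeatedly; a translator that only looks inside
// one function would re-transfer field/coeff on every call, whereas an
// interprocedural optimizer can keep them resident on the accelerator.
void run_steps(double* field, const double* coeff, std::size_t n, int steps) {
    for (int t = 0; t < steps; ++t)
        update_kernel(field, coeff, n);
}
```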
First we generate 4 versions of the program as follows:
OMP version: the OpenMP CPU parallel program;
manual version: an OpenMP Offloading program obtained by manually translating the OMP version;
DawnCC version: an OpenMP Offloading program obtained by translating the OMP version with DawnCC;
OAO version: an OpenMP Offloading program obtained by translating the OMP version with OAO.
The software and hardware of the two experimental platforms are shown in Table 7.
Table 7 experiment software and hardware platform
[Table 7 appears as an image in the original document.]
2.2 Performance evaluation
The speedups of the different OpenMP Offloading versions relative to the OMP version are shown in FIGS. 7 and 8, where an "X" marks a program that DawnCC cannot translate correctly.
It can be seen that OAO handles all 23 test programs, whereas DawnCC handles only 15. The OAO version improves performance on 9 programs on the K40 platform and 15 programs on the 2080Ti platform, with a peak speedup of 32x; moreover, relative to the OMP and manual versions, the OAO version delivers the best performance on all test programs on all platforms.
2.3 data Transmission optimization evaluation
The numbers of data transfers of the different OpenMP Offloading versions are compared in Table 8, where "-" marks a program that DawnCC cannot process correctly. The OAO version clearly achieves the best data transfer optimization on all test programs, i.e., the fewest data transfers.
Comparing FDTD-2D with FDTD-2D-FUNC: both OAO and DawnCC optimize FDTD-2D, but DawnCC fails to optimize FDTD-2D-FUNC well, while OAO reduces the transfer count of both programs to the optimal value of 5. This shows that OAO can optimize interprocedural data transfer, whereas DawnCC cannot.
Table 8 data transmission times comparison
[Table 8 appears as an image in the original document.]
The percentage of transferred data volume saved by the OAO version relative to the manual and DawnCC versions is shown in FIG. 9. The OAO version transfers less data than the manual version on all test programs, especially FDTD-2D-FUNC and FDTD-2D (approximately 100%). Relative to the DawnCC version, the OAO version achieves significant savings in transferred data volume on 6 test programs, especially FDTD-2D-FUNC (close to 100%).
FIG. 10 shows the percentage of transfer time saved by the OAO version relative to the manual and DawnCC versions. The OAO version saves significant transfer time on all test programs relative to both.
The above data and analysis demonstrate that the OAO system can automatically translate an OpenMP CPU parallel program into an OpenMP Offloading heterogeneous parallel program and optimize its data transfers, and that the OAO version achieves a significant performance improvement over the manual version. Compared with DawnCC, the OAO system handles more input programs, performs broader data transfer optimization, and performs better on all test programs.

Claims (5)

1. A heterogeneous parallel program automatic transplanting and optimizing method is characterized in that the method is realized by the following steps:
step 1, constructing a framework of an automatic heterogeneous parallel program transplanting system
The heterogeneous parallel program automatic transplanting system, OAO system for short, automatically translates an OpenMP CPU parallel program into an OpenMP Offloading heterogeneous parallel program and, in combination with a runtime system, automatically manages and optimizes data transfer between the CPU and an accelerator; the OAO system framework mainly comprises a source-to-source translator and a runtime library;
the runtime library contains three types of APIs: coherency state tracking, data transmission, coherency state conversion; the consistency state tracking API captures a variable memory area and initializes a consistency state; the data transmission API dynamically determines and executes transmission operation according to the current consistency state and the consistency constraint which needs to be met, and meanwhile, the consistency state is updated; the state conversion API updates the consistency state according to the read-write operation type;
the source-to-source translator translates the OpenMP CPU parallel code into OpenMP Offloading heterogeneous parallel code and inserts the appropriate runtime APIs; it consists of 3 modules: a data transfer API insertion module, a state transition API insertion module, and a parallel instruction translation module; the parallel instruction translation module translates OpenMP CPU parallel instructions into OpenMP Offloading heterogeneous parallel instructions, obtaining the OpenMP Offloading kernels; the data transfer API insertion module and the state transition API insertion module insert the two corresponding kinds of runtime APIs, respectively;
the OAO system works as follows: the OpenMP CPU parallel code is translated source-to-source into OpenMP Offloading heterogeneous parallel code containing runtime APIs; after compilation, the OpenMP Offloading code runs on the heterogeneous platform, with the OpenMP Offloading kernels running on the accelerator and the rest of the program running on the CPU; through the inserted APIs, the runtime library manages data transfer between the CPU and the accelerator, guarantees data consistency, dynamically optimizes transfers, and reduces redundant data transfer;
step 2, consistency state conversion formalization
for a heterogeneous platform, a variable has copies in both the CPU memory and the accelerator memory, and the coherency state describes the validity of the CPU copy and the accelerator copy of the variable; the simplest state transition function is deduced from the current coherency state and the coherency state constraint to be satisfied, and through the correspondence table this function determines the simplest transfer operation type to be executed, thereby optimizing transfer operations and reducing redundant data transfer while guaranteeing data consistency;
step 3, design of runtime library
The runtime library is used for providing automatic data transmission management and optimization functions, maintaining the consistency state of each variable memory region and giving three types of API functions and corresponding descriptions of the runtime library;
step 4 Source to Source translator design
based on the static analysis facilities of the Clang/LLVM compiler framework, the source-to-source translator translates OpenMP CPU parallel code into OpenMP Offloading heterogeneous parallel code and inserts the APIs at appropriate points, optimizing data transfer while guaranteeing data consistency;
the source-to-source translator first collects information such as serial domains, parallel domains, and variable references through static analysis; it then builds the serial-parallel control flow graph of the program and binds the variable reference information to the corresponding serial or parallel domain; next it performs data transfer API insertion and state transition API insertion at the granularity of serial and parallel domains; finally it translates the OpenMP CPU parallel instructions into OpenMP Offloading parallel instructions.
2. The method for automatically migrating and optimizing the heterogeneous parallel program according to claim 1, wherein in step 2, the implementation process of the consistency state conversion formalization is as follows:
the following definitions are given:
Definition 1. The coherency state is a 3-bit binary number, where Bit0 indicates whether an accelerator copy of the variable exists (Bit0 = 1) or not (Bit0 = 0); Bit1 indicates whether the CPU copy is valid (Bit1 = 1) or not (Bit1 = 0); and Bit2 indicates whether the accelerator copy is valid (Bit2 = 1) or not (Bit2 = 0); all possible coherency states follow from definition 1, as shown in Table 1:
TABLE 1 all possible coherency states
[Table 1 appears as an image in the original document.]
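Definition 1 can be made concrete with bit masks. Table 1's state names survive only as an image; HOST_ONLY is the one name the text confirms (Section 3.1 initializes State to HOST_ONLY, which by Definition 1 plausibly encodes "CPU copy valid, no accelerator copy"), so the other named constant below is an illustrative assumption.

```cpp
#include <cstdint>
#include <cassert>

// Definition 1: a coherency state is a 3-bit binary number.
constexpr std::uint8_t DEV_EXISTS = 0b001; // Bit0: accelerator copy exists
constexpr std::uint8_t HOST_VALID = 0b010; // Bit1: CPU copy is valid
constexpr std::uint8_t DEV_VALID  = 0b100; // Bit2: accelerator copy is valid

// HOST_ONLY is named in the text (initial state of freshly tracked memory);
// SYNCED is an illustrative name for "both copies valid".
constexpr std::uint8_t HOST_ONLY = HOST_VALID;                          // 010
constexpr std::uint8_t SYNCED    = DEV_EXISTS | HOST_VALID | DEV_VALID; // 111

constexpr bool hostCopyValid(std::uint8_t s) { return (s & HOST_VALID) != 0; }
constexpr bool devCopyValid(std::uint8_t s)  { return (s & DEV_VALID) != 0; }
```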
Data transfers between the CPU and the accelerator and read and write operations on the CPU and the accelerator change coherency states, the change in coherency state being represented using a coherency state transfer function, which is defined as follows:
Definition 2. The coherency state transition function TransFunc has the form shown in formula (1), where InVld and Vld are a pair of 3-bit binary numbers, · denotes bitwise AND, and + denotes bitwise OR; if a bit of InVld is set to 0, the corresponding bit of inState is converted to 0; if a bit of Vld is set to 1, the corresponding bit of inState is converted to 1; by setting the two values appropriately, any inState can be converted into any outState; TransFunc is also abbreviated as form (2);
TransFunc(inState)=inState·InVld+Vld (1)
TransFunc={InVld,Vld} (2)
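Reading formula (1)'s · and + as bitwise AND and OR (consistent with "a 0 bit of InVld converts the corresponding bit to 0, a 1 bit of Vld converts it to 1"), the transition function can be sketched directly:

```cpp
#include <cstdint>
#include <cassert>

// Definition 2 / formula (1): outState = (inState AND InVld) OR Vld, bitwise.
struct TransFunc {
    std::uint8_t InVld; // 0-bits force the corresponding state bit to 0
    std::uint8_t Vld;   // 1-bits force the corresponding state bit to 1
    std::uint8_t apply(std::uint8_t inState) const {
        return static_cast<std::uint8_t>(((inState & InVld) | Vld) & 0b111u);
    }
};
```

The identity function is {111, 000}. A function such as {111, 101}, which marks the accelerator copy as existing and valid while leaving the CPU bit alone, is the shape a host-to-device copy would plausibly take among the Table 2 correspondences (Table 2 itself is an image in the original).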
according to definition 2, all possible coherency state transition functions and their corresponding data transfer or read/write operations can be obtained, as shown in table 2:
TABLE 2 all possible coherency state transition operations
[Table 2 appears as an image in the original document.]
According to definition 1 and definition 2, a conversion relationship between various consistency states can be obtained;
to ensure that the program is correct, read and write operations on the CPU and accelerator have certain requirements on coherency states, which are represented using coherency state constraints, which are defined as follows:
Definition 3. The coherency state constraint Constr consists of a pair of 3-bit binary numbers, in the form shown in formula (3); the default value of Constr is {111,000}, indicating no requirement on the coherency state; a bit of ConInVld may be set to 0, or a bit of ConVld may be set to 1, to express different constraint requirements on the coherency state, as shown in Table 3:
Constr={ConInVld,ConVld} (3)
TABLE 3 ConVld and ConInVld meanings
[Table 3 appears as an image in the original document.]
According to definition 3 and table 3, the coherency state constraints required for different read and write operations on the CPU and accelerator are given, as shown in table 4:
table 4 required coherency state constraints
[Table 4 appears as an image in the original document.]
From the current coherency State and the coherency state constraint Constr = {ConInVld, ConVld}, the simplest state transition function can be deduced as follows:
the derivation formula of the simplest state transition function is shown in formula (4):
MinTrFunc(State)=State·InVld+Vld (4)
wherein
[The definitions of InVld and Vld in terms of State and Constr appear as an image in the original document.]
MinTrFunc determines the simplest transfer operation type to be executed according to the first 6 corresponding relations in Table 2.
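The "where" clause of formula (4) survives only as an image, so the following is an editor's reconstruction under a stated assumption: the simplest function should be the identity on every bit that already satisfies the constraint, clearing only bits that are 1 but required to be 0, and setting only bits that are 0 but required to be 1. That gives InVld = ConInVld OR NOT(State) and Vld = ConVld AND NOT(State), masked to 3 bits; treat these definitions as an assumption, not the patent's verbatim formula.

```cpp
#include <cstdint>
#include <cassert>

struct TransFunc { std::uint8_t InVld, Vld; };
struct Constr    { std::uint8_t ConInVld, ConVld; }; // Definition 3

// Assumed reconstruction of formula (4)'s "where" clause: only touch bits
// that actually violate the constraint, so an already-satisfied State maps
// to the identity function {111, 000} (i.e., no transfer is needed).
TransFunc minTrFunc(std::uint8_t state, Constr c) {
    std::uint8_t inVld = (c.ConInVld | static_cast<std::uint8_t>(~state)) & 0b111u;
    std::uint8_t vld   = (c.ConVld   & static_cast<std::uint8_t>(~state)) & 0b111u;
    return {inVld, vld};
}

std::uint8_t apply(TransFunc f, std::uint8_t s) { // formula (1)
    return static_cast<std::uint8_t>(((s & f.InVld) | f.Vld) & 0b111u);
}
```

Under this reconstruction, a kernel that needs a valid accelerator copy turns HOST_ONLY (010) into a copy-to-device function, while a state that already satisfies the constraint yields the identity, which is exactly the redundancy-elimination behavior the text claims.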
3. The method for automatically migrating and optimizing the heterogeneous parallel program according to claim 2, wherein in the step 3, the specific process of runtime library design is as follows:
the runtime library API functions, shown in Table 5, fall into three categories: the coherency state tracking API functions, the data transfer API function OAODataTrans, and the coherency state transition API function OAOStTrans; the first 6 entries in Table 5 are coherency state tracking API functions;
TABLE 5 runtime library API function
[Table 5 appears as an image in the original document.]
3.1 coherency State tracking API
the runtime library uses the variable memory region as the granularity of coherency state tracking and data transfer; a variable memory region in C/C++ is a contiguous memory area, originating from a local variable definition, a global variable definition, a malloc operation, or a new operation; to record and track variable coherency states, the variable memory region and the memory environment are formalized as follows:
Definition 4. The variable memory region MemBlk is a quadruple, as shown in formula (5), where Begin is the starting address of the region, Length is the length of the memory region, ElemSize is the element size, and State is the coherency state;
MemBlk={Begin,Length,ElemSize,State} (5)
definition 5. the memory environment MemEnv is a set of all variable memory regions, as shown in equation (6):
MemEnv={MemBlk1,…,MemBlkn} (6)
the runtime library defines MemEnv as a global variable and maintains it throughout execution of the OpenMP Offloading heterogeneous parallel program; when a variable is referenced, MemEnv can be searched using the corresponding pointer ptr, and the MemBlk satisfying formula (7) is the referenced one:
Begin≤ptr≤Begin+Length-1 (7)
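Formulas (5)-(7) amount to an interval lookup over the set of tracked regions. A minimal sketch (the function name findMemBlk and the use of a vector are illustrative assumptions; only the fields and the containment test come from the definitions):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cassert>

struct MemBlk {            // Definition 4, formula (5)
    const char* Begin;     // starting address of the region
    std::size_t Length;    // region length in bytes
    std::size_t ElemSize;  // element size
    std::uint8_t State;    // coherency state
};

using MemEnv = std::vector<MemBlk>;   // Definition 5, formula (6)

// Formula (7): the referenced MemBlk satisfies Begin <= ptr <= Begin+Length-1.
MemBlk* findMemBlk(MemEnv& env, const void* ptr) {
    const char* p = static_cast<const char*>(ptr);
    for (MemBlk& b : env)
        if (b.Begin <= p && p <= b.Begin + b.Length - 1)
            return &b;
    return nullptr;        // pointer is not inside any tracked region
}
```

Note the lookup works for any interior pointer, not just the region's base address, which is what lets the runtime resolve pointers passed through function parameters.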
during source-to-source translation, the coherency state tracking APIs are inserted into the source code at appropriate points: the OAOSaveArrayInfo function is inserted after a local variable declaration, or at the beginning of the main function, to record the variable information; the OAOMalloc function replaces the malloc function, collecting memory allocation information and performing the allocation; OAONewInfo is inserted after a new operation to collect memory allocation information; these three functions use the collected information to create the corresponding MemBlk and initialize its State to HOST_ONLY;
OAODeleteArrayInfo is inserted at the end of a variable's scope, or at the end of the main function, to delete the corresponding MemBlk; OAOFree replaces the free function, releasing the memory region and deleting the corresponding MemBlk; OAODeleteInfo is inserted after a delete operation, releasing the memory region and deleting the corresponding MemBlk;
when the memory to be allocated is larger than 128KB, cudaMallocHost() is used in place of malloc() to perform the allocation;
3.2 data transfer API
variable read and write operations require the variable to satisfy certain coherency state constraints, so the data transfer API, the OAODataTrans function, must be called before the variable is accessed, executing the simplest data transfer operation that satisfies the corresponding constraint; the OAODataTrans function uses the following algorithm 1 to determine the simplest state transition function needed to satisfy the coherency state constraint Constr, and performs the corresponding simplest data transfer operation while updating the coherency state; lines 1-2 of algorithm 1 find the corresponding variable memory region MemBlk through the variable pointer ptr; line 3 deduces the simplest state transition function MinTrFunc from the coherency constraint Constr to be satisfied and the current coherency State according to formula (4); line 4 executes the data transfer operation corresponding to MinTrFunc in Table 2; line 5 updates State using MinTrFunc according to formula (1);
the algorithm 1 is as follows:
[Algorithm 1 appears as an image in the original document.]
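Algorithm 1 survives only as an image, but combining the pieces the text does describe (region lookup, formula (4), Table 2 dispatch, formula (1) update) gives the following sketch. The transfer itself is stubbed with a counter, and the minTrFunc derivation carries the same reconstruction assumption flagged for formula (4) above.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <cassert>

struct MemBlk { const char* Begin; std::size_t Length; std::uint8_t State; };
struct Constr { std::uint8_t ConInVld, ConVld; };

static std::vector<MemBlk> memEnv;   // Definition 5 (global memory environment)
static int transfersPerformed = 0;   // stub standing in for the real copies

// Sketch of Algorithm 1: lines 1-2 locate the MemBlk via ptr; line 3 derives
// MinTrFunc from State and Constr (reconstruction); line 4 performs the
// Table 2 operation (stubbed as a counter when the function is non-identity);
// line 5 updates State per formula (1).
void OAODataTrans(const void* ptr, Constr c) {
    const char* p = static_cast<const char*>(ptr);
    for (MemBlk& b : memEnv) {
        if (!(b.Begin <= p && p <= b.Begin + b.Length - 1)) continue;
        std::uint8_t inVld = (c.ConInVld | static_cast<std::uint8_t>(~b.State)) & 0b111u;
        std::uint8_t vld   = (c.ConVld   & static_cast<std::uint8_t>(~b.State)) & 0b111u;
        if (inVld != 0b111u || vld != 0b000u)
            ++transfersPerformed;    // non-identity => some Table 2 operation
        b.State = static_cast<std::uint8_t>(((b.State & inVld) | vld) & 0b111u);
        return;
    }
}
```

The second call on already-coherent data derives the identity function and performs nothing, which is the mechanism behind the redundant-transfer elimination measured in Table 8.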
3.3 coherency State transition API
after a variable is accessed, the coherency state transition API, the OAOStTrans function, must be called to update the coherency state stored in the variable memory region MemBlk; the OAOStTrans function completes the coherency state transition using the following algorithm 2; lines 1-2 of algorithm 2 find the corresponding variable memory region MemBlk through the variable pointer ptr; line 3 updates the coherency State contained in MemBlk using the state transition function StTrans according to formula (1);
[Algorithm 2 appears as an image in the original document.]
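Algorithm 2 is a direct application of formula (1) to the stored State; a minimal sketch follows. The concrete encoding of TrOMPW used in the test ({101, 101}: clear the host-valid bit, set device-exists and device-valid after an accelerator write) is a plausible assumption, since Table 2 is an image in the original.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <cassert>

struct MemBlk  { const char* Begin; std::size_t Length; std::uint8_t State; };
struct StTrans { std::uint8_t InVld, Vld; };  // a Table 2 transition function

static std::vector<MemBlk> memEnv;

// Sketch of Algorithm 2: lines 1-2 locate the MemBlk via ptr, line 3 updates
// its State with the transition function per formula (1).
void OAOStTrans(const void* ptr, StTrans t) {
    const char* p = static_cast<const char*>(ptr);
    for (MemBlk& b : memEnv)
        if (b.Begin <= p && p <= b.Begin + b.Length - 1) {
            b.State = static_cast<std::uint8_t>(((b.State & t.InVld) | t.Vld) & 0b111u);
            return;
        }
}
```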
4. the method for automatically migrating and optimizing heterogeneous parallel programs according to claim 3, wherein in the step 4, the specific process of designing the source-to-source translator is as follows:
Definition 6. The serial domain SEQ is a code segment outside the scope of #pragma omp parallel, with no internal branches, executed serially; a SEQ is also called a serial node in the serial-parallel control flow graph;
Definition 7. The parallel domain OMP is a code segment within the scope of #pragma omp parallel, executed in parallel; an OMP is also called a parallel node in the serial-parallel control flow graph;
Definition 8. The serial-parallel control flow graph SPGraph is defined as shown in formula (8); SPGraph is a special control flow graph of a function, whose nodes are serial domains or parallel domains;
[Formula (8) appears as an image in the original document.]
Definition 9. The variable reference list RefList, defined as shown in formula (9), is the list of references to a variable within a serial or parallel domain;
[Formula (9) appears as an image in the original document.]
Definition 10. The variable reference information table NodeVarRef, as shown in formula (10), is the set of all variable reference information in a serial or parallel domain; each serial or parallel domain is bound to its corresponding NodeVarRef,
NodeVarRef={RefList1,…,RefListl} (10)
function calls in a serial or parallel domain require special handling; a function call in a serial domain is separated into an independent serial domain; for a pass-by-value function argument, its RefList = {R}; for a pointer or reference function argument, its RefList = {R} if there is no write operation in the called function, or RefList = {RW} if there is a write operation in the called function;
a function call in a parallel domain is treated as an access to its function arguments; for a pass-by-value function argument, {R} is inserted into its RefList; for a pointer or reference function argument, {R} is inserted at the appropriate position of its RefList if there is no write operation in the called function, or {RW} if there is a write operation in the called function.
5. The method according to claim 4, wherein, based on the program abstract representation of claim 4, the three main modules of the source-to-source translator are designed as follows:
4.1 data transfer API insertion
The data transfer API insertion algorithm is designed as follows: the source-to-source compiler processes each non-OMP-calling function using algorithm 3, inserting the required data transfer APIs in front of its serial and parallel domains; lines 01-02 of algorithm 3 form a double loop, applying the following steps to each referenced variable in each serial and parallel domain of the control flow graph; line 03 obtains the pointer ptr of the referenced variable Var through static analysis; lines 04-15 handle the different cases: lines 04-05 insert an OAODataTrans(ptr, ConOMPR) statement in front of a parallel domain; lines 06-11 handle a serial domain separated from a function call in two cases: if the called function is an OMP-calling function, an OAODataTrans(ptr, ConSEQR) statement is inserted in front of the serial domain (lines 07-08); otherwise an OAODataTrans(ptr, ConSEQR) statement is inserted in front of the serial domain for each pass-by-value function argument variable; line 15 covers the remaining cases, inserting an OAODataTrans(ptr, ConSEQR) statement in front of the serial domain;
[Algorithm 3 appears as an image in the original document.]
4.2 State transition API insertion
The state transition API insertion algorithm is designed as follows: the source-to-source compiler processes each non-OMP-calling function using algorithm 4, inserting the required state transition APIs behind its serial and parallel domains; lines 01-02 of algorithm 4 form a double loop, applying the following steps to each referenced variable in each serial and parallel domain of the control flow graph; line 03 obtains the pointer ptr of the referenced variable Var through static analysis; lines 04-16 handle the different cases: if Node is a parallel domain and the RefList of the variable Var contains W (a write operation), an OAOStTrans(ptr, TrOMPW) statement is inserted after the parallel domain (lines 04-06); if Node is a serial domain separated from a function call, the called function is an OMP-calling function, and the RefList of Var contains W, an OAOStTrans(ptr, TrSEQW) statement is inserted after the serial domain (lines 08-11); otherwise, if the RefList of Var contains W, an OAOStTrans(ptr, TrSEQW) statement is inserted after the serial domain;
[Algorithm 4 appears as an image in the original document.]
4.3 parallel instruction translation
The migration framework targets the OpenMP work-sharing parallel mode; the correspondence of this mode's parallel instructions on the CPU and the accelerator is shown in Table 6; the source-to-source compiler translates each OpenMP CPU parallel instruction into an OpenMP Offloading parallel instruction using the following algorithm 5, thereby obtaining the OpenMP Offloading compute kernels; line 01 of algorithm 5 is a loop, applying the following steps to each parallel domain; line 02 obtains the OpenMP CPU parallel instruction through static analysis; lines 03-04 replace the OpenMP CPU parallel instruction with the corresponding OpenMP Offloading parallel instruction according to Table 6;
TABLE 6 parallel instruction correspondences
[Table 6 appears as an image in the original document.]
CN202010710022.2A 2020-07-22 2020-07-22 Automatic transplanting and optimizing method for heterogeneous parallel programs Pending CN111966397A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010710022.2A CN111966397A (en) 2020-07-22 2020-07-22 Automatic transplanting and optimizing method for heterogeneous parallel programs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710022.2A CN111966397A (en) 2020-07-22 2020-07-22 Automatic transplanting and optimizing method for heterogeneous parallel programs

Publications (1)

Publication Number Publication Date
CN111966397A true CN111966397A (en) 2020-11-20

Family

ID=73364426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710022.2A Pending CN111966397A (en) 2020-07-22 2020-07-22 Automatic transplanting and optimizing method for heterogeneous parallel programs

Country Status (1)

Country Link
CN (1) CN111966397A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114816417A (en) * 2022-04-18 2022-07-29 北京凝思软件股份有限公司 Cross compiling method and device, computing equipment and storage medium

Similar Documents

Publication Publication Date Title
US8316359B2 (en) Application of optimization techniques to intermediate representations for code generation
US9471291B2 (en) Multi-processor code for modification for storage areas
US11243816B2 (en) Program execution on heterogeneous platform
US8533698B2 (en) Optimizing execution of kernels
US8612732B2 (en) Retargetting an application program for execution by a general purpose processor
US7810077B2 (en) Reifying generic types while maintaining migration compatibility
US11900113B2 (en) Data flow processing method and related device
US11593398B2 (en) Language interoperable runtime adaptable data collections
US20150186165A1 (en) Emulating pointers
Horwat Concurrent Smalltalk on the message-driven processor
CN111966397A (en) Automatic transplanting and optimizing method for heterogeneous parallel programs
CN105447285A (en) Method for improving OpenCL hardware execution efficiency
CN113515412A (en) Nonvolatile memory check point generation method and device and electronic equipment
Wang et al. Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading
Swatman et al. Managing heterogeneous device memory using C++ 17 memory resources
Ohno et al. Supporting dynamic data structures in a shared-memory based GPGPU programming framework
Naborskyy et al. Using reversible computation techniques in a parallel optimistic simulation of a multi-processor computing system
US20050251795A1 (en) Method, system, and program for optimizing code
US11762641B2 (en) Allocating variables to computer memory
Cui Directive-Based Data Partitioning and Pipelining and Auto-Tuning for High-Performance GPU Computing
Di Biagio et al. Improved programming of gpu architectures through automated data allocation and loop restructuring
US10802809B2 (en) Predicting physical memory attributes by compiler analysis of code blocks
Abdolrashidi Improving Data-Dependent Parallelism in GPUs Through Programmer-Transparent Architectural Support
Alam et al. A Survey: Software-Managed On-Chip Memories.
Horwat A concurrent smalltalk compiler for the message-driven processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination