CN111966397A - Automatic transplanting and optimizing method for heterogeneous parallel programs - Google Patents


Info

Publication number
CN111966397A
Authority
CN
China
Prior art keywords: parallel, function, state, variable, openmp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010710022.2A
Other languages
Chinese (zh)
Inventors
张伟哲 (Zhang Weizhe)
王法瑞 (Wang Farui)
何慧 (He Hui)
郭浩男 (Guo Haonan)
刘亚维 (Liu Yawei)
张玥 (Zhang Yue)
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202010710022.2A
Publication of CN111966397A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/70: Software maintenance or management
    • G06F 8/76: Adapting program code to run in a different environment; Porting
    • G06F 8/40: Transformation of program code
    • G06F 8/41: Compilation
    • G06F 8/51: Source to source

Abstract

An automatic transplanting and optimizing method for heterogeneous parallel programs, belonging to heterogeneous parallel program development technology. The invention aims to realize automatic transplantation of CPU parallel programs, reduce the workload of developers, and improve program performance, thereby solving the problems of parallel instruction conversion and of data transmission management and optimization. The technical points are as follows: constructing the framework of an automatic heterogeneous parallel program transplanting system, which automatically translates an OpenMP CPU parallel program into an OpenMP Offloading heterogeneous parallel program; formalizing consistency state conversion so that, on the premise of ensuring data consistency, transmission operations are optimized and redundant data transmission is reduced; designing a runtime library that provides automatic data transmission management and optimization and maintains the consistency state of each variable memory region; and designing a source-to-source translator that automatically converts the parallel instructions and automatically inserts the runtime APIs. The method can automatically identify CPU parallel instructions, convert them into accelerator parallel instructions, and improve program performance.

Description

Automatic transplanting and optimizing method for heterogeneous parallel programs
Technical Field
The invention relates to an automatic transplanting and optimizing method for a heterogeneous parallel program, and belongs to the heterogeneous parallel program development technology.
Background
With the great computing-power demands of applications such as artificial intelligence, image processing, multi-physical-field simulation, quantum simulation, and climate simulation, heterogeneous platforms based on various accelerators have replaced the Central Processing Unit (CPU) as the main source of computing power. In the field of high-performance computing, a GPU (Graphics Processing Unit) is mainly used as the accelerator, while on mobile platforms a GPU, DSP (Digital Signal Processor), or FPGA (Field Programmable Gate Array) is mainly used. The accelerator provides great computing power but also brings great challenges to application development and transplantation.
The standard for CPU parallel programming is the OpenMP (Open Multi-Processing) model, while heterogeneous parallel programming requires heterogeneous programming models such as CUDA (Compute Unified Device Architecture), OpenCL (Open Computing Language), OpenACC (Open Accelerators), and OpenMP Offloading (an OpenMP extension). Writing efficient heterogeneous parallel programs often requires developers to understand the characteristics of heterogeneous platforms and master the parallel programming models; even for capable developers, porting CPU parallel programs to heterogeneous platforms is a very time-consuming and error-prone task. Therefore, an automatic migration tool is needed to automatically migrate OpenMP CPU parallel programs to heterogeneous platforms.
The prior art with publication number CN104035781A (CN104035781B) provides a method for rapidly developing heterogeneous parallel programs, covering performance analysis of CPU serial programs and migration to heterogeneous parallel programs: first, performance and algorithm analysis is performed on the CPU serial program to locate its performance bottlenecks and parallelism; then OpenACC precompiled instructions are inserted on the basis of the original code to obtain heterogeneous parallel code executable in a heterogeneous parallel environment; finally the code is compiled and executed according to the specified software and hardware platform parameters, and the need for further optimization is determined from the program running results. This prior art can parallelize existing programs efficiently so that they make full use of the computing capacity of a heterogeneous system, and the method is highly practical and easy to popularize. However, the two major challenges of the transplantation process, namely parallel instruction conversion and data transmission management and optimization, remain unaddressed.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention aims to provide an automatic transplanting and optimizing method for heterogeneous parallel programs, which aims to realize automatic transplanting of CPU parallel programs, reduce the workload of developers and improve the performance of the programs, thereby solving the problems of parallel instruction conversion, data transmission management and optimization.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a heterogeneous parallel program automatic transplanting and optimizing method is realized by the following steps:
step 1, constructing a framework of an automatic heterogeneous parallel program transplanting system
The heterogeneous parallel program automatic transplanting system, called the OAO system for short, automatically translates an OpenMP CPU parallel program into an OpenMP Offloading heterogeneous parallel program and, in combination with a runtime system, automatically manages and optimizes data transmission between the CPU and the accelerator; the OAO system framework mainly comprises a source-to-source translator and a runtime library;
the runtime library contains three types of APIs: coherency state tracking, data transmission, coherency state conversion; a consistency state tracking API captures a variable memory area and initializes a consistency state; the data transmission API dynamically determines and executes transmission operation according to the current consistency state and the consistency constraint which needs to be met, and meanwhile, the consistency state is updated; the state conversion API updates the consistency state according to the read-write operation type;
the source-to-source translator translates the OpenMP CPU parallel code into an OpenMP off heterogeneous parallel code and inserts a proper runtime API; it consists of 3 modules: the system comprises a data transmission API (application programming interface) inserting module, a state conversion API inserting module and a parallel instruction translation module; the parallel instruction translation module translates the OpenMP CPU parallel instruction into an OpenMP offload heterogeneous parallel instruction to obtain an OpenMP offload kernel; the data transmission API inserting module and the state conversion API inserting module are respectively inserted into two corresponding runtime APIs;
the OAO system works as follows: the OpenMP CPU parallel code is translated from source to obtain an OpenMP off-streaming heterogeneous parallel code containing a runtime API, the OpenMP off-streaming code runs on a heterogeneous platform after being compiled, an OpenMP off-streaming kernel runs on an accelerator, and other programs run on a CPU; the runtime library manages data transmission between the CPU and the accelerator through the inserted API, ensures data consistency, dynamically optimizes transmission, and reduces redundant data transmission;
step 2, consistency state conversion formalization
For a heterogeneous platform, a variable has copies in both the CPU memory and the accelerator memory, and the consistency state describes the validity of the CPU copy and the accelerator copy of the variable; the simplest state conversion function is derived from the current consistency state and the consistency state constraint that needs to be met, and the simplest transmission operation type the function must execute is determined through the corresponding relation, so as to optimize transmission operations and reduce redundant data transmission on the premise of ensuring data consistency;
step 3, design of runtime library
The runtime library is used for providing automatic data transmission management and optimization functions, maintaining the consistency state of each variable memory region and giving three types of API functions and corresponding descriptions of the runtime library;
step 4 Source to Source translator design
Based on the static analysis capability of the Clang/LLVM compiling framework, the source-to-source translator translates OpenMP CPU parallel code into OpenMP Offloading heterogeneous parallel code, inserts runtime APIs (application program interfaces) in the appropriate manner, and optimizes data transmission on the premise of ensuring data consistency;
the source-to-source translator collects information such as a serial domain, a parallel domain, variable reference and the like through static analysis; then establishing a serial-parallel control flow diagram of the program and binding variable reference information with a corresponding parallel domain serial domain; then, inserting data transmission API insertion and state conversion API insertion by taking a serial domain and a parallel domain as granularity; and finally translating the OpenMP CPU parallel instruction into an OpenMP Offloading parallel instruction.
The invention has the following beneficial technical effects:
the invention provides an Automatic transplanting and optimizing method (OAO for short) from an OpenMP CPU parallel program to an OpenMP off-streaming heterogeneous parallel program, which can realize an Automatic transplanting system based on the OAO method, and can improve the program performance while reducing the workload of developers. The OAO method can automatically solve two major problems in the transplantation process: parallel instruction conversion, data transmission management and optimization. The method can automatically identify the CPU parallel instruction and convert the CPU parallel instruction into the accelerator parallel instruction. The method can dynamically optimize data transmission in the program execution process by using the runtime system, thereby reducing redundant data transmission and improving the program performance while ensuring data consistency.
For a large number of existing OpenMP CPU parallel programs, developers can directly use an OAO system to realize automatic heterogeneous transplantation, and the performance of the programs is improved by using the computing power of a heterogeneous platform. For new applications, developers can continue to use the familiar OpenMP model and then use the OAO system for automatic heterogeneous migration to achieve program performance improvements.
Drawings
FIG. 1 is a flow diagram of a heterogeneous parallel program automatic migration system (OAO system) framework;
FIG. 2 is a coherency state transition diagram;
FIG. 3 is a series-parallel control flow diagram;
FIG. 4 is a screenshot of an OpenMP CPU parallel program;
FIG. 5 is a screenshot of a manually written OpenMP Offloading heterogeneous parallel program;
FIG. 6 is a screenshot of an OpenMP Offloading heterogeneous parallel program resulting from source-to-source translation;
FIG. 7 is a histogram of speedup ratios of the other versions relative to the OMP version (K40 platform);
FIG. 8 is a histogram of speedup ratios of the other versions relative to the OMP version (2080Ti platform);
FIG. 9 is a histogram of the percentage of transmitted data volume saved by the OAO version relative to the other versions;
FIG. 10 is a histogram of the percentage of transmission time saved by the OAO version relative to the other versions (2080Ti platform).
English and acronyms in fig. 7-10 are well known terms in the art.
Detailed Description
The implementation of the invention is illustrated below with reference to the accompanying figures 1 to 10:
1. heterogeneous parallel program automatic transplanting system framework
The heterogeneous parallel program automatic transplanting system (OAO system for short) automatically translates OpenMP CPU parallel programs into OpenMP Offloading heterogeneous parallel programs and, in combination with the runtime system, automatically manages and optimizes data transmission between the CPU and the accelerator. The OAO system framework is shown in FIG. 1 and consists essentially of two parts (shaded in the figure): the source-to-source translator and the runtime library.
The runtime library provides automatic data transmission management and optimization, maintains the consistency state of each variable memory region, and comprises three types of Application Programming Interfaces (APIs): coherency state tracking, data transfer, and coherency state transition. The coherency state tracking API captures variable memory regions and initializes coherency states. The data transfer API dynamically determines and executes transfer operations according to the current coherency state and the coherency constraint that must be met, updating the coherency state at the same time. The state transition API updates the coherency state according to the type of read or write operation.
The source-to-source translator translates the OpenMP CPU parallel code into OpenMP Offloading heterogeneous parallel code and inserts the appropriate runtime APIs. It consists of three modules: the data transmission API insertion module, the state conversion API insertion module, and the parallel instruction translation module. The parallel instruction translation module translates OpenMP CPU parallel instructions into OpenMP Offloading heterogeneous parallel instructions to obtain the OpenMP Offloading kernels. The data transmission API insertion module and the state conversion API insertion module insert the two corresponding kinds of runtime APIs, respectively.
The OAO system works as follows: the OpenMP CPU parallel code is translated source-to-source to obtain OpenMP Offloading heterogeneous parallel code containing runtime APIs. The OpenMP Offloading code is compiled and then runs on the heterogeneous platform, with the OpenMP Offloading kernels running on the accelerator and the rest of the program running on the CPU. The runtime library manages data transmission between the CPU and the accelerator through the inserted APIs, ensuring data consistency while performing dynamic transmission optimization to reduce redundant data transmission.
2. Coherency state transition formalization
Data transmission management and optimization are the key points in the process of transplanting the OpenMP CPU program to a heterogeneous platform, and the part formalizes variable consistency state, consistency state conversion, consistency state constraint and the like and provides theoretical support for the design of a subsequent runtime library and a source-to-source translator.
For a heterogeneous platform, variables may have copies in both CPU memory and accelerator memory, and a coherency state is used to describe the validity of the CPU copy and accelerator copy of the variable, which is defined as follows:
definition 1. coherency State is a 3bit binary number. Where Bit0 indicates that an accelerator copy exists for the variable is (Bit0 ═ 1)/no (Bit0 ═ 0); bit1 indicates that the CPU copy is (Bit1 ═ 1)/no (Bit1 ═ 0) valid; bit2 indicates that the accelerator copy is (Bit2 ═ 1)/no (Bit2 ═ 0) valid.
TABLE 1 all possible coherency states
(Table 1 is given as an image in the original publication.)
All possible coherency states, per Definition 1, are given in Table 1. Data transfers between the CPU and the accelerator, as well as read and write operations on the CPU and the accelerator, change the coherency state; a coherency state transition function is used to represent these changes, defined as follows:
definition 2. the form of the coherency state transfer function TransFunc is shown in equation (1), where InVld and Vld are a pair of 3bit binary numbers, where the operator is a Boolean operator. If a bit in InVld is set to 0, the corresponding bit in inState can be converted to 0; if a bit in Vld is set to 1, the corresponding bit in inState can be converted to 1; setting different values for both can convert any inState to any outState. TransFunc is also abbreviated as form (2).
Figure BDA0002596206070000052
TransFunc={InVld,Vld} (2)
According to Definition 2, all possible coherency state transition functions and their corresponding data transfer or read-write operations are given, as shown in Table 2.
TABLE 2 all possible coherency state transition operations
(Table 2 is given as an image in the original publication.)
A graph of the transition relationships between the various coherency states is given according to definition 1 and definition 2, as shown in figure 2.
To ensure that the program is correct, read and write operations on the CPU and accelerator have certain requirements on the coherency state, which are expressed using coherency state constraints, which are defined as follows:
definition 3. coherency state constraint Constr consists of a pair of 3bit binary numbers, the form of which is shown in (3). The default value for Constr is 111,000, indicating no requirement for a coherency state. A bit of ConInVld may be set to 0 or a bit of ConInVld may be set to 1 to represent different constraint requirements on the coherency state, as shown in table 3.
Constr={ConInVld,ConVld} (3)
TABLE 3 ConVld and ConInVld meanings
(Table 3 is given as an image in the original publication.)
According to definition 3 and table 3, the coherency state constraints required for different read and write operations on the CPU and accelerator are given, as shown in table 4.
TABLE 4 all possible consistency constraints
(Table 4 is given as an image in the original publication.)
From the current coherency State and the coherency state constraint Constr = {ConInVld, ConVld} that must be satisfied, the simplest state transition function can be derived as follows:
the derivation formula of the simplest state transition function is shown in formula (4).
MinTrFunc(State)=State·InVld+Vld
Wherein
Figure BDA0002596206070000072
The simplest state transition function represents the simplest, non-redundant data transfer operation that must be performed so that, starting from the coherency State, the coherency state constraint Constr is satisfied. By deriving MinTrFunc, the simplest transfer operation type to be executed can be determined from the correspondence in Table 2 (the first six rows). Thus, on the premise of ensuring data consistency, transfer operations are optimized and redundant data transfers are reduced.
3. Runtime library design
The runtime library API functions, shown in Table 5, can be divided into three categories: coherency state tracking (the first six), data transfer (OAODataTrans), and coherency state transition (OAOStTrans); each is described separately below.
TABLE 5 runtime library API functions
(Table 5 is given as an image in the original publication.)
3.1 coherency State tracking API
The runtime library uses the variable memory region as the granularity for coherency state tracking and data transfer. In C/C++, a variable memory region is a contiguous memory region, which may originate from a local variable definition, a global variable definition, a malloc operation, a new operation, and so on. To record and track variable coherency states, the memory regions of variables and the memory environment are formalized as follows:
Definition 4. The variable memory region MemBlk is a quadruple, as shown in equation (5), where Begin is the memory start address, Length the memory region length, ElemSize the element size, and State the coherency state.
MemBlk={Begin,Length,ElemSize,State} (5)
Definition 5. the memory environment MemEnv is a set of all variable memory regions, as shown in equation (6).
MemEnv={MemBlk1,…,MemBlkn} (6)
The runtime library defines MemEnv as a global variable and maintains it throughout the execution of the OpenMP Offloading heterogeneous parallel program. When a variable is referenced, MemEnv can be searched using the corresponding pointer ptr; the MemBlk satisfying equation (7) is the referenced one.
Begin≤ptr≤Begin+Length-1 (7)
The coherency state tracking API is inserted into the source code in the appropriate manner during source-to-source translation. The OAOSaveArrayInfo function is inserted after a local variable declaration, or at the beginning of the main function for global variables, to record variable information. The OAOMalloc function replaces the malloc function, collecting memory allocation information while performing the allocation. OAONewInfo is inserted after a new operation to collect the allocation information. These three functions use the collected information to create a new corresponding MemBlk and initialize its State to HOST_ONLY.
OAODeleteArrayInfo is inserted at the end of a variable's scope, or at the end of the main function for global variables, and deletes the corresponding MemBlk. OAOFree replaces the free function, releasing the memory region and deleting the corresponding MemBlk. OAODeleteInfo is inserted after a delete operation; it deletes the corresponding MemBlk as the memory region is released.
In addition, the runtime is specially optimized for NVIDIA devices: when the memory to be allocated is larger than 128 KB, cudaMallocHost() is used instead of malloc() to allocate the memory.
3.2 data transfer API
As analyzed in Table 4, variable read and write operations require the variable to satisfy certain coherency state constraints; therefore, before a variable is accessed, the data transfer API, i.e. the OAODataTrans function, must be called to perform the simplest data transfer operation that satisfies the corresponding constraint. This section illustrates the data transfer API principle; its insertion method is described in Section 4. The OAODataTrans function uses Algorithm 1 below to determine the simplest state transition function required to satisfy the coherency state constraint Constr and performs the corresponding simplest data transfer operation while updating the coherency state. Lines 1-2 of Algorithm 1 find the corresponding variable memory region MemBlk through the variable pointer ptr; line 3 derives the simplest state transition function MinTrFunc via equation (4) from the coherency constraint Constr that must be met and the current coherency State; line 4 executes the data transfer operation corresponding to MinTrFunc per Table 2; line 5 updates the State using MinTrFunc according to equation (1).
(Algorithm 1 is given as an image in the original publication.)
3.3 coherency State transition API
As analyzed in Table 2 (the last three rows), read and write operations may change the coherency state of a variable; therefore, after a variable is accessed, the coherency state transition API, i.e. the OAOStTrans function, must be called to update the coherency state stored in the variable memory region MemBlk. This section illustrates the coherency state transition API principle; its insertion method is described in Section 4. The OAOStTrans function completes the coherency state transition using Algorithm 2 below. Lines 1-2 of Algorithm 2 find the corresponding variable memory region MemBlk through the variable pointer ptr; line 3 updates the coherency State contained in MemBlk using the state transition function StTrans according to equation (1).
(Algorithm 2 is given as an image in the original publication.)
4 Source to source translator design
The source-to-source translator is mainly based on the static analysis capability of the Clang/LLVM (C/C++ language compiler) framework; it translates OpenMP CPU parallel code into OpenMP Offloading heterogeneous parallel code, inserts the runtime APIs in the appropriate manner, and optimizes data transmission on the premise of ensuring data consistency.
The source-to-source translator collects information such as serial domains (Definition 6), parallel domains (Definition 7), and variable references through static analysis; it then establishes the serial-parallel control flow graph of the program (Definition 8) and binds the variable reference information to the corresponding parallel and serial domains (Definitions 9-10); next, it performs data transmission API insertion and state conversion API insertion with serial and parallel domains as the granularity; finally it translates the OpenMP CPU parallel instructions into OpenMP Offloading parallel instructions.
Definition 6. A serial domain SEQ is a section of code outside the scope of #pragma omp parallel that contains no internal branches and executes serially; a SEQ is also called a serial node in the serial-parallel control flow graph.
Definition 7. A parallel domain OMP is a section of code within the scope of a #pragma omp parallel that executes in parallel; it is also called a parallel node in the serial-parallel control flow graph.
Definition 8. The definition of the serial-parallel control flow graph SPGraph is shown in equation (8); SPGraph is a special control flow graph of a given function whose nodes are serial domains or parallel domains. An example of a serial-parallel control flow graph is shown in FIG. 3.
(Equation (8) is given as an image in the original publication.)
Definition 9. The definition of the variable reference list RefList is shown in equation (9); it is the reference list of one variable within a serial domain or parallel domain.
(Equation (9) is given as an image in the original publication.)
Definition 10. The variable reference information table NodeVarRef is the set of all variable reference lists in a given serial or parallel domain, as shown in equation (10). Each serial or parallel domain is bound to its corresponding NodeVarRef, as shown in FIG. 3.
NodeVarRef={RefList1,…,RefListl} (10)
Function calls in either the serial or parallel domain require special handling. A function call in the serial domain is separated out into its own serial domain. For a function argument passed by copy, its RefList = {R}. For a function argument passed as a pointer or reference, its RefList = {R} if there is no write operation in the called function, or RefList = {RW} if there is a write operation in the called function.
A function call in the parallel domain is treated as an access to its function arguments. For a function argument passed by copy, {R} is inserted into its RefList at the appropriate position. For a function argument passed as a pointer or reference, {R} is inserted into its RefList at the appropriate position if there is no write operation in the called function, or {RW} if there is a write operation in the called function.
Based on the program abstract representation, the design of the three main functions of the source-to-source translator is given.
4.1 data transfer API insertion
As mentioned above, most serial and parallel domains require the needed data transfer API to be inserted before them. However, for functions called within a parallel domain (OMP call functions for short), no runtime API can be inserted, because an OMP call function runs on the accelerator, where the runtime API cannot run. The data consistency of such functions is instead guaranteed by the runtime APIs before and after the function call.
The data transfer API insertion algorithm is designed as follows. The source-to-source compiler processes each non-OMP call function using Algorithm 3, which inserts the required data transfer APIs before its serial and parallel domains. Lines 01-02 of Algorithm 3 form a double loop, applying the following steps to each referenced variable in each serial and parallel domain of the control flow graph; line 03 obtains the pointer ptr of the referenced variable Var through static analysis. Lines 04-15 handle the different cases. Lines 04-05 insert an OAODataTrans(ptr, ConOMPR) statement before a parallel domain. Lines 06-11 handle serial domains separated out from function calls in two sub-cases: if the called function is an OMP call function, an OAODataTrans(ptr, ConSEQR) statement is inserted before the serial domain (lines 07-08); otherwise, for function argument variables passed by copy, an OAODataTrans(ptr, ConSEQR) statement is inserted before the serial domain. Line 15 handles the remaining cases, inserting an OAODataTrans(ptr, ConSEQR) statement before the serial domain.
(Algorithm 3 is given as an image in the original publication.)
4.2 State transition API insertion
After a variable is accessed by a serial or parallel domain, its coherency state may change, so the required state transition API must be inserted to update the coherency state saved by the runtime. The state transition API insertion algorithm is designed as follows. The source-to-source compiler processes each non-OMP call function using Algorithm 4 below, inserting the required state transition APIs after its serial and parallel domains. Lines 01-02 of Algorithm 4 form a double loop, applying the following steps to each referenced variable in each serial and parallel domain of the control flow graph; line 03 obtains the pointer ptr of the referenced variable Var through static analysis. Lines 04-16 handle the different cases. If Node is a parallel domain and the RefList corresponding to the variable Var contains W (a write operation), an OAOStTrans(ptr, TrOMPW) statement is inserted after the parallel domain (lines 04-06). If Node is a serial domain separated out from a function call, the called function is an OMP call function, and the RefList corresponding to Var contains W, an OAOStTrans(ptr, TrSEQW) statement is inserted after the serial domain (lines 08-11). Otherwise, if the RefList corresponding to Var contains W, an OAOStTrans(ptr, TrSEQW) statement is inserted after the serial domain.
[Algorithm 4 (state transition API insertion) appears as an image in the original document.]
4.3 parallel instruction translation
The migration framework targets the OpenMP work-sharing parallel mode; the correspondence of this mode's parallel instructions on the CPU and the accelerator is shown in Table 6. The source-to-source compiler translates each OpenMP CPU parallel instruction into an OpenMP Offloading parallel instruction using the following algorithm 5, thereby obtaining the OpenMP Offloading compute kernels. Line 01 of algorithm 5 is a loop applying the following steps to each parallel domain; line 02 obtains the OpenMP CPU parallel instruction through static analysis; lines 03-04 replace it with the corresponding OpenMP Offloading parallel instruction according to Table 6.
TABLE 6 parallel instruction correspondences
[Table 6 appears as an image in the original document.]
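Table 6 itself survives only as an image, but the work-sharing correspondence it describes can be sketched. The mapping below (`parallel for` to `target teams distribute parallel for`) is the canonical OpenMP work-sharing translation; the explicit `map` clauses are an assumption for illustration, since in the OAO design data movement is handled by the runtime API rather than by `map` clauses.

```cpp
#include <cstddef>
#include <cassert>

// Input form: OpenMP CPU work-sharing loop (what the translator consumes).
void saxpy_cpu(float a, const float* x, float* y, std::size_t n) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Plausible translated form: an OpenMP Offloading compute kernel. Without an
// accelerator (or without -fopenmp) the target region simply runs on the host,
// so the numerical result is identical.
void saxpy_offload(float a, const float* x, float* y, std::size_t n) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Both versions compute the same result; only the directive (and hence the execution site) changes, which is exactly why the translation can be done mechanically per Table 6.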
The technical effects of the present invention are explained as follows:
1. source to source translation results example
The OpenMP CPU parallel program shown in FIG. 4 is translated by the OAO automatic transplanting system proposed in this patent, yielding the OpenMP Offloading heterogeneous parallel program shown in FIG. 6. For comparison, an OpenMP Offloading heterogeneous parallel program was written manually from the OpenMP CPU parallel program of FIG. 4, as shown in FIG. 5. The following transfers in the manually written program of FIG. 5 are redundant: the "from" transfer of v1, v2, and v3 on line 05, the "to" transfer of v3 on line 13, and the "from" transfer of v4 on line 13. The automatically translated program of FIG. 6 avoids these redundant operations by using the OAO runtime library. This example shows that the OAO system can successfully translate an OpenMP CPU parallel program into an OpenMP Offloading heterogeneous parallel program.
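The figures themselves are not reproduced here, but the shape of the automatically inserted calls can be sketched in runnable form. The runtime functions and constants below (OAODataTrans, OAOStTrans, ConOMPR, TrOMPW, and friends) are hypothetical stubs that only record the call order; the real OAO runtime consults the coherency state and performs the minimal transfer instead.

```cpp
#include <string>
#include <vector>
#include <cassert>

// Hypothetical stand-ins for the OAO runtime library, so the insertion
// pattern is executable. These stubs merely log which API was called.
static std::vector<std::string> callLog;
enum Constraint { ConOMPR, ConSEQR };
enum Transition { TrOMPW, TrSEQW };
void OAODataTrans(void* /*ptr*/, Constraint c) {
    callLog.push_back(c == ConOMPR ? "DataTrans(ConOMPR)" : "DataTrans(ConSEQR)");
}
void OAOStTrans(void* /*ptr*/, Transition t) {
    callLog.push_back(t == TrOMPW ? "StTrans(TrOMPW)" : "StTrans(TrSEQW)");
}

// Shape of a translated function: the data transfer API is inserted in front
// of the parallel domain, and the state transition API after it (the variable
// v is written inside the kernel).
void scale(double* v, int n) {
    OAODataTrans(v, ConOMPR);          // satisfy coherency constraints first
    #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
    for (int i = 0; i < n; ++i)        // parallel domain (offloaded kernel)
        v[i] *= 2.0;
    OAOStTrans(v, TrOMPW);             // record that the accelerator wrote v
}
```

Because the runtime decides at call time whether a transfer is actually needed, repeated kernels over the same data do not pay for redundant copies, which is the redundancy the manual version of FIG. 5 fails to avoid.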
2. Method evaluation
2.1 Experimental methods
Polybench and Rodinia are benchmark suites commonly used in heterogeneous computing. DawnCC is the most advanced existing source-to-source translator that generates OpenMP Offloading programs. We evaluated the performance of the OAO system on Polybench and Rodinia, using DawnCC as a baseline.
To compare the ability of DawnCC and OAO to optimize interprocedural data transfer, we added a test program, FDTD-2D-FUNC. FDTD-2D-FUNC is based on FDTD-2D from Polybench: each compute kernel in FDTD-2D is wrapped in a subfunction, and the variables the kernel needs are passed through function parameters, thereby constructing interprocedural data transfer.
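The construction can be illustrated with a minimal sketch; the kernel below is an illustrative element-wise update, not the actual FDTD-2D stencil. The point is structural: once the kernel lives in a subfunction and its arrays arrive through parameters, any transfer-placement analysis must reason across the call boundary.

```cpp
#include <cstddef>
#include <cassert>

// FDTD-2D-FUNC style: the compute kernel is a subfunction and its arrays
// arrive through parameters, so data transfer analysis must cross the call.
void update_kernel(double* field, const double* coeff, std::size_t n) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)   // independent per-element update
        field[i] += coeff[i];
}

// The driver calls the kernel repeatedly; a translator that only looks inside
// one function would re-transfer field/coeff on every call, whereas an
// interprocedural optimizer can keep them resident on the accelerator.
void run_steps(double* field, const double* coeff, std::size_t n, int steps) {
    for (int t = 0; t < steps; ++t)
        update_kernel(field, coeff, n);
}
```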
First we generate 4 versions of the program as follows:
OMP version: the OpenMP CPU parallel program;
manual version: an OpenMP Offloading program obtained by manually translating the OMP version;
DawnCC version: an OpenMP Offloading program obtained by translating the OMP version with DawnCC;
OAO version: an OpenMP Offloading program obtained by translating the OMP version with OAO.
The software and hardware of the two experimental platforms are shown in Table 7.
Table 7 experiment software and hardware platform
[Table 7 appears as an image in the original document.]
2.2 Performance evaluation
The speedups of the different OpenMP Offloading versions relative to the OMP version are shown in FIGS. 7 and 8, where an "X" marks a program that DawnCC cannot translate correctly.
It can be seen that OAO handles all 23 test programs, whereas DawnCC handles only 15. The OAO version improves performance on 9 programs on the K40 platform and 15 programs on the 2080Ti platform, with a peak speedup of 32x; moreover, relative to the OMP and manual versions, the OAO version delivers the best performance on all test programs on all platforms.
2.3 data Transmission optimization evaluation
The numbers of data transfers of the different OpenMP Offloading versions are compared in Table 8, where "-" marks a program that DawnCC cannot process correctly. The OAO version clearly achieves the best data transfer optimization on all test programs, i.e., the fewest data transfers.
Comparing FDTD-2D with FDTD-2D-FUNC: both OAO and DawnCC optimize FDTD-2D, but DawnCC fails to optimize FDTD-2D-FUNC well, while OAO reduces the transfer count of both programs to the optimal value of 5. This shows that OAO can optimize interprocedural data transfer, whereas DawnCC cannot.
Table 8 data transmission times comparison
[Table 8 appears as an image in the original document.]
The percentage of transferred data volume saved by the OAO version relative to the manual and DawnCC versions is shown in FIG. 9. The OAO version transfers less data than the manual version on all test programs, especially FDTD-2D-FUNC and FDTD-2D (approximately 100%). Relative to the DawnCC version, the OAO version achieves significant savings in transferred data volume on 6 test programs, especially FDTD-2D-FUNC (close to 100%).
FIG. 10 shows the percentage of transfer time saved by the OAO version relative to the manual and DawnCC versions. The OAO version saves significant transfer time on all test programs relative to both.
The above data and analysis demonstrate that the OAO system can automatically translate an OpenMP CPU parallel program into an OpenMP Offloading heterogeneous parallel program and optimize its data transfers, and that the OAO version achieves a significant performance improvement over the manual version. Compared with DawnCC, the OAO system handles more input programs, performs broader data transfer optimization, and performs better on all test programs.

Claims (5)

1. A heterogeneous parallel program automatic transplanting and optimizing method is characterized in that the method is realized by the following steps:
step 1, constructing a framework of an automatic heterogeneous parallel program transplanting system
The heterogeneous parallel program automatic transplanting system, OAO system for short, automatically translates an OpenMP CPU parallel program into an OpenMP Offloading heterogeneous parallel program and, in combination with a runtime system, automatically manages and optimizes data transfer between the CPU and an accelerator; the OAO system framework mainly comprises a source-to-source translator and a runtime library;
the runtime library contains three types of APIs: coherency state tracking, data transmission, coherency state conversion; the consistency state tracking API captures a variable memory area and initializes a consistency state; the data transmission API dynamically determines and executes transmission operation according to the current consistency state and the consistency constraint which needs to be met, and meanwhile, the consistency state is updated; the state conversion API updates the consistency state according to the read-write operation type;
the source-to-source translator translates the OpenMP CPU parallel code into OpenMP Offloading heterogeneous parallel code and inserts the appropriate runtime APIs; it consists of 3 modules: a data transfer API insertion module, a state transition API insertion module, and a parallel instruction translation module; the parallel instruction translation module translates OpenMP CPU parallel instructions into OpenMP Offloading heterogeneous parallel instructions, obtaining the OpenMP Offloading kernels; the data transfer API insertion module and the state transition API insertion module insert the two corresponding kinds of runtime APIs, respectively;
the OAO system works as follows: the OpenMP CPU parallel code is translated source-to-source into OpenMP Offloading heterogeneous parallel code containing runtime APIs; after compilation, the OpenMP Offloading code runs on the heterogeneous platform, with the OpenMP Offloading kernels running on the accelerator and the rest of the program running on the CPU; through the inserted APIs, the runtime library manages data transfer between the CPU and the accelerator, guarantees data consistency, dynamically optimizes transfers, and reduces redundant data transfer;
step 2, consistency state conversion formalization
for a heterogeneous platform, a variable has copies in both the CPU memory and the accelerator memory, and the coherency state describes the validity of the CPU copy and the accelerator copy of the variable; the simplest state transition function is deduced from the current coherency state and the coherency state constraint to be satisfied, and through the correspondence table this function determines the simplest transfer operation type to be executed, thereby optimizing transfer operations and reducing redundant data transfer while guaranteeing data consistency;
step 3, design of runtime library
The runtime library is used for providing automatic data transmission management and optimization functions, maintaining the consistency state of each variable memory region and giving three types of API functions and corresponding descriptions of the runtime library;
step 4 Source to Source translator design
based on the static analysis facilities of the Clang/LLVM compiler framework, the source-to-source translator translates OpenMP CPU parallel code into OpenMP Offloading heterogeneous parallel code and inserts the APIs at appropriate points, optimizing data transfer while guaranteeing data consistency;
the source-to-source translator first collects information such as serial domains, parallel domains, and variable references through static analysis; it then builds the serial-parallel control flow graph of the program and binds the variable reference information to the corresponding serial or parallel domain; next it performs data transfer API insertion and state transition API insertion at the granularity of serial and parallel domains; finally it translates the OpenMP CPU parallel instructions into OpenMP Offloading parallel instructions.
2. The method for automatically migrating and optimizing the heterogeneous parallel program according to claim 1, wherein in step 2, the implementation process of the consistency state conversion formalization is as follows:
the following definitions are given:
Definition 1. The coherency state is a 3-bit binary number, where Bit0 indicates whether an accelerator copy of the variable exists (Bit0 = 1) or not (Bit0 = 0); Bit1 indicates whether the CPU copy is valid (Bit1 = 1) or not (Bit1 = 0); and Bit2 indicates whether the accelerator copy is valid (Bit2 = 1) or not (Bit2 = 0); all possible coherency states follow from definition 1, as shown in Table 1:
TABLE 1 all possible coherency states
[Table 1 appears as an image in the original document.]
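Definition 1 can be made concrete with bit masks. Table 1's state names survive only as an image; HOST_ONLY is the one name the text confirms (Section 3.1 initializes State to HOST_ONLY, which by Definition 1 plausibly encodes "CPU copy valid, no accelerator copy"), so the other named constant below is an illustrative assumption.

```cpp
#include <cstdint>
#include <cassert>

// Definition 1: a coherency state is a 3-bit binary number.
constexpr std::uint8_t DEV_EXISTS = 0b001; // Bit0: accelerator copy exists
constexpr std::uint8_t HOST_VALID = 0b010; // Bit1: CPU copy is valid
constexpr std::uint8_t DEV_VALID  = 0b100; // Bit2: accelerator copy is valid

// HOST_ONLY is named in the text (initial state of freshly tracked memory);
// SYNCED is an illustrative name for "both copies valid".
constexpr std::uint8_t HOST_ONLY = HOST_VALID;                          // 010
constexpr std::uint8_t SYNCED    = DEV_EXISTS | HOST_VALID | DEV_VALID; // 111

constexpr bool hostCopyValid(std::uint8_t s) { return (s & HOST_VALID) != 0; }
constexpr bool devCopyValid(std::uint8_t s)  { return (s & DEV_VALID) != 0; }
```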
Data transfers between the CPU and the accelerator and read and write operations on the CPU and the accelerator change coherency states, the change in coherency state being represented using a coherency state transfer function, which is defined as follows:
Definition 2. The coherency state transition function TransFunc has the form shown in formula (1), where InVld and Vld are a pair of 3-bit binary numbers, · denotes bitwise AND, and + denotes bitwise OR; if a bit of InVld is set to 0, the corresponding bit of inState is converted to 0; if a bit of Vld is set to 1, the corresponding bit of inState is converted to 1; by setting the two values appropriately, any inState can be converted into any outState; TransFunc is also abbreviated as form (2);
TransFunc(inState)=inState·InVld+Vld (1)
TransFunc={InVld,Vld} (2)
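Reading formula (1)'s · and + as bitwise AND and OR (consistent with "a 0 bit of InVld converts the corresponding bit to 0, a 1 bit of Vld converts it to 1"), the transition function can be sketched directly:

```cpp
#include <cstdint>
#include <cassert>

// Definition 2 / formula (1): outState = (inState AND InVld) OR Vld, bitwise.
struct TransFunc {
    std::uint8_t InVld; // 0-bits force the corresponding state bit to 0
    std::uint8_t Vld;   // 1-bits force the corresponding state bit to 1
    std::uint8_t apply(std::uint8_t inState) const {
        return static_cast<std::uint8_t>(((inState & InVld) | Vld) & 0b111u);
    }
};
```

The identity function is {111, 000}. A function such as {111, 101}, which marks the accelerator copy as existing and valid while leaving the CPU bit alone, is the shape a host-to-device copy would plausibly take among the Table 2 correspondences (Table 2 itself is an image in the original).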
according to definition 2, all possible coherency state transition functions and their corresponding data transfer or read/write operations can be obtained, as shown in table 2:
TABLE 2 all possible coherency state transition operations
[Table 2 appears as an image in the original document.]
According to definition 1 and definition 2, a conversion relationship between various consistency states can be obtained;
to ensure that the program is correct, read and write operations on the CPU and accelerator have certain requirements on coherency states, which are represented using coherency state constraints, which are defined as follows:
Definition 3. The coherency state constraint Constr consists of a pair of 3-bit binary numbers, in the form shown in formula (3); the default value of Constr is {111,000}, indicating no requirement on the coherency state; a bit of ConInVld may be set to 0, or a bit of ConVld may be set to 1, to express different constraint requirements on the coherency state, as shown in Table 3:
Constr={ConInVld,ConVld} (3)
TABLE 3 ConVld and ConInVld meanings
[Table 3 appears as an image in the original document.]
According to definition 3 and table 3, the coherency state constraints required for different read and write operations on the CPU and accelerator are given, as shown in table 4:
table 4 required coherency state constraints
[Table 4 appears as an image in the original document.]
From the current coherency State and the coherency state constraint Constr = {ConInVld, ConVld}, the simplest state transition function can be deduced as follows:
the derivation formula of the simplest state transition function is shown in formula (4):
MinTrFunc(State)=State·InVld+Vld (4)
wherein
[The definitions of InVld and Vld in terms of State and Constr appear as an image in the original document.]
MinTrFunc determines the simplest transfer operation type to be executed according to the first 6 corresponding relations in Table 2.
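The "where" clause of formula (4) survives only as an image, so the following is an editor's reconstruction under a stated assumption: the simplest function should be the identity on every bit that already satisfies the constraint, clearing only bits that are 1 but required to be 0, and setting only bits that are 0 but required to be 1. That gives InVld = ConInVld OR NOT(State) and Vld = ConVld AND NOT(State), masked to 3 bits; treat these definitions as an assumption, not the patent's verbatim formula.

```cpp
#include <cstdint>
#include <cassert>

struct TransFunc { std::uint8_t InVld, Vld; };
struct Constr    { std::uint8_t ConInVld, ConVld; }; // Definition 3

// Assumed reconstruction of formula (4)'s "where" clause: only touch bits
// that actually violate the constraint, so an already-satisfied State maps
// to the identity function {111, 000} (i.e., no transfer is needed).
TransFunc minTrFunc(std::uint8_t state, Constr c) {
    std::uint8_t inVld = (c.ConInVld | static_cast<std::uint8_t>(~state)) & 0b111u;
    std::uint8_t vld   = (c.ConVld   & static_cast<std::uint8_t>(~state)) & 0b111u;
    return {inVld, vld};
}

std::uint8_t apply(TransFunc f, std::uint8_t s) { // formula (1)
    return static_cast<std::uint8_t>(((s & f.InVld) | f.Vld) & 0b111u);
}
```

Under this reconstruction, a kernel that needs a valid accelerator copy turns HOST_ONLY (010) into a copy-to-device function, while a state that already satisfies the constraint yields the identity, which is exactly the redundancy-elimination behavior the text claims.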
3. The method for automatically migrating and optimizing the heterogeneous parallel program according to claim 2, wherein in the step 3, the specific process of runtime library design is as follows:
the runtime library API functions, shown in Table 5, fall into three categories: the coherency state tracking API functions, the data transfer API function OAODataTrans, and the coherency state transition API function OAOStTrans; the first 6 entries in Table 5 are coherency state tracking API functions;
TABLE 5 runtime library API function
[Table 5 appears as an image in the original document.]
3.1 coherency State tracking API
the runtime library uses the variable memory region as the granularity of coherency state tracking and data transfer; a variable memory region in C/C++ is a contiguous memory area, originating from a local variable definition, a global variable definition, a malloc operation, or a new operation; to record and track variable coherency states, the variable memory region and the memory environment are formalized as follows:
Definition 4. The variable memory region MemBlk is a quadruple, as shown in formula (5), where Begin is the starting address of the region, Length is the length of the memory region, ElemSize is the element size, and State is the coherency state;
MemBlk={Begin,Length,ElemSize,State} (5)
definition 5. the memory environment MemEnv is a set of all variable memory regions, as shown in equation (6):
MemEnv={MemBlk1,…,MemBlkn} (6)
the runtime library defines MemEnv as a global variable and maintains it throughout execution of the OpenMP Offloading heterogeneous parallel program; when a variable is referenced, MemEnv can be searched using the corresponding pointer ptr, and the MemBlk satisfying formula (7) is the referenced one:
Begin≤ptr≤Begin+Length-1 (7)
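Formulas (5)-(7) amount to an interval lookup over the set of tracked regions. A minimal sketch (the function name findMemBlk and the use of a vector are illustrative assumptions; only the fields and the containment test come from the definitions):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <cassert>

struct MemBlk {            // Definition 4, formula (5)
    const char* Begin;     // starting address of the region
    std::size_t Length;    // region length in bytes
    std::size_t ElemSize;  // element size
    std::uint8_t State;    // coherency state
};

using MemEnv = std::vector<MemBlk>;   // Definition 5, formula (6)

// Formula (7): the referenced MemBlk satisfies Begin <= ptr <= Begin+Length-1.
MemBlk* findMemBlk(MemEnv& env, const void* ptr) {
    const char* p = static_cast<const char*>(ptr);
    for (MemBlk& b : env)
        if (b.Begin <= p && p <= b.Begin + b.Length - 1)
            return &b;
    return nullptr;        // pointer is not inside any tracked region
}
```

Note the lookup works for any interior pointer, not just the region's base address, which is what lets the runtime resolve pointers passed through function parameters.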
during source-to-source translation, the coherency state tracking APIs are inserted into the source code at appropriate points: the OAOSaveArrayInfo function is inserted after a local variable declaration, or at the beginning of the main function, to record the variable information; the OAOMalloc function replaces the malloc function, collecting memory allocation information and performing the allocation; OAONewInfo is inserted after a new operation to collect memory allocation information; these three functions use the collected information to create the corresponding MemBlk and initialize its State to HOST_ONLY;
OAODeleteArrayInfo is inserted at the end of a variable's scope, or at the end of the main function, to delete the corresponding MemBlk; OAOFree replaces the free function, releasing the memory region and deleting the corresponding MemBlk; OAODeleteInfo is inserted after a delete operation, releasing the memory region and deleting the corresponding MemBlk;
when the memory to be allocated is larger than 128KB, cudaMallocHost() is used in place of malloc() to perform the allocation;
3.2 data transfer API
variable read and write operations require the variable to satisfy certain coherency state constraints, so the data transfer API, the OAODataTrans function, must be called before the variable is accessed, executing the simplest data transfer operation that satisfies the corresponding constraint; the OAODataTrans function uses the following algorithm 1 to determine the simplest state transition function needed to satisfy the coherency state constraint Constr, and performs the corresponding simplest data transfer operation while updating the coherency state; lines 1-2 of algorithm 1 find the corresponding variable memory region MemBlk through the variable pointer ptr; line 3 deduces the simplest state transition function MinTrFunc from the coherency constraint Constr to be satisfied and the current coherency State according to formula (4); line 4 executes the data transfer operation corresponding to MinTrFunc in Table 2; line 5 updates State using MinTrFunc according to formula (1);
the algorithm 1 is as follows:
[Algorithm 1 appears as an image in the original document.]
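Algorithm 1 survives only as an image, but combining the pieces the text does describe (region lookup, formula (4), Table 2 dispatch, formula (1) update) gives the following sketch. The transfer itself is stubbed with a counter, and the minTrFunc derivation carries the same reconstruction assumption flagged for formula (4) above.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <cassert>

struct MemBlk { const char* Begin; std::size_t Length; std::uint8_t State; };
struct Constr { std::uint8_t ConInVld, ConVld; };

static std::vector<MemBlk> memEnv;   // Definition 5 (global memory environment)
static int transfersPerformed = 0;   // stub standing in for the real copies

// Sketch of Algorithm 1: lines 1-2 locate the MemBlk via ptr; line 3 derives
// MinTrFunc from State and Constr (reconstruction); line 4 performs the
// Table 2 operation (stubbed as a counter when the function is non-identity);
// line 5 updates State per formula (1).
void OAODataTrans(const void* ptr, Constr c) {
    const char* p = static_cast<const char*>(ptr);
    for (MemBlk& b : memEnv) {
        if (!(b.Begin <= p && p <= b.Begin + b.Length - 1)) continue;
        std::uint8_t inVld = (c.ConInVld | static_cast<std::uint8_t>(~b.State)) & 0b111u;
        std::uint8_t vld   = (c.ConVld   & static_cast<std::uint8_t>(~b.State)) & 0b111u;
        if (inVld != 0b111u || vld != 0b000u)
            ++transfersPerformed;    // non-identity => some Table 2 operation
        b.State = static_cast<std::uint8_t>(((b.State & inVld) | vld) & 0b111u);
        return;
    }
}
```

The second call on already-coherent data derives the identity function and performs nothing, which is the mechanism behind the redundant-transfer elimination measured in Table 8.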
3.3 coherency State transition API
after a variable is accessed, the coherency state transition API, the OAOStTrans function, must be called to update the coherency state stored in the variable memory region MemBlk; the OAOStTrans function completes the coherency state transition using the following algorithm 2; lines 1-2 of algorithm 2 find the corresponding variable memory region MemBlk through the variable pointer ptr; line 3 updates the coherency State contained in MemBlk using the state transition function StTrans according to formula (1);
[Algorithm 2 appears as an image in the original document.]
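Algorithm 2 is a direct application of formula (1) to the stored State; a minimal sketch follows. The concrete encoding of TrOMPW used in the test ({101, 101}: clear the host-valid bit, set device-exists and device-valid after an accelerator write) is a plausible assumption, since Table 2 is an image in the original.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>
#include <cassert>

struct MemBlk  { const char* Begin; std::size_t Length; std::uint8_t State; };
struct StTrans { std::uint8_t InVld, Vld; };  // a Table 2 transition function

static std::vector<MemBlk> memEnv;

// Sketch of Algorithm 2: lines 1-2 locate the MemBlk via ptr, line 3 updates
// its State with the transition function per formula (1).
void OAOStTrans(const void* ptr, StTrans t) {
    const char* p = static_cast<const char*>(ptr);
    for (MemBlk& b : memEnv)
        if (b.Begin <= p && p <= b.Begin + b.Length - 1) {
            b.State = static_cast<std::uint8_t>(((b.State & t.InVld) | t.Vld) & 0b111u);
            return;
        }
}
```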
4. the method for automatically migrating and optimizing heterogeneous parallel programs according to claim 3, wherein in the step 4, the specific process of designing the source-to-source translator is as follows:
Definition 6. The serial domain SEQ is a code segment outside the scope of #pragma omp parallel, with no internal branches, executed serially; a SEQ is also called a serial node in the serial-parallel control flow graph;
Definition 7. The parallel domain OMP is a code segment within the scope of #pragma omp parallel, executed in parallel; an OMP is also called a parallel node in the serial-parallel control flow graph;
Definition 8. The serial-parallel control flow graph SPGraph is defined as shown in formula (8); SPGraph is a special control flow graph of a function, whose nodes are serial domains or parallel domains;
[Formula (8) appears as an image in the original document.]
Definition 9. The variable reference list RefList, defined as shown in formula (9), is the list of references to a variable within a serial or parallel domain;
[Formula (9) appears as an image in the original document.]
Definition 10. The variable reference information table NodeVarRef, as shown in formula (10), is the set of all variable reference information in a serial or parallel domain; each serial or parallel domain is bound to its corresponding NodeVarRef,
NodeVarRef={RefList1,…,RefListl} (10)
function calls in a serial or parallel domain require special handling; a function call in a serial domain is separated into an independent serial domain; for a pass-by-value function argument, its RefList = {R}; for a pointer or reference function argument, its RefList = {R} if there is no write operation in the called function, or RefList = {RW} if there is a write operation in the called function;
a function call in a parallel domain is treated as an access to its function arguments; for a pass-by-value function argument, {R} is inserted into its RefList; for a pointer or reference function argument, {R} is inserted at the appropriate position of its RefList if there is no write operation in the called function, or {RW} if there is a write operation in the called function.
5. The method according to claim 4, wherein, based on the program abstract representation of claim 4, the three main modules of the source-to-source translator are designed as follows:
4.1 data transfer API insertion
The data transfer API insertion algorithm is designed as follows: the source-to-source compiler processes each non-OMP-calling function using algorithm 3, inserting the required data transfer APIs in front of its serial and parallel domains; lines 01-02 of algorithm 3 form a double loop, applying the following steps to each referenced variable in each serial and parallel domain of the control flow graph; line 03 obtains the pointer ptr of the referenced variable Var through static analysis; lines 04-15 handle the different cases: lines 04-05 insert an OAODataTrans(ptr, ConOMPR) statement in front of a parallel domain; lines 06-11 handle a serial domain separated from a function call in two cases: if the called function is an OMP-calling function, an OAODataTrans(ptr, ConSEQR) statement is inserted in front of the serial domain (lines 07-08); otherwise an OAODataTrans(ptr, ConSEQR) statement is inserted in front of the serial domain for each pass-by-value function argument variable; line 15 covers the remaining cases, inserting an OAODataTrans(ptr, ConSEQR) statement in front of the serial domain;
[Algorithm 3 appears as an image in the original document.]
4.2 State transition API insertion
The state transition API insertion algorithm is designed as follows: the source-to-source compiler processes each non-OMP-calling function using algorithm 4, inserting the required state transition APIs behind its serial and parallel domains; lines 01-02 of algorithm 4 form a double loop, applying the following steps to each referenced variable in each serial and parallel domain of the control flow graph; line 03 obtains the pointer ptr of the referenced variable Var through static analysis; lines 04-16 handle the different cases: if Node is a parallel domain and the RefList of the variable Var contains W (a write operation), an OAOStTrans(ptr, TrOMPW) statement is inserted after the parallel domain (lines 04-06); if Node is a serial domain separated from a function call, the called function is an OMP-calling function, and the RefList of Var contains W, an OAOStTrans(ptr, TrSEQW) statement is inserted after the serial domain (lines 08-11); otherwise, if the RefList of Var contains W, an OAOStTrans(ptr, TrSEQW) statement is inserted after the serial domain;
[Algorithm 4 appears as an image in the original document.]
4.3 parallel instruction translation
The migration framework targets the OpenMP work-sharing parallel mode; the correspondence of this mode's parallel instructions on the CPU and the accelerator is shown in Table 6; the source-to-source compiler translates each OpenMP CPU parallel instruction into an OpenMP Offloading parallel instruction using the following algorithm 5, thereby obtaining the OpenMP Offloading compute kernels; line 01 of algorithm 5 is a loop, applying the following steps to each parallel domain; line 02 obtains the OpenMP CPU parallel instruction through static analysis; lines 03-04 replace the OpenMP CPU parallel instruction with the corresponding OpenMP Offloading parallel instruction according to Table 6;
TABLE 6 parallel instruction correspondences
[Table 6 appears as an image in the original document.]
CN202010710022.2A 2020-07-22 2020-07-22 Automatic transplanting and optimizing method for heterogeneous parallel programs Pending CN111966397A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010710022.2A CN111966397A (en) 2020-07-22 2020-07-22 Automatic transplanting and optimizing method for heterogeneous parallel programs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710022.2A CN111966397A (en) 2020-07-22 2020-07-22 Automatic transplanting and optimizing method for heterogeneous parallel programs

Publications (1)

Publication Number Publication Date
CN111966397A true CN111966397A (en) 2020-11-20

Family

ID=73364426

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710022.2A Pending CN111966397A (en) 2020-07-22 2020-07-22 Automatic transplanting and optimizing method for heterogeneous parallel programs

Country Status (1)

Country Link
CN (1) CN111966397A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114816417A (en) * 2022-04-18 2022-07-29 北京凝思软件股份有限公司 Cross compiling method and device, computing equipment and storage medium

Similar Documents

Publication Publication Date Title
US8316359B2 (en) Application of optimization techniques to intermediate representations for code generation
US9471291B2 (en) Multi-processor code for modification for storage areas
US11243816B2 (en) Program execution on heterogeneous platform
US8533698B2 (en) Optimizing execution of kernels
US8612732B2 (en) Retargetting an application program for execution by a general purpose processor
US7810077B2 (en) Reifying generic types while maintaining migration compatibility
US11900113B2 (en) Data flow processing method and related device
US11593398B2 (en) Language interoperable runtime adaptable data collections
US20150186165A1 (en) Emulating pointers
Horwat Concurrent Smalltalk on the message-driven processor
CN111966397A (en) Automatic transplanting and optimizing method for heterogeneous parallel programs
CN105447285A (en) Method for improving OpenCL hardware execution efficiency
CN113515412A (en) Nonvolatile memory check point generation method and device and electronic equipment
Wang et al. Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading
Swatman et al. Managing heterogeneous device memory using C++ 17 memory resources
Ohno et al. Supporting dynamic data structures in a shared-memory based GPGPU programming framework
Naborskyy et al. Using reversible computation techniques in a parallel optimistic simulation of a multi-processor computing system
US20050251795A1 (en) Method, system, and program for optimizing code
US11762641B2 (en) Allocating variables to computer memory
Cui Directive-Based Data Partitioning and Pipelining and Auto-Tuning for High-Performance GPU Computing
Di Biagio et al. Improved programming of gpu architectures through automated data allocation and loop restructuring
US10802809B2 (en) Predicting physical memory attributes by compiler analysis of code blocks
Abdolrashidi Improving Data-Dependent Parallelism in GPUs Through Programmer-Transparent Architectural Support
Alam et al. A Survey: Software-Managed On-Chip Memories.
Horwat A concurrent smalltalk compiler for the message-driven processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination